proactive sounds cool, but being reactive is just easier.

Recently I’ve been involved in discussions about how new IT monitoring tools will make IT support teams smarter and far more proactive. By smarter I mean having a greater understanding of IT health, and by proactive I mean being aware of situations before or as they occur.

I’d argue that becoming smarter is a prerequisite to becoming proactive. Monitoring for issues is so much easier when you know what you are looking for and understand the ramifications. The best way for IT support to become smarter is to hire the smartest, most experienced people. Becoming proactive is not so straightforward.

The idea that tools will turn a reactive, crisis-driven IT operations team into a proactive one is nonsense. For decades monitoring tools have been able to set policies that forewarn of events and give support staff a heads-up on potential issues. The reasons this capability has not delivered on its promise are numerous: events that are ‘potential issues’ or ‘warnings’ are rarely classified as high priority items, support staff don’t notice them (or ignore them), or the method of event delivery is the wrong one. It has had little to do with the monitoring tools. The reality is that most IT organizations are not measured on outage avoidance but on fixing issues once outages occur.

It’s easier to be the hero who got the order processing application back up than the person who says they helped avoid the problem occurring in the first place (“you did what?” “oh sure you did, well done – help yourself to a medal”). If an organization wants to be proactive then it needs to have people given goals and measured on finding issues before they become problems. Security officers actively monitor and analyze data to proactively identify anomalies, irregular activity and behaviors, and monitor events to stop hackers, cyber attacks, viruses etc. Apparently, it is not acceptable to wait for security problems to occur before they get addressed. For IT support to do this will require a number of changes, including:

  1. an organization measured against outage avoidance.
  2. information delivered in ways that the support team will take notice of.
  3. information that means something and is actionable.

an organization measured against outage avoidance. An IT organization that prides itself on being proactive but measures itself against MTTR or MTBF is not fully proactive. The speed at which IT operations responds to and fixes an issue is not a good measure of proactive efficiency unless it also factors in how quickly the issue was detected in the first place. IT operations effectiveness would have greater relevance if it were tied to outage avoidance. This type of metric is not easy to capture using monitoring tool reporting (too many sources, limited business impact assessment), so it requires a way to immediately consolidate, log and track the process from identification to remediation. The easiest way to do this is to use a service desk. This information would demonstrate how IT operations provides value, while showing increases in IT operational efficiency.
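As a rough illustration of what an outage avoidance measure could look like, here is a minimal Python sketch. It assumes the service desk can export one record per issue with three timestamps. The `Ticket` class, its field names and the sample data are all hypothetical and not taken from any particular service desk product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Ticket:
    """One service desk record. All field names are hypothetical."""
    detected_at: datetime   # when IT operations first owned the issue
    resolved_at: datetime   # when remediation was completed
    outage_started_at: Optional[datetime] = None  # None means no outage occurred


def outage_avoidance_rate(tickets: List[Ticket]) -> float:
    """Fraction of tickets that were remediated without any outage occurring."""
    if not tickets:
        return 0.0
    avoided = sum(1 for t in tickets if t.outage_started_at is None)
    return avoided / len(tickets)


def mean_detection_lag(tickets: List[Ticket]) -> timedelta:
    """Average delay between outage start and detection, for tickets that had an outage."""
    lags = [t.detected_at - t.outage_started_at
            for t in tickets if t.outage_started_at is not None]
    if not lags:
        return timedelta(0)
    return sum(lags, timedelta(0)) / len(lags)


# Made-up sample data: two issues caught before any outage, two detected late.
t0 = datetime(2024, 1, 1, 9, 0)
tickets = [
    Ticket(detected_at=t0, resolved_at=t0 + timedelta(hours=1)),
    Ticket(detected_at=t0 + timedelta(minutes=20),
           resolved_at=t0 + timedelta(minutes=50), outage_started_at=t0),
    Ticket(detected_at=t0 + timedelta(minutes=40),
           resolved_at=t0 + timedelta(hours=2), outage_started_at=t0),
    Ticket(detected_at=t0, resolved_at=t0 + timedelta(minutes=30)),
]

avoidance = outage_avoidance_rate(tickets)
detection_lag = mean_detection_lag(tickets)
print(f"Outage avoidance rate: {avoidance:.0%}")
print(f"Mean detection lag: {detection_lag}")
```

Reporting the detection lag next to the avoidance rate keeps the point above visible: a fast fix says little if nobody noticed the issue for the first half hour.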

information delivered in ways that the support team will take notice of. IT organizations invest a lot of time and effort trying to detect and process events, but few put the same effort into ensuring events are immediately delivered to the right IT personnel. A proactive state dictates that event data is delivered and owned as soon as it is detected. This means the mechanism chosen to deliver the data is as important as the effort associated with collecting the information in the first place. Most IT organizations still rely on event management tool consoles; however, an unwatched console will result in missed events. Sending events to mobile devices (e.g. in the form of an IM) and/or using alert notification tools can reduce the time it takes to become event-aware. Alert notification tools support a proactive objective by automating the delivery of alerts to the appropriate IT operations personnel through the most effective communications channel, in line with established escalation and outage procedures. They also provide the mechanism for an event to be delivered, acknowledged and owned.
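To make "delivered, acknowledged and owned" concrete, here is a minimal Python sketch of an escalation chain. It is not how any specific alert notification tool works. The `Contact`, `Alert` and `deliver` names are hypothetical, and `notify` is a stand-in for whatever real delivery mechanism (IM, SMS, phone) an organization uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Contact:
    name: str
    channel: str  # e.g. "im", "sms", "phone"; the labels are illustrative only


@dataclass
class Alert:
    summary: str
    owner: Optional[str] = None                        # set once someone acknowledges
    attempts: List[str] = field(default_factory=list)  # audit trail of deliveries


def deliver(alert: Alert,
            escalation_chain: List[Contact],
            notify: Callable[[Contact, Alert], bool]) -> Alert:
    """Walk the escalation chain in order until someone acknowledges the alert.

    notify(contact, alert) is a placeholder for a real delivery call and
    returns True when the contact acknowledges.
    """
    for contact in escalation_chain:
        alert.attempts.append(f"{contact.name} via {contact.channel}")
        if notify(contact, alert):
            alert.owner = contact.name
            break
    return alert


# Hypothetical chain and a fake notifier in which only the team lead responds.
chain = [
    Contact("on-call engineer", "im"),
    Contact("team lead", "sms"),
    Contact("duty manager", "phone"),
]


def fake_notify(contact: Contact, alert: Alert) -> bool:
    return contact.name == "team lead"


alert = deliver(Alert("Order processing queue depth rising"), chain, fake_notify)
print(alert.owner)
print(alert.attempts)
```

If nobody acknowledges, `owner` stays `None`, which is exactly the unowned event that an unwatched console produces silently.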

information that means something and is actionable. If you are not actively looking for something, it’s unlikely you’ll find it. That is a blindingly obvious statement, but when monitors are used in IT operations they are typically used to aid root-cause analysis on known, reported issues, where support knows there’s an issue and understands the sort of thing they need to look for. However, when there is no obvious problem it takes skill and experience to scroll through long lists of technical event data and identify the most critical, business-impacting issues. Knowing how things relate to the bigger picture requires the skill to assess the overall impact of multiple unassociated events, and that means taking the yellow ones as seriously as the red ones. This approach is the new way IT support must work: looking for subtle changes and behaviours in the IT infrastructure, applications and IT consumers, analyzing potential impacts and executing a plan to remediate the issue before it affects the business. It demands dedicating support personnel to IT analysis, rather than having them look at monitoring consoles only when they have time or when complaints from IT consumers motivate them to do so.
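One simple way to picture "taking the yellow ones as seriously as the red ones" is to group events by the business service they touch and flag any service where warnings arrive from several independent sources. The Python sketch below does only that. The event tuples, the service names and the `min_sources` threshold are assumptions made up for illustration, and real analysis would need far more context than a count.

```python
from collections import defaultdict

# Hypothetical events: (source tool, business service, severity)
events = [
    ("network monitor", "order processing", "warning"),
    ("database monitor", "order processing", "warning"),
    ("app monitor", "order processing", "warning"),
    ("storage monitor", "payroll", "critical"),
    ("network monitor", "intranet", "warning"),
]


def services_needing_analysis(events, min_sources=2):
    """Return services with a critical event, or with warnings from several sources.

    min_sources is an arbitrary illustrative threshold, not a recommendation.
    """
    warning_sources = defaultdict(set)
    critical = set()
    for source, service, severity in events:
        if severity == "critical":
            critical.add(service)
        elif severity == "warning":
            warning_sources[service].add(source)
    clustered = {service for service, sources in warning_sources.items()
                 if len(sources) >= min_sources}
    return sorted(critical | clustered)


flagged = services_needing_analysis(events)
print(flagged)
```

Here order processing is flagged without a single red event, because three unrelated tools are all raising warnings about it at once.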

If you ignore the price to the business, being reactive doesn’t cost a thing.

2 thoughts on “proactive sounds cool, but being reactive is just easier.”

  1. Great post! You are really hitting on something deep and meaningful here. For so long monitoring tools have promised and never delivered proactive operations. And you are so right when you point to culture and incentives as the main issues. When you look at new DevOps shops, they are directly tying success to business critical measurements like application uptime, user experience, etc. Delivering top-notch customer value and experience should be central to IT, but it rarely is.

    I do think you missed one point, though. A big part of the problem is the way most IT shops process metrics. First, instead of looking at trends (see Etsy for the premier example of that), they look for events, and ALWAYS get overwhelmed with false positives. Trends are better at indicating problems than spending lots of time trying to identify problems to watch. Second, it is the unknowns that kill you, not the knowns. Monitoring systems need to improve at anomaly detection, and not force you to pre-identify what to watch.

  2. This is a great post. I work for ExtraHop, a monitoring vendor. But for people to buy our solution and think that their IT operations are going to improve magically is wrong. Our solution, which analyzes wire data, helps IT organizations make those cultural, organizational, and procedural changes that are required for more mature IT operations. Gartner’s Infrastructure and Operations Maturity Model says there are three legs to the stool: people, technology, and process. Your post does a great job talking about the people and process parts.
