Recently I’ve been involved in discussions about how new IT monitoring tools will make IT support teams smarter and far more proactive. By smarter I mean having a greater understanding of IT health and by proactive I mean being aware of situations before or as they occur.
I’d argue that becoming smarter is a prerequisite to becoming proactive. Monitoring for issues is so much easier when you know what you are looking for and understand the ramifications. The best way for IT support to become smarter is to hire the smartest, most experienced people. Becoming proactive is not so straightforward.
The idea that tools will turn a reactive, crisis-driven IT operations team into a proactive one is nonsense. For decades, monitoring tools have been able to set policies that forewarn of events and give support staff a heads-up on potential issues. There are numerous reasons this capability has not delivered on its promise: events that are ‘potential issues’ or ‘warnings’ are rarely classified as high-priority items, support staff do not notice them (or ignore them), or the method of event delivery is the wrong one. It has had little to do with the monitoring tools. The reality is that most IT organizations are measured not on outage avoidance but on fixing issues once outages occur.
It’s easier to be the hero who got the order processing application back up than the person who says they helped avoid the problem in the first place (“you did what?” “oh sure you did, well done – help yourself to a medal”). If an organization wants to be proactive, it needs people who are given goals for, and measured on, finding issues before they become problems. Security officers actively monitor and analyze data to identify anomalies and irregular activity and behavior in monitored events, so they can stop hackers, cyber attacks, viruses and the like. Apparently, it is not acceptable to wait for security problems to occur before they get addressed. For IT support to do the same will require a number of changes, including:
- an organization measured against outage avoidance.
- information delivered in ways that the support team will take notice of.
- information that means something and is actionable.
an organization measured against outage avoidance. An IT organization that prides itself on being proactive but measures itself against MTTR or MTBF is not fully proactive. The speed with which IT operations responds to and fixes an issue is not a good measure of proactive efficiency unless it also factors in how quickly the issue was detected in the first place. IT operations effectiveness would have greater relevance if it were tied to outage avoidance. This type of metric is not easy to capture using monitoring tool reporting (too many sources, limited business impact assessment), so it requires a way to immediately consolidate, log and track the identification-to-remediation process. The easiest way to do this is to use a service desk. This information would demonstrate how IT operations provides value, while showing increases in IT operational efficiency.
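As a rough illustration of the kind of metric described above, here is a minimal sketch of how outage avoidance and detection speed might be summarized from service desk records. The `Incident` fields and the `summarize` function are hypothetical names invented for this example, not the export format or API of any particular service desk product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One service desk record (hypothetical fields, for illustration only)."""
    started: datetime      # when the condition first appeared, e.g. the first warning event
    detected: datetime     # when IT operations took ownership of the issue
    remediated: datetime   # when the fix was completed
    caused_outage: bool    # True if the business saw an outage


def summarize(incidents):
    """Return the outage avoidance rate plus mean detection and remediation times."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    n = len(incidents)
    avoided = sum(1 for i in incidents if not i.caused_outage)
    # Sum the timedeltas (starting from a zero timedelta), then average.
    mean_detect = sum((i.detected - i.started for i in incidents), timedelta()) / n
    mean_fix = sum((i.remediated - i.detected for i in incidents), timedelta()) / n
    return {
        "avoidance_rate": avoided / n,
        "mean_time_to_detect": mean_detect,
        "mean_time_to_remediate": mean_fix,
    }


t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40), False),
    Incident(t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=90), True),
]
report = summarize(sample)
```

Reporting the avoidance rate alongside time to detect keeps the focus on issues caught before they became outages, rather than only on how fast outages were fixed.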
information delivered in ways that the support team will take notice of. IT organizations invest a lot of time and effort trying to detect and process events, but few put the same effort into ensuring events are immediately delivered to the right IT personnel. A proactive state dictates that event data is delivered and owned as soon as it is detected. This means the mechanism chosen to deliver the data is as important as the effort associated with collecting the information in the first place. Most IT organizations still rely on event management tool consoles; however, an unwatched console will result in missed events. Sending events to mobile devices (e.g. as an IM) and/or using alert notification tools can reduce the time it takes to become event-aware. Alert notification tools support a proactive objective by automating the delivery of alerts to the appropriate IT operations personnel through the most effective communications channel, in line with established escalation and outage procedures. They also provide the mechanism for an event to be delivered, acknowledged and owned.
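The deliver, acknowledge and own cycle can be sketched in a few lines. This is only an illustration under stated assumptions: the escalation chain, the channel names and the `notify_next` and `acknowledge` helpers are all hypothetical, and a real alert notification tool would add timers, on-call schedules and actual message delivery.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical escalation chain: (contact, channel) pairs, tried in order.
ESCALATION_CHAIN = [
    ("on-call engineer", "im"),
    ("team lead", "sms"),
    ("operations manager", "phone"),
]


@dataclass
class Alert:
    message: str
    severity: str
    owner: Optional[str] = None                    # set once someone acknowledges
    notified: list = field(default_factory=list)   # (contact, channel) pairs already tried


def notify_next(alert, chain=ESCALATION_CHAIN):
    """Record a notification to the next contact in the chain.

    Returns the (contact, channel) pair, or None if the alert is already
    owned or the chain is exhausted. Meant to be called each time an
    acknowledgement timeout expires.
    """
    if alert.owner is not None or len(alert.notified) >= len(chain):
        return None
    step = chain[len(alert.notified)]
    alert.notified.append(step)
    return step


def acknowledge(alert, person):
    """Mark the alert as owned, which stops further escalation."""
    alert.owner = person


alert = Alert("disk 90% full on order-processing host", "warning")
first = notify_next(alert)     # delivered to the first contact
second = notify_next(alert)    # no acknowledgement yet, so escalate
acknowledge(alert, "team lead")
third = notify_next(alert)     # owned, so nothing further is sent
```

The point of the sketch is that a warning is treated like any other event: it keeps moving up the chain until someone owns it, instead of sitting on an unwatched console.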
information that means something and is actionable. If you are not actively looking for something, it’s unlikely you’ll find it. A blindingly obvious statement, but when monitors are used in IT operations, they are typically used to aid root-cause analysis on known, reported issues, where support knows there’s an issue and understands the sort of thing they need to look for. However, when there is no obvious problem, it takes skill and experience to scroll through long lists of technical event data and identify the most critical, business-impacting issues. Knowing how things relate to the bigger picture requires the skill to assess the overall impact of multiple, seemingly unassociated events, and that means taking the yellow ones as seriously as the red ones. This is the new way IT support must work: looking for subtle changes and behaviors in the IT infrastructure, applications and IT consumers, analyzing potential impacts and executing a plan to remediate the issue before it affects the business. This approach demands dedicating support personnel to IT analysis, rather than having them look at monitoring consoles only when they have time or are prompted to do so by complaints from IT consumers.
If you ignore the price to the business, being reactive doesn’t cost a thing.