Congratulations, your IT might be less sick today

I came across an article in Computerworld titled “The Help Desk is Hot Again” articulating the revived popularity of the Help Desk. It explains that the Help Desk “serves as a vital liaison between employees’ mobile technologies and the networks, servers and applications that support them.” Help Desks certainly serve an important purpose; however, this positioning feels slightly askew. For most IT organizations the Help Desk is where you go when you have a problem and need help. Help Desks do not understand how IT consumers are experiencing IT and are certainly not a liaison. I can see how there is a logical leap from issue management to evaluating the health of IT, but do you go to the doctor when you are well?

Until recently, visibility into the consumer side of IT was not considered essential when measuring IT service availability. The assumption was that maniacally monitoring data center health provided enough data to show how effectively IT supported the business. For most organizations, IT availability and ‘end-user’ satisfaction are evaluated with metrics provided by the help desk, showing what went wrong and when. From the perspective of issues this may be acceptable, but it hardly provides an accurate view of how the business is using IT. It would be like asking a doctor “so, how healthy does the world look today?”, where the answer would be “it looks pretty sick”.

This whole situation has been exacerbated by the use of mobile devices, the growth in non-corporate cloud-based application sources and the influx of people entering the industry who were born digital. These new market entrants have learned to become more self-sufficient than any generation before and would rather have the flu than call the service desk. Many of today’s mobile issues are ‘fleeting’, with performance varying as network connectivity becomes increasingly complex and congested. For many, it’s easier just to wait it out. Does the help desk capture this experience? No.

So, if the objective is to understand how IT is used and experienced, then you don’t start from the data center. The starting place is the IT consumer. This requires more than a set of tools giving visibility from ‘the edge’; it will require IT support to organize and focus teams on IT consumer activity. Measuring experience means understanding how IT is used, when it is used and where it is used, not just when there is an issue. Capturing, monitoring and analyzing IT consumer activity allows IT organizations to assess the true IT business impact, regardless of where the user is, what they are using or where their applications are sourced.

This approach is not going to be easy for IT departments that have spent decades focusing on siloed data center elements and back-end application transactions. But IT consumer activity monitoring is not optional. Users do not use one device, do not remain in one place and do not use just one application. IT innovation, mobility and IT consumer creativity will continue to push the limits of IT operations management, and those able to adjust their IT management focus will benefit from better IT decision making and business alignment.

The service desk must evolve into a true high-touch solution, and this can only be done when it is also used to monitor how all IT consumers are experiencing IT. IT organizations that do not plan to focus on their IT consumers will be left struggling to manage increasingly diverse IT needs with tools that provide a datacenter-centric application performance snapshot, stumbling their way towards the edge while trying to see through increasingly complex third-party service black holes.

proactive sounds cool, but being reactive is just easier.

Recently I’ve been involved in discussions about how new IT monitoring tools will make IT support teams smarter and far more proactive. By smarter I mean having a greater understanding of IT health, and by proactive I mean being aware of situations before or as they occur.

I’d argue that becoming smarter is a prerequisite to becoming proactive. Monitoring for issues is much easier when you know what you are looking for and understand the ramifications. The best way for IT support to become smarter is to hire the smartest, most experienced people. Becoming proactive is not so straightforward.

The idea that tools will turn a reactive, crisis-driven IT operations team into a proactive one is nonsense. For decades monitoring tools have been able to set policy forewarning of events and giving support staff a heads-up on potential issues. The reasons this capability has not delivered on the promise are numerous: events flagged as ‘potential issues’ or ‘warnings’ are rarely classified as high-priority items, support staff do not notice them (or ignore them), or the method of event delivery is the wrong one. It has had little to do with the monitoring tools. The reality is that most IT organizations are not measured on outage avoidance but on fixing issues once outages occur.

It’s easier to be the hero who got the order processing application back up than the person who said they helped avoid the problem occurring in the first place (“you did what?” “oh sure you did, well done – help yourself to a medal”). If an organization wants to be proactive then it needs to have people goaled and measured on finding issues before they become problems. Security officers actively monitor and analyze data to proactively identify anomalies, irregular activity and suspicious behaviors, stopping hackers, cyber attacks, viruses, etc. Apparently, it is not acceptable to wait for security problems to occur before they get addressed. For IT support to do the same will require a number of changes, including:

  1. an organization measured against outage avoidance.
  2. information delivered in ways that the support team will take notice of.
  3. information that means something and is actionable.

an organization measured against outage avoidance. An IT organization that prides itself on being proactive but measures itself against MTTR or MTBF is not fully proactive. The speed at which IT operations responds to and fixes an issue is not a good measure of proactive efficiency without factoring in the speed at which the issue was detected in the first place. IT operations effectiveness would have greater relevance if it were tied to outage avoidance. This type of metric is not easy to capture using monitoring tool reporting (too many sources, limited business impact assessment), so it requires a way to immediately consolidate, log and track the identification-to-remediation process. The easiest way to do this is using a service desk. This information would demonstrate how IT operations provides value, while showing increases in IT operational efficiency.
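To make this concrete, here is a minimal sketch of what an outage-avoidance metric could look like, assuming (hypothetically) that the service desk records when an issue began, when monitoring detected it, and whether users ever reported impact. The record shape and field names are my own illustration, not any product’s schema.

```python
from datetime import datetime

# Hypothetical service-desk records. user_reported of None means users
# never noticed the issue, i.e. the outage was avoided.
incidents = [
    {"began": datetime(2013, 7, 1, 9, 0),  "detected": datetime(2013, 7, 1, 9, 5),   "user_reported": None},
    {"began": datetime(2013, 7, 2, 14, 0), "detected": datetime(2013, 7, 2, 14, 40), "user_reported": datetime(2013, 7, 2, 14, 20)},
    {"began": datetime(2013, 7, 3, 11, 0), "detected": datetime(2013, 7, 3, 11, 2),  "user_reported": None},
]

def mean_time_to_detect(incidents):
    """Average gap, in minutes, between an issue starting and monitoring seeing it."""
    gaps = [(i["detected"] - i["began"]).total_seconds() / 60 for i in incidents]
    return sum(gaps) / len(gaps)

def outage_avoidance_rate(incidents):
    """Share of incidents fixed before any user ever reported impact."""
    avoided = sum(1 for i in incidents if i["user_reported"] is None)
    return avoided / len(incidents)
```

The point of the sketch: MTTR alone says nothing about the second number, and it is the second number a proactive organization would be goaled on.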

information delivered in ways that the support team will take notice of. IT organizations invest a lot of time and effort trying to detect and process events, but few put the same effort into ensuring events are immediately delivered to the right IT personnel. A proactive state dictates that event data is delivered and owned as soon as it is detected. This means the mechanism chosen to deliver the data is as important as the effort associated with collecting the information in the first place. Most IT organizations still rely on event management tool consoles; however, an unwatched console will result in missed events. Sending events to mobile devices (e.g. in the form of an IM) and/or the use of alert notification tools can reduce the time it takes to become event-aware. Alert notification tools support a proactive objective by automating the delivery of alerts to the appropriate IT operations personnel through the most effective communications channel, in support of established escalation and outage procedures, and they also provide the mechanism for an event to be delivered, acknowledged and owned.
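The deliver-acknowledge-own loop described above can be sketched in a few lines. This is an illustration only: the escalation chain, timeouts and `deliver` channel are assumptions standing in for whatever notification tooling an organization actually uses.

```python
# Hypothetical escalation chain: who gets the alert, in what order, and
# how many seconds each has to acknowledge before it escalates.
ESCALATION_CHAIN = [
    ("ops-oncall", 300),    # first responder, 5 minutes to acknowledge
    ("ops-manager", 600),   # then the manager, 10 minutes
    ("it-director", None),  # final stop, no further escalation
]

def deliver(recipient, event):
    """Placeholder for a real channel (IM, SMS, a notification tool)."""
    print(f"alert -> {recipient}: {event}")

def notify_with_escalation(event, acknowledged, chain=ESCALATION_CHAIN):
    """Walk the chain until someone acknowledges; return the new owner.

    `acknowledged` is a callback; a real implementation would wait up to
    each step's timeout for the acknowledgement before escalating.
    """
    for recipient, timeout in chain:
        deliver(recipient, event)
        if acknowledged(recipient):
            return recipient  # the event is now delivered, acknowledged and owned
    return None  # nobody acknowledged: the unwatched-console problem, automated
```

The design point is that escalation is part of the delivery mechanism, not an afterthought: an alert that reaches no owner is treated as a failure of the chain, not of the person at the console.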

information that means something and is actionable. If you are not actively looking for something, it’s unlikely you’ll find it. A blindingly obvious statement, but when monitors are used in IT operations they are typically used to aid root-cause analysis on known, reported issues, where support knows there’s an issue and understands the sort of thing to look for. However, when there is no obvious problem it takes skill and experience to scroll through long lists of technical event data and identify the most critical, business-impacting issues. Knowing how things relate to the bigger picture requires the skill to assess the overall impact of multiple unassociated events, and that means taking the yellow ones as seriously as the red ones. This is the new way IT support must work: looking for subtle changes and behaviors in the IT infrastructure, applications and IT consumers, analyzing potential impacts and executing a plan to remediate the issue before it affects the business. This approach demands dedicating support personnel to IT analysis, rather than having them glance at monitoring consoles when they have time or are motivated to do so by complaints from IT consumers.
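“Taking the yellow ones as seriously as the red ones” amounts to ranking events by business impact rather than by the tool’s severity color. A minimal sketch, with made-up events and an assumed criticality table (in practice that would come from a service catalog or CMDB, not hard-coded values):

```python
# Hypothetical events from several monitoring tools.
events = [
    {"id": 1, "severity": "red",    "service": "test-lab",         "users_affected": 2},
    {"id": 2, "severity": "yellow", "service": "order-processing", "users_affected": 500},
    {"id": 3, "severity": "yellow", "service": "email",            "users_affected": 120},
]

SEVERITY_WEIGHT = {"red": 3, "yellow": 2, "green": 0}
# Assumed business criticality per service -- illustrative values only.
CRITICALITY = {"order-processing": 10, "email": 5, "test-lab": 1}

def business_priority(event):
    """Score an event by severity weighted by business criticality and reach."""
    return (SEVERITY_WEIGHT[event["severity"]]
            * CRITICALITY.get(event["service"], 1)
            * event["users_affected"])

ranked = sorted(events, key=business_priority, reverse=True)
# A yellow event on order-processing outranks a red event in the test lab.
```

The exact weighting is arbitrary; what matters is that the ranking function encodes business impact, which a red/yellow console column cannot.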

If you ignore the price to the business, being reactive doesn’t cost a thing.

if NASA monitored like IT operations, would they have made it to the moon?

In nearly every job I’ve had, IT monitoring has been somewhere, either core to my day job or peripherally around the edge. Even though monitoring has been with us for decades it still attracts massive amounts of attention from IT organizations, vendors and venture capital. Red, green, yellow, yellow, green, red – how hard can it be? There have been major shifts in finding new ways to understand the health of IT, including SNMP monitors in the early 1990s and, more recently, the various flavors of APM products. For a software company to make a difference and successfully sell a product in this space, it really needs to innovate and provide something better. A lot better. So I get tired when people say, “monitoring, it’s done isn’t it?”

It’s not. Not by a long long way.

Gartner published a report in May 2013 titled Market Share Analysis: IT Operations Management Software, Worldwide, 2012 (ID: G00249133). It says that the 2012 application performance monitoring (APM) market is over $2 billion, growing at 6.5%, with the availability and performance monitoring market (IT infrastructure monitoring) at $2.8 billion, growing at 7.6%. Even though these IT monitoring areas are considered separate market spaces, the ideal is to combine them, allowing IT organizations to understand the impact the IT infrastructure has on the applications and vice versa. When both areas are combined they become the largest IT management market segment, with over 25% of the $18B total market. To put this into perspective, the combined APM/availability-and-performance revenue (~$4.8B) is larger than configuration management, the second-largest market segment, by over $1B, and configuration management is also growing at a slower rate (6.3%).

Large, small, service provider, telco, SMB or enterprise – everybody has monitoring, so the fact that it remains the highest-growth IT management space is amazing. And even though it’s a huge market, it is not dominated by a few vendors; it is a highly fragmented space with dozens of vendors and hundreds of tools.

Monitoring remains one of the most fragmented IT management spaces, with tools from dozens of vendors ranging from free to hundreds of thousands of dollars. Remaining relevant demands constant innovation, coming from many areas including event collection, event consolidation, event processing, event reporting, ease of use, low complexity, high sophistication, product delivery, and product pricing and licensing. With the need to get clarity on IT services while reducing the cost and effort to achieve it, better ways to monitor are constantly being sought.

all monitoring is not the same
When people think of monitoring, an image that comes to mind is NASA monitoring a moon launch. Dozens of people intensely watching monitors, anxiously looking for irregularities and working closely with their colleagues to identify potential issues that might impact the success of the mission and the safety of the astronauts. Even though each person may have a different view of the health of the mission, collaboration between team members ensures that a holistic view is understood at all times. Throughout the mission, as priorities change, so does what is monitored at each stage and how. In addition, the information displayed on the monitors is continually analyzed and correlated with other data, with the objective of seeking out potential issues that the individual displays may not make clear. NASA monitors space missions with the assumption that something will go wrong, demanding an immediate response to remediate the problem and ensure the success of the mission.

putting too much emphasis on the tools
For decades IT professionals have used products to give them visibility into the health of the IT infrastructure, which is monitored in fragmented piece parts by disparate, non-collaborative teams all providing different views on the health of IT. For many, monitoring happens when resources are available, and unlike NASA, most IT organizations assume everything is fine and look to monitoring to confirm a reported outage and to aid root-cause analysis.

IT organizations depend on tools to provide an understanding of the state of IT. Unfortunately IT continues to fragment and increase in complexity, driving organizations to employ more monitoring tools in an attempt to gain clarity on overall IT health. However, instead of making things easier to understand, this creates additional challenges, with each IT support organization providing increasingly different and potentially conflicting views on the health of the IT infrastructure. Some organizations using dozens of monitoring tools covering every aspect of their IT environment have no ability to clearly identify issues and the impact they have on the business. With each IT support team looking through different monitoring lenses, the ability to gain a holistic, trusted view becomes almost impossible.

avoid liability and attribute blame
When the business is impacted by an IT issue, many organizations bring together the different IT support teams to identify what the issue was, how it was detected and how to avoid it occurring again. Even though senior IT executives do this to pacify and assure the business of IT’s competency and value, each IT support organization will use its monitoring tools as evidence to prove either that it was not their issue, or that the issue was identified and resolved in line with company policy and service levels. This behavior changes monitoring from a proactive, issue-avoidance practice to one used to prove innocence and assign blame.

infrastructure availability does not equal application availability
Routinely, IT support organizations use the statistics gathered by their monitoring tools to show effectiveness, IT availability and business value. Each IT component is monitored to a set of policies primarily derived from how each IT team associates value with the components. The traditional 99.9% availability objective is still used by IT operations as a way to show IT availability. Unfortunately the business does not equate availability with how each component is functioning. IT availability is measured by the performance and availability of the applications and the support the IT organization provides. These two viewpoints on how IT value is measured create confusion and conflict, with IT support teams unable to comprehend that the business does not care about the individual health of each IT component. A business manager will assess the value of the IT organization based on the opinions and input of the people who consumed the IT resource, not on a mountain of confusing, irrelevant technical detail that conflicts with the IT consumer experience. In some cases this situation will drive the business to seek alternative IT providers for new applications and IT services.

how much are IT service quality problems costing business?
The reality is that while monitoring is employed in nearly every business that uses IT, it is not used effectively. While monitoring tools are designed to provide proactive warnings of issues, their effectiveness can only be realized when they are used to show business impact, augmented by an organization focused on proactive monitoring practices and collaborative teamwork. Being proactive requires more than just monitoring tools; it requires:

  1. an organization that actually seeks out issues
  2. information delivery mechanisms that the support teams will take notice of
  3. information delivered in meaningful ways, preferably associated with service levels and business impact

monitoring evolved
Even though monitoring continues to be updated, it’s an evolution, not a set of dramatic changes. In the 1990s the focus was on data center elements because, for many, that is where the majority of IT resources were. Over time the need to understand how IT resources were being provided moved monitoring from basic availability to measuring performance, supported by processes and best practices to ensure specific outages and IT service degradations did not recur. More recently monitoring has evolved in multiple directions. The dynamic nature of the IT infrastructure demands that monitoring keep up with constant change and shifting business priorities. This demand has created a new set of monitoring tools that dynamically discover IT components, establish relationships through various communication methods and map, in real time, how IT resources are used in support of the changing needs of the business. The highly distributed and fragmented IT infrastructure created a demand for tools that can actively search and associate disparate data from disparate sources and then provide, through analysis, information on IT health that could not be achieved by more traditional monitoring approaches. And lastly, the way business consumes IT has forced many IT organizations to focus on the end-user experience. Only by focusing on how end-users consume IT resources will the IT organization be able to fully understand and support the business.

Summarizing all this…
IT and business are synonymous. Monitoring IT like it’s a network and a bunch of servers is going to result in the business demanding more relevant and accurate service measurement – specific to application availability and performance and the IT consumer experience. The critical impact IT has on business means executives continually evaluate the support and services provided by the IT organization and assess ways to improve. For the business, IT value is a very easy metric to measure: availability, performance, responsiveness, flexibility and support. In addition, IT consumers have become major influencers of how IT services are evaluated, delivered and consumed, demanding a different view to understand the health of IT services. As IT consumers use IT resources beyond the corporate data center, the value of IT is assessed as an overall experience, no matter where applications are sourced, what access methods are used or where support is located. The only way to fully understand how the business views IT services is to monitor how IT consumers use IT.

High volumes of disparate event data create confusion and conflict, demanding technology that consolidates, correlates and prioritizes issues aligned with how the business consumes IT services.
IT organizations will still use tools that monitor specific IT elements, as these give specialists the deeper understanding needed to identify a problem’s root cause. However, these types of monitoring tools become event sources feeding monitoring products able to consolidate, filter, correlate and prioritize issues in line with IT service delivery. Achieving this objective demands technology that can easily integrate and associate data into information relevant to both the IT organization and the business.
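The first step in that feed is normalization: turning each element monitor’s native event shape into a common one so consolidation and correlation become possible. A small sketch, with invented event shapes standing in for real SNMP and APM tool output:

```python
# Two hypothetical element monitors emitting events in different shapes.
snmp_events = [{"host": "db01", "level": 2, "msg": "disk 90% full"}]
apm_events  = [{"app": "orders", "txn": "checkout",
                "latency_ms": 4200, "threshold_ms": 2000}]

def normalize_snmp(e):
    """Map a trap-style event into the common schema."""
    return {"source": "snmp", "subject": e["host"],
            "severity": "yellow" if e["level"] < 3 else "red",
            "summary": e["msg"]}

def normalize_apm(e):
    """Map a transaction-latency event into the common schema."""
    breach = e["latency_ms"] > e["threshold_ms"]
    return {"source": "apm", "subject": f'{e["app"]}/{e["txn"]}',
            "severity": "red" if breach else "green",
            "summary": f'latency {e["latency_ms"]}ms (threshold {e["threshold_ms"]}ms)'}

# One consolidated stream, ready for filtering, correlation and prioritization.
consolidated = ([normalize_snmp(e) for e in snmp_events]
                + [normalize_apm(e) for e in apm_events])
```

Once everything shares one schema, the filtering and prioritization described above can operate across tools instead of inside each tool’s own console.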

IT Infrastructure monitoring. Red, green, yellow is no longer enough.

A view on the health of the IT infrastructure is accomplished using monitoring tools – lots of them. This has been the approach for decades, with differences revolving around how the data is collected (the age-old agent vs. agentless argument), the integration provided, how data is processed, how the tools are purchased, and increasingly creative ways to display red, green and yellow. However, it doesn’t matter whether you use a high-cost, low-cost or no-cost monitoring tool, the objective remains the same: get a view of the health of IT.

IT infrastructure monitoring is not glamorous, but it is required: how else is IT operations going to confirm an issue reported through the service desk? However, the way monitoring is used today is not suitable for many of its requirements moving forward.

Monitoring is splitting into two distinct approaches, and what you need from monitoring will determine how and what tools are used. The first approach is the traditional one: monitoring IT health. The second is using monitoring to enable an action, which means collecting and analyzing specific information and using it to support an automation procedure or run an action.

An example of the second approach is monitoring the performance of a cloud IT infrastructure stack, where the objective is not simply to understand the health of the cloud environment but to enable capacity to be dynamically allocated and changed in line with usage or need (aka cloud elasticity). Add to this the fact that cloud environments are moving from server and storage capacity to application services, and the ability to make changes in the cloud becomes far more complex. E.g. making a change to a cloud database may make sense for one application but have a detrimental impact on others.
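The elasticity case can be reduced to its essence: monitoring data feeding a decision rather than a dashboard. A minimal sketch, where the thresholds and sample window are assumptions of mine, not any cloud platform’s defaults:

```python
# Decision outcomes a provisioning layer could act on.
SCALE_UP, SCALE_DOWN, HOLD = "scale_up", "scale_down", "hold"

def elasticity_decision(cpu_samples, high=0.80, low=0.20, window=5):
    """Turn raw utilization samples into a scaling decision.

    Averages the last `window` samples rather than reacting to a single
    reading, so a fleeting spike doesn't trigger a costly capacity change.
    """
    recent = cpu_samples[-window:]
    avg = sum(recent) / len(recent)
    if avg > high:
        return SCALE_UP
    if avg < low:
        return SCALE_DOWN
    return HOLD
```

Note what the function does not know: whether scaling one tier will hurt a neighboring application. That is exactly the database example above, and it is why monitoring-for-action needs service context, not just element thresholds.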

Even though traditional monitoring performance tools are being used to provide a view on cloud health their ability to support decision making is problematic (see diagram 1).

1. Performance policy is defined within each monitoring tool, focused on specific element and element-type thresholds – not on overall cloud service performance.
2. Performance monitoring data does not show how one element’s performance impacts another (e.g. how changing a server or network configuration impacts multiple applications) – creating an inability to make trusted changes.
3. Challenges in pulling together (in real-time) multiple performance feeds into a coherent service or application view – creating a ‘lag’ in making changes and the need for multiple tools and teams to be involved.

Integrating multiple performance feeds to assess overall application/service impact requires a highly sophisticated performance consolidation tool that normalizes, consolidates and filters data and provides an accurate service impact that can be used to support or trigger an action. This tool does not exist.

However, there are sophisticated capacity tools able to take performance data from multiple performance sources and optimize IT resources as a service (e.g. BMC’s BCO product). The best results are achieved when the data received from the supporting performance tools focuses specifically on the environment being provisioned/updated. This enables services to be changed (e.g. orchestrated through a service governor) with greater accuracy (e.g. supporting service placement or making decisions on requested changes in the context of impact).

The future of IT infrastructure monitoring includes processing specific information to make trusted decisions. For example, cloud monitoring will have policy derived from the cloud blueprint (the cloud service component architecture), with possible input from other sources (e.g. a service catalog to guide service levels) – see diagram 2. This will result in one set of policy aimed specifically at the IT components supporting the cloud services, both to assess cloud health and to provide the actionable information needed to make safe changes. This differs from the traditional monitoring I mentioned previously, which collects data from everything and then tries to apply filters and rules to reduce content to provide a true view of IT health.
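Deriving policy from a blueprint rather than monitoring everything can be sketched directly. The blueprint and service-catalog structures below are hypothetical illustrations of the idea, not any cloud product’s format:

```python
# Hypothetical cloud blueprint: the components one service is built from.
blueprint = {
    "service": "order-processing",
    "components": [
        {"name": "web-tier", "type": "server", "count": 4},
        {"name": "orders-db", "type": "database", "count": 1},
    ],
}

# Assumed service-level input, e.g. from a service catalog.
service_catalog = {
    "order-processing": {"max_latency_ms": 2000, "availability": 99.9},
}

def derive_policy(blueprint, catalog):
    """Build monitoring policy scoped to exactly the blueprint's components,
    instead of collecting everything and filtering after the fact."""
    sla = catalog[blueprint["service"]]
    return [{"component": c["name"],
             "monitor": c["type"],
             "latency_budget_ms": sla["max_latency_ms"],
             "availability_target": sla["availability"]}
            for c in blueprint["components"]]
```

Because the policy is generated from the same description used to provision the service, what gets monitored changes automatically when the blueprint does, which is the prerequisite for using monitoring output to make safe changes.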

Focusing on an outcome by monitoring a specific set of components allows the capacity tool to provide accurate placement decisions that can be executed through the governor and the provisioning and configuration tools.

Automated cloud decision making is just one example of the way monitoring is evolving. The same value could be attributed to any IT infrastructure automation initiative including agile development practices (e.g. DevOps).

Today we are at a crossroads. IT operations tools developed to monitor IT infrastructure health are increasingly being considered to provide highly accurate information to support automated decision making. Even though this repurposing can be achieved, the effort, cost and complexity are going to be prohibitive, and it reminds me of a line from an old Irish joke: “sir, to get there, I wouldn’t have started off from here”.

It’s not as if a totally new set of tools is required, although for the cloud, IT decisions may be provided by monitoring embedded in cloud management solutions.


Diagram 3

Diagram 3 describes five areas of differentiation between tools used for monitoring health and ones designed or used for aiding decisions. The most important differentiation is the objective: it dictates the policy, the environment monitored and the integrations required. If you want to make decisions, you set policy based on the decision being made; if you want to check infrastructure health, you set policy based on component thresholds.

Just when you thought monitoring was already complex. It’s about to get more interesting.

do IT users care about the datacenter?

The datacenter – the IT business hub. Or it used to be.

End users could not care less about it. What they care about is application availability and response times, and the ability to get IT access from whatever device they choose and from wherever they want. The business increasingly makes decisions on what applications are used, where the applications are sourced and who supports them. There’s no love affair between the business and the organization called IT operations, because it’s not about technology, it’s about getting the job done. It’s not that datacenter availability isn’t important, it’s just not important to users – the business measures IT value against the quality of support and application availability and performance, not servers, storage and networks.

Some will struggle with the idea that monitoring the datacenter does not equate to understanding and measuring business availability. It was not so long ago that companies providing datacenter outsourcing services would have a huge display in reception, with topology maps showing a red, green, yellow status of the datacenter infrastructure. I can only assume it was designed to show control and understanding, because I’d argue the computer room could spontaneously combust and no-one would be any the wiser until the end-users reported problems accessing their applications.

How many times have you thought “I wonder if the servers are performing well today?” or “I hope my files are backed up and secure”? What you probably think is “email is slow, IT needs to fix it now” and, if data is lost or corrupted, “IT had better get it back now”. My point is this: the datacenter will continue to be critical to the IT organization responsible for managing it – not to the businesses that use it. For the business it’s all about the application, no matter where it resides or who manages it, and the fact that an application requires hardware and software to live is, from the user perspective, irrelevant. It’s assumed.

In late December 2012 Netflix had issues. The fact it was over a holiday period made the problem even more annoying. It was a Netflix problem, and Twitter lit up with customer feedback for Netflix. Netflix blamed the issue on Amazon Web Services servers and said Amazon was addressing it. So, that’s ok then? It’s not a Netflix application problem – it’s an Amazon server problem. It doesn’t matter if Amazon’s servers were the real problem; it is Netflix’s job to make sure their applications are not plagued by a weakness in server capacity, performance, architecture or design, no matter who they decide to source this critical task to. Subscribers to Netflix do not pay Amazon.

It’s the same for any IT organization delivering IT application services, whether internal or external to a business. Monitoring the datacenter to identify and solve issues is one thing – using the same element monitoring to try to demonstrate value to the business is another. Managing the datacenter is mandatory; however, using element-based availability metrics as proof of IT business value and application availability is no longer acceptable.

From a business perspective the value of IT is assessed through the lens of business users – not the datacenter. This will increasingly result in IT value being assessed from the end user to the application source, measured against service levels, which means datacenter components can go up and down all they like as long as it doesn’t have a detrimental effect on business service levels. With the growing trend to use applications from cloud-based service providers, who can tell where all the parts of an application are? Netflix is hardly unique, architecturally, in the way it provides services. As more applications are made available in the cloud, the location of the supporting infrastructure is likely to be in the hands of one or more additional cloud service providers.

So, who cares about the datacenter? The people responsible for managing it, developers, testers and business unit personnel who pay for capacity. For users and the business – it’s all about the application.