Enterprise Edge Computing Model – Part 1: Common Attributes

What the Edge Requires

An Edge environment exhibits attributes that differentiate it from other environments such as cloud or on-premises data centers. Products used to enable enterprise Edge computing must ensure those attributes are met. The following are seven common attributes specific to Enterprise Edge Computing.

Each attribute area contains values that fulfill the attribute. The following is a list of the Edge common attributes, their values and supporting value details.

Lights Out 

  • Locations function without local human resources
    • IT skills and resources optimized 
    • IT personnel placed where the business needs them
    • Edge locations added without requiring an increase in skills and resources 
  • Locations monitored and controlled remotely 
    • Edge managed and monitored using tools that do not require local support
    • Edge facilities monitored and managed as a holistic environment
    • Change at the Edge managed, monitored and guided remotely 
  • Edge environment and infrastructure monitored in real-time, no blind spots
    • Software applications, databases, networks, infrastructure and environment monitored in-line with Edge service delivery 
    • Physical access and local activity monitored 
    • Software, database, hardware and environment monitoring unified to show overall Edge status (see the sketch following this list)
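
A minimal sketch, in Python, of how the unified status above could work: per-domain monitoring feeds are reduced to a single overall Edge state. The feed names and the three-level status scale are illustrative assumptions, not a product design.

    from enum import IntEnum

    class Status(IntEnum):
        OK = 0
        WARNING = 1
        CRITICAL = 2

    def overall_status(feeds):
        # The unified status is the worst status reported by any domain,
        # so one environmental fault is never masked by healthy software.
        return max(feeds.values(), default=Status.OK)

    location = {
        "software": Status.OK,
        "database": Status.OK,
        "hardware": Status.WARNING,   # e.g. a degraded power supply
        "environment": Status.OK,
    }
    print(overall_status(location).name)  # WARNING

The "worst status wins" rule is one common choice; a real product would weight domains against the IT services delivered at that location.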

High Performance

  • High performing data communication to/from the highly distributed Edge environment 
    • Edge software, environment and infrastructure performance monitored to support the end-to-end IT service delivery metrics 
    • Environmental conditions at the Edge aligned with the impact on IT services 
    • Edge performance evaluated against SLAs and cross-facility comparisons
  • Edge equipment fully supported without requiring customization
    • Service performance efficiencies delivered through the elimination of custom work  
    • Personnel performance increased through the elimination of manual activity and the unification of monitored data 
    • Environment performance delivered by unifying disparate conditions and automatically evaluating the impact on the IT services delivered at the Edge 
  • High availability, no single point of failure 
    • Edge environment has high-availability and failover built in to mitigate the risk of failure
    • Edge status information delivered even when outages to the infrastructure and environment occur 
    • Monitoring tools do not stop working when the Edge experiences network or power degradation

Remote Control

  • Remote control and monitoring of IT assets and environmental systems
    • Infrastructure, software and environmental systems controlled and monitored without requiring local personnel
    • Edge infrastructure and environmental systems that do not have remote monitoring capabilities are monitored with solutions that can integrate with them
    • Conditions requiring a human presence at the Edge facility are enabled through monitoring and supported with remote guidance
  • Environmental conditions monitored with alerts created when exceptions are detected
    • Conditions are consolidated to provide an overall environmental Edge state
    • Environmental conditions are correlated to provide cause and impact data 
    • Environmental monitoring is integrated with service management solutions
  • Asset movement and access to facility monitored and tracked
    • Assets are tracked in real time to their physical (actual) location
    • Network outages and power disruption should not prevent assets from being tracked 
    • Environmental conditions are associated with the impacted assets (illustrated in the sketch after this list)
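
As a rough illustration of exception-based environmental alerting correlated to impacted assets, the Python sketch below applies assumed thresholds to assumed sensor readings. The zone names, limits and asset map are invented for the example.

    # Acceptable (low, high) bounds per monitored condition - assumptions.
    THRESHOLDS = {"temp_c": (10.0, 35.0), "humidity_pct": (20.0, 80.0)}

    # Which assets sit in each monitored zone (illustrative).
    ASSETS_BY_ZONE = {"rack-01": ["edge-server-1", "switch-1"]}

    def check(zone, readings):
        alerts = []
        for metric, value in readings.items():
            low, high = THRESHOLDS[metric]
            if not low <= value <= high:
                # Create an alert and associate the condition with the
                # assets it impacts, as described in the list above.
                alerts.append({"zone": zone, "metric": metric, "value": value,
                               "impacted_assets": ASSETS_BY_ZONE.get(zone, [])})
        return alerts

    print(check("rack-01", {"temp_c": 41.2, "humidity_pct": 55.0}))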

Secure

  • Connectivity to and within Edge secured in-line with industry and corporate policy 
    • Edge environment managed against change management policy and controls 
    • Physical changes at the Edge accomplished under remote supervision to reduce risk 
    • Change activity recorded and logged 
  • Edge management tools provide and enforce access permissions for different roles 
    • Edge tools must have role-based access (see the sketch after this list)
    • Edge monitoring dashboards customized around roles 
    • Monitoring and reporting data delivered with access controls and granular permissions
  • Real-world view of physical access and subsequent activity at all Edge locations
    • Controlled and monitored access to Edge locations 
    • Recorded video of Edge facility activity 
    • Alerts created when movement is detected
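
A role-based permission check can be as simple as the Python sketch below. The three roles and the permission names are illustrative assumptions, not a prescribed model.

    # Permissions granted to each role - invented for illustration.
    PERMISSIONS = {
        "facilities": {"view_environment", "ack_environment_alerts"},
        "noc": {"view_environment", "view_infrastructure", "remote_control"},
        "auditor": {"view_environment", "view_infrastructure", "view_access_logs"},
    }

    def authorize(role, permission):
        # Deny by default: unknown roles get no permissions.
        return permission in PERMISSIONS.get(role, set())

    assert authorize("noc", "remote_control")
    assert not authorize("facilities", "remote_control")

Dashboards would then be assembled from only the panels a role is authorized to view, which is the customization the list above describes.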

End User / Consumer Focused

  • Consumer experience captured and used to measure Edge success
    • Constant feedback on application availability, performance and response times
    • Issues managed in-line with SLAs, with feedback on responsiveness and impact
    • Tools ensure business activity is factored into real-time Edge status monitoring
  • Edge managed and supported in-line with the business service levels
    • Performance and availability managed in-line with service levels
    • Environment and infrastructure relationships mapped to manage environmental impact on business assets  
    • Support priorities dynamically adjust in line with business activity
  • Continual location comparisons drive efficiency and performance improvements and help identify and eliminate issues and weaknesses (see the sketch below the list)
    • Enterprise Edge performance evaluated to assess high and low performing locations  
    • Edge infrastructure and environmental system comparisons assessed to identify high and low performing vendor equipment  
    • Location capacity and consumption monitored to identify areas of improvement and optimization 
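
The location comparisons above lend themselves to a simple scoring exercise. The Python sketch below ranks hypothetical locations to surface the best and worst performers; the metrics and the weighting are assumptions made for illustration.

    locations = {
        "store-014": {"availability": 99.95, "avg_response_ms": 120},
        "store-022": {"availability": 99.20, "avg_response_ms": 340},
        "store-031": {"availability": 99.99, "avg_response_ms": 95},
    }

    def score(metrics):
        # Higher availability and lower response time both improve the score.
        return metrics["availability"] - metrics["avg_response_ms"] / 100.0

    ranked = sorted(locations, key=lambda name: score(locations[name]),
                    reverse=True)
    print("highest performing:", ranked[0], "| lowest performing:", ranked[-1])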

Heterogeneous

  • Edge managed consistently irrespective of technology type or vendor
    • Holistic Enterprise Edge infrastructure and environmental conditions monitored   
    • Information normalized from multiple sources 
    • Edge state known with issues addressed using a single source of truth 
  • Out-of-the-box management support for all Edge systems
    • Edge systems managed and monitored without customization 
    • Out of the box support for all new and updated Edge equipment 
    • No migration impact when changing Edge systems 
  • Elimination/abstraction of proprietary hardware monitoring 
    • Single source of truth across all infrastructure and environmental systems
    • Monitoring data normalized and reported (sketched after this list)
    • Elimination of multiple displays and dashboards  
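
The abstraction described above can be pictured as a thin adapter layer. In the Python sketch below, two invented vendor payload shapes are normalized into one common schema; both payload formats are assumptions made for the example.

    def normalize(vendor, payload):
        # Each adapter maps a vendor-specific payload to the common schema.
        if vendor == "vendor_a":      # hypothetical payload shape
            return {"metric": "temperature_c", "value": payload["tempC"]}
        if vendor == "vendor_b":      # hypothetical payload shape, Fahrenheit
            return {"metric": "temperature_c",
                    "value": (payload["temp_f"] - 32) * 5 / 9}
        raise ValueError(f"no adapter for {vendor}")

    # Both sources now land in the same single source of truth.
    print(normalize("vendor_a", {"tempC": 24.0}))
    print(normalize("vendor_b", {"temp_f": 75.2}))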

High Scale

  • Management scales to hundreds or thousands of different size locations
    • Edge environment scale managed without impact or an increase in complexity 
    • Edge environment scale managed without service degradation 
    • Maps and dashboards allow scale to be accomplished without impacting visualization or ease of understanding
  • Locations added and removed without impacting business 
    • Change accomplished without an impact on IT services
    • Management abstraction allows freedom of choice for Edge systems
    • New Edge systems and locations added with low cost, low impact and no additional skills 
  • Performance delivered consistently across all locations 
    • Performance of Edge monitoring unaffected by growth within Edge locations 
    • Performance of Edge monitoring unaffected by growth of Edge locations 
    • Monitoring performance not impacted by Edge systems diversity 

Edge Computing products are designed to meet the requirements of the Edge with functionality that satisfies the common Edge attributes. The following maps each common attribute to the corresponding Edge solution requirements.

Lights Out

  • Attributes: Locations function without local human resources. Locations monitored and controlled remotely. Edge environment and infrastructure monitored in real-time, no blind spots.
  • Solution requirements: Single, trusted view of all remote Edge locations. Edge monitoring delivered even if the network is having issues. Remote control of servers, networks and the physical infrastructure. Automation used where possible to remove local, manual activity. The ability to provide remote guided maintenance and support to personnel visiting the Edge facility. Live video of the Edge facility triggered when activity is detected.

High Performance

  • Attributes: High performing data communication to/from the highly distributed Edge environment. Edge environment fully supported without requiring customization. High availability, no single point of failure.
  • Solution requirements: Solution architecture developed to collect, process and log massive amounts of data, at scale, without a performance impact across the entire Edge infrastructure. Continuous processing of all asset location data providing immediate awareness and tracking of movement activity. Continuous processing of all key environmental conditions providing immediate awareness of the impact and root cause of issues. Integration with service management solutions ensuring support service levels are met.

Remote Control

  • Attributes: Remote control and monitoring of IT assets and environmental systems. Environmental conditions monitored with alerts created when exceptions are detected. Edge asset movement tracked. Access to facility monitored and tracked.
  • Solution requirements: Products monitor infrastructure health, motion detection, asset movement, facility access and environment change. Control of all Edge and data center locations. Low/no touch technology delivering a holistic view of the Edge infrastructure environment and computer equipment even when there’s an impact on the remote location’s power and network.

Secure

  • Attributes: Connectivity to and within each Edge location secured in-line with industry and corporate policy. Role-based management tool permissions. Real-world view of physical access and subsequent activity at all Edge locations.
  • Solution requirements: A single, trusted source to monitor access at Edge locations, recording activities and providing video and alerts when racks are opened and/or assets are moved. Scheduled maintenance activity monitored and guided remotely. Real-time monitoring of systems, networks and databases. Holistic view of all security issues, with automated checks on other locations for similar issues or patterns.

End User / Consumer Focused

  • Attributes: Consumer experience captured and used to measure Edge success. Edge managed and supported in-line with the business service levels. Continual location comparisons drive efficiency and performance improvements and identify and eliminate issues and weaknesses.
  • Solution requirements: Infrastructure and environment tracked to show variations from normal conditions, highlighting potential business-impacting issues. Edge locations monitored holistically with data that allows comparisons between poor and well performing locations. Infrastructure, application and environmental issue impacts managed in line with business severity. Environmental conditions mapped directly to the business assets affected, allowing business impact to be assessed and remediated in-line with SLAs.

Heterogeneous

  • Attributes: Edge managed consistently irrespective of technology type or vendor. Out-of-the-box management support for all Edge systems.
  • Solution requirements: Monitoring of all types of software, database and hardware. An abstraction layer that monitors, collects and processes environmental power and cooling data, irrespective of what equipment is used. Zero-impact, zero-cost monitoring when new environmental equipment is introduced. Elimination of the silos created when monitoring equipment by type or vendor. All Edge facility assets tracked agnostically, with no customization and no network dependency. Integration with all leading Systems, Network, Service Mgmt. and DCIM products, providing a comprehensive view of the Edge and ensuring SLAs are met within each function.

High Scale

  • Attributes: Management scales to hundreds or thousands of different size locations. Locations added and removed without impacting business. Performance delivered consistently across all locations.
  • Solution requirements: Edge environment monitored and viewed holistically and at massive scale without disruption or delay. Solution architecture developed to manage highly diverse, highly dispersed environments. No single point of failure; high availability required as part of the Edge management tool architecture.

The Edge Common Attributes are table stakes. A product that does not exhibit them is not an Edge product. Part 2 describes the attributes that can differentiate Edge management products.

Enterprise Edge Computing Model – Introduction

This post introduces the Enterprise Edge Computing Model. The goal is to provide a foundational understanding of Enterprise Edge computing requirements, objective values, product relevance and positioning. The model helps enterprise organizations evaluate, assess and plan an Edge progression path.

Introduction

Edge computing changes how IT services are delivered, and this has changed how products are chosen, delivered and used. Enterprise Edge computing demands that management products be evaluated on more than features and functions.

The impact of Edge computing is transformational. It requires IT organizations to evaluate how they are organized, the changes in process, the relevance of existing technology and, most importantly, the ability to deliver business value. The immediate challenge for Edge computing is that it is diverse, confusing and undefined. Describing it in high-level, abstract or generic terms is worthless, as it provides little clarity and promotes vendor ‘edge washing’, where any product can be creatively positioned to address the Edge.

The Enterprise Edge Computing Model has been made possible through discussions with, and input from, enterprise companies, analysts and industry experts.

The focus for the Enterprise Edge Computing Model is enterprise organizations that have multiple remote IT locations. These remote locations can be thought of as ‘Lights Out’ Edge data centers, ranging from a small equipment rack in many remote locations to multiple large distributed data centers. For many enterprises, Edge computing environments are diverse and non-standard, requiring new organizational models, sophisticated software application architectures and a high level of abstraction to visualize the environment, deliver low touch control, and scale and manage a heterogeneous mix of equipment.

The three Enterprise Edge Computing Model posts deliver the following content:

  1. Enterprise Edge Computing Model – Common Attributes
  2. Enterprise Edge Computing Model – Product Values
  3. Enterprise Edge Computing Model – Edge Transformation

Enterprise Edge Computing Challenges and Opportunity

Each Enterprise Edge computing use-case type (see The Different Flavors of Edge Computing) presents different challenges and opportunities. As mission-critical applications and data spread to the Edge, scale and performance challenges are introduced, requiring Edge solutions to function efficiently and reliably no matter how many Edge instances a client introduces. Each Edge computing use-case has its own unique challenges and opportunities, but the following are, to varying degrees, common to every Edge computing type.

Scale

  • Challenges: The number and distribution of Edge computing locations makes administration, monitoring, data management, security, status understanding and visualization of the Edge a challenge both technically and organizationally.  
  • Opportunity: New intelligent software architectures are needed that holistically manage highly distributed heterogeneous environments, accomplished by collecting, unifying and processing data from a wide range of sources. This abstracted management layer delivers simplified ‘low touch’ monitoring and administration to eliminate the need for human interaction.

Performance

  • Challenges: Monitoring and managing the Edge performance to/from the consumer/end-point and to/from the Edge to the cloud/data center. 
  • Opportunity: Technology providing an end-to-end view, internal to, and to/from the Edge. This includes transactional performance to the end-point, the data flow from the Edge to the Cloud / Data Center and the performance of the data repositories in all locations. 

Control

  • Challenges: Edge locations run ‘lights out’ resulting in challenges to manage physical access, control IT equipment, manage the environment (power/cooling), track equipment and assess, isolate and remediate issues especially when the network is impacted or unavailable. 
  • Opportunity: Motion detection with low/no touch technology providing a holistic view and control of the Edge infrastructure environment and computer equipment that can function, at acceptable levels, even when there’s an impact on the power and network.

Security

  • Challenges: Aspects of control (access) and all areas of access, threat and data security.
  • Opportunity:  Security solutions providing a unified view of physical, network, access and data, with automated responses or a security service that provides ‘total’ security without needing the client’s involvement. 

Organization

  • Challenges: Managing the Edge the same way as a traditional data center, with different teams responsible for ‘slices’ of the infrastructure, creates significant inefficiencies in respect to support, costs, skills, resources and business availability.
  • Opportunity: Technology that views the Edge holistically, consolidates and unifies data from multiple sources, turning data into information that is relevant to multiple roles. Providing mechanisms for collaboration and information sharing ensuring everyone responsible for the Edge is aware of Edge state. Advanced correlation, AI and analytics used to make information actionable and support automated activities.  

Heterogeneity 

  • Challenges: Infrastructure diversity (different IT and environment equipment) creates fragmented visualization, control and management, resulting in high costs (including skills, resources and resolution time) and the creation of Edge equipment ‘silos’. Complexity will increase exponentially with every custom Edge location added.
  • Opportunity: Technology is required that views, manages, monitors and processes data from multiple Edge sources, providing holistic Edge management without creating equipment or domain silos. This is accomplished with products that manage ‘above’ the individual Edge components, integrating, abstracting and processing data irrespective of what equipment is used. This allows enterprises to choose equipment from multiple vendors without impeding the business or Edge efficiency. The technology allows enterprises to change without impact, evaluate and compare different vendors’ equipment, gain the information needed to negotiate price and support, and avoid vendor ‘lock-in’.

The Different Flavors of Edge Computing

Edge computing is slippery and undefined. The only statement that can be applied to all things Edge is that it is an environment designed to move data processing nearer to business operations and the consumer.

The challenge with Edge is that it has many variations, with professionals focused on enterprise computing regarding it as an evolution of the distributed ‘lights out’ data center concept.

No matter how intelligent the end-point, all Edge approaches share the same architecture: core data center(s) with satellite locations that store and process data and interact with end-points. The number of Edge layers varies. There is no standard Edge definition and no standard Edge environment; Edge is defined by each business and enabled by application architecture.

Edge consists of network gateways, data centers and all things IoT. The purpose of the Edge is to deliver distributed application services, provide intelligence to the end-point, accelerate performance from the ‘core’ or collect and forward data from the Edge end-point sensors and controllers. 

Edge footprint can be the size of sensors and controllers, a small number of network/server racks, a container full of equipment or a large air-conditioned data center.  

Edge common attributes include equipment diversity, distributed remote locations, and ‘lights-out’ with no local support.

Edge Delivers

  • Data-stream acceleration, including real-time data processing without latency.
  • Smart applications and devices that respond to data almost instantaneously, as it’s being created, eliminating lag time.
  • Efficient processing of large amounts of data near the source.
  • Reduced internet bandwidth usage, cutting costs and ensuring applications can be used effectively in remote locations.
  • Data processing without placing data into a public cloud, adding a layer of data security.
  • A better customer experience.

So, What is Edge Computing? 

  • The Edge is a place, it’s where things are, and it’s not the data center or the cloud
  • The Edge will house the IT products that are in data centers and clouds
  • The Edge is not a hardware stack, it is defined by how software is architected and used to deliver business value 
  • Edge computing implementation varies wildly, defined by a broad range of use cases spanning industries and companies within industries
  • Companies will find Edge a major advantage and a major challenge as it changes how IT is delivered and managed
  • Edge didn’t just happen – for many large enterprises distributed IT was already being delivered. However, it’s growing and becoming more business critical, demanding increased attention, investment and focus.

Edge Computing Market Definition

The absence of an agreed and accepted Edge computing definition demanded we create our own. This has resulted in the Edge computing market being split into three different types of use-case:

  1. Enterprise Edge Computing. Remote ‘Lights Out’ Edge Data Centers and Industrial Edge (control systems, self-contained environments). This use-case is sometimes referred to as Operational Technology (OT), but that narrative is narrow and focuses only on the unification of Data Center Infrastructure Management (DCIM) and systems monitoring. The Remote ‘Lights Out’ Edge Data Centers use-case is a broad market where most enterprises reside. It is the closest relation to current data center and cloud environments. It can be a small equipment rack in multiple remote locations or multiple large data centers. It is the most diverse, non-standard Edge environment, requiring new organizational models, sophisticated software application architectures and a high level of abstraction to visualize the environment, deliver low touch control, and scale and manage a heterogeneous mix of equipment.
  2. Container IT Edges. Self-contained units – cell-tower sites, micro-data centers, the ‘data center in a box’ vendors failed to sell a decade ago. This is left to the customer’s imagination, but it is where converged systems live. This Edge ‘in a box’ approach consists of a solution stack comprising one or more of the following: servers, OS, storage, network, and optimized power and cooling to support all the equipment in the contained environment. The containers are highly standardized; however, customization is available to suit specific Edge requirements, with options for additional components. This Edge option will continue to grow as it provides cost-effective scalability and the ability to place an Edge near the consumer no matter where they are.
  3. Internet of Things (IoT), where highly available processors enable real-time analytics for applications that can’t wait too many milliseconds to render decisions. IoT end-points continue to get smarter with greater ability to work independently and make decisions without regular communication with a core platform. They are becoming self-aware, discovering other IoT systems and working together to provide greater value. IoT end-points exist everywhere and can scale to millions of devices. 

The Edge content in this blog site focuses on the first definition. This is Edge computing not driven directly by IoT or 5G wireless networks. It’s an Edge that for many enterprises has existed for years, providing data and communications at remote locations and managed either by local staff or by the IT NOC team. However, the business importance of the remote locations continues to grow as more data and data processing move to remote locations that were once managed ‘good enough’. For these enterprises, especially the ones that spent years optimizing the IT support organization and moving to cloud, the Edge challenge is going to get serious.

Finding Value in Automation

Unmeasured IT automation has little value. Even though IT organizations appreciate what automation provides, once deployed it is rarely measured against value to the business. When automation works it’s ignored; when automation breaks, it’s to blame. When automation is developed internally, the effort and cost are sucked up as part of a project or developed without any real accountability. When IT organizations want to buy an automation product, the justification is typically evaluated against head-count or time savings. Initially this is fine, but if ongoing savings are not captured, additional investment will need new justification – ideally based on business value.

Over the past year I’ve spoken with dozens of IT organizations to try to understand how they capture IT automation value. Responses varied but included “we capture it by logging activations” and “we don’t capture it”. Capturing automation activations is fine, but when something runs 300 times – who cares? The reason given for not capturing automation value was that the value of automation should be obvious. The problem is it isn’t.

IT automation continues to be a major initiative, but for many it fails to deliver on its promise because the value it provides is not understood.

In February 2014 I started work on a practitioner’s guide to automation. I didn’t want it to be something that had to be read cover to cover (who has time for that?) or something that tried to be a definitive, all-encompassing automation book (there are too many automation variations). I wanted to write something that provided a way companies could gain a quick understanding of their automation state, assess value using real calculations and then plot a strategic path forward.

To meet these objectives the following was done:

  • To keep the guide focused it addresses three specific automation use-cases: Provisioning & Configuration, Patching & Compliance and Cloud.
  • To assess automation state and set a strategic path the guide breaks each use-case into five levels (ad-hoc to advanced).
  • To ensure all aspects of automation are covered the guide spans process, people and tools.
  • To measure automation value each level has specific calculations for cost, speed and risk (a sketch follows this list).
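
To make the value calculations concrete, the Python sketch below computes cost, speed and risk figures for a single automation use-case level. The formulas and numbers are illustrative assumptions and are not taken from the guide itself.

    def automation_value(runs_per_month, manual_minutes, automated_minutes,
                         hourly_rate, manual_error_rate, automated_error_rate,
                         cost_per_error):
        # Cost: hours of manual effort no longer spent, priced at the hourly rate.
        saved_hours = runs_per_month * (manual_minutes - automated_minutes) / 60
        # Risk: fewer human errors, each with an assumed remediation cost.
        avoided_error_cost = (runs_per_month * cost_per_error
                              * (manual_error_rate - automated_error_rate))
        return {"cost_saving": saved_hours * hourly_rate,
                "speed_up": manual_minutes / automated_minutes,
                "risk_saving": avoided_error_cost}

    print(automation_value(runs_per_month=300, manual_minutes=45,
                           automated_minutes=5, hourly_rate=80.0,
                           manual_error_rate=0.05, automated_error_rate=0.005,
                           cost_per_error=500.0))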

Automation Use-Cases Aligned to Automation Levels

Even though this document was written while I was working for BMC, it is not focused specifically on BMC products and services. BMC decided to call the guide the Automation Passport. It focuses on what needs to be done irrespective of what technology is being used or planned. Companies are using it to plan an automation strategy, with the process maps defining what management tools are needed, how they should integrate and what data/information is passed. If a company’s automation approach is to use open-source automation toolkits (e.g. Puppet and Chef), embedded automation tools and home-grown talent for development, the passport provides a view of the scope of the automation required and the associated effort.

The following diagrams are examples of the detail contained within the guide.

Process

Provides a view of what needs to be done at each automation level within each use-case. This level of detail can be used as a foundation and tuned to meet each IT organization’s specific requirements.

Example: Patch Process (a level within Patching & Compliance)

People

Provides a view on how an organization and roles change as automation maturity increases.

Example: Roles for Provisioning & Configuration

Technology

Explains the types of technology used at each level in-line with the automation process (orange boxes indicate the technology introduced at this level).

Example: Technology supporting the Govern level process for the Cloud automation use-case

Last year BMC released the first edition of the Automation Passport. It contained the automation model mapped to the  Provisioning & Configuration and Patching & Compliance use-cases.

In January 2015 BMC released the Automation Passport Early Release Edition. This updated version contains the Cloud Automation use-case, value calculations for each use-case level, greater detail on automation roles and responsibilities (including job descriptions), cloud type definitions and explanations on how capacity, performance and availability management tools support and evolve to support the automation of cloud environments.

More detail on the passport can be found at the following location: http://www.bmc.com/it-solutions/automation-passport.html

The latest automation passport can be downloaded (no strings attached) from: http://documents.bmc.com/products/documents/36/96/453696/453696.pdf

Congratulations, your IT might be less sick today

I came across an article in Computerworld titled “The Help Desk is Hot Again” articulating the revived popularity of the Help Desk. It explains that the Help Desk “serves as a vital liaison between employees’ mobile technologies and the networks, servers and applications that support them.” Help Desks certainly serve an important purpose; however, this positioning feels slightly askew. For most IT organizations the Help Desk is where you go when you have a problem and need help. Help Desks do not understand how IT consumers are experiencing IT and are certainly not a liaison. I can see how there is a logical leap from issue management to evaluating the health of IT, but do you go to the doctor when you are well?

Until recently, visibility into the consumer side of IT was not considered essential when measuring IT service availability. The assumption was that maniacally monitoring data center health provided enough data to show how effectively IT supported the business. For most organizations, IT availability and ‘end-user’ satisfaction are evaluated with metrics provided by the help desk, showing what went wrong and when. From the perspective of issues this may be acceptable, but it hardly provides an accurate view of how the business is using IT. It would be like asking a doctor “so, how healthy does the world look today?”, where the answer would be “it looks pretty sick”.

This whole situation has been exacerbated by the use of mobile devices, the growth in non-corporate cloud-based application sources and the influx of people entering the industry who were born digital. These new market entrants have learned to become more self-sufficient than any generation before and would rather have the flu than call the service desk. Many of today’s mobile issues are ‘fleeting’, with performance being a variable impacted by increasingly complex and congested network connectivity. For many, it’s easier just to wait it out. Does the help desk capture this experience? No.

So, if the objective is to understand how IT is used and experienced, then you don’t start from the data center. The starting place is the IT consumer. This requires more than a set of tools giving visibility from ‘the edge’; it will require IT support to organize and focus teams on IT consumer activity. Measuring experience means understanding how IT is used, when it is used and where it is used, not just when there is an issue. Capturing, monitoring and analyzing IT consumer activity allows IT organizations to assess the true business impact of IT, regardless of where the user is, what they are using or where their applications are sourced.

This approach is not going to be easy for IT departments that have spent decades focusing on siloed data center elements and back-end application transactions. IT consumer activity monitoring is not optional. Users do not use one device, do not remain in one place and do not use just one application. IT innovation, mobility and IT consumer creativity will continue to push the limits of IT operations management, with those able to adjust their IT management focus benefiting from better IT decision making and business alignment.

The service desk must evolve to be a true high-touch solution, and this can only be done when it is also used to monitor how all IT consumers are experiencing IT. IT organizations that do not plan to focus on their IT consumers will be left struggling, trying to manage increasingly diverse IT needs using tools that provide a datacenter-centric application performance snapshot, stumbling their way towards the edge while trying to see through increasingly complex third-party service black holes.

proactive sounds cool, but being reactive is just easier.

Recently I’ve been involved in discussions about how new IT monitoring tools will make IT support teams smarter and far more proactive. By smarter I mean having a greater understanding of IT health, and by proactive I mean being aware of situations before or as they occur.

I’d argue that becoming smarter is a prerequisite to becoming proactive. Monitoring for issues is much easier when you know what you are looking for and understand the ramifications. The best way for IT support to become smarter is to hire the smartest, most experienced people. Becoming proactive is not so straightforward.

The idea that tools will turn a reactive, crisis-driven IT operations team into a proactive one is nonsense. For decades monitoring tools have been able to set policy forewarning of events, giving support staff a heads-up on potential issues. The reasons this capability has not delivered on its promise are numerous, including ‘potential issue’ or ‘warning’ events rarely being classified as high-priority items, support staff not noticing (or ignoring) them, or the method of event delivery being the wrong one. It has had little to do with the monitoring tools. The reality is that most IT organizations are not measured on outage avoidance but on fixing issues once outages occur.

It’s easier to be the hero who got the order processing application back up than the person who said they had helped avoid the problem occurring in the first place (“you did what?” “oh sure you did, well done – help yourself to a medal”). If an organization wants to be proactive then it needs to have people goaled and measured on finding issues before they become problems. Security officers actively monitor and analyze data to proactively identify anomalies, irregular activity and behaviors, monitoring events to stop hackers, cyber attacks, viruses, etc. Apparently, it is not acceptable to wait for security problems to occur before they get addressed. For IT support to do the same will require a number of changes, including:

  1. an organization measured against outage avoidance.
  2. information delivered in ways that the support team will take notice of.
  3. information that means something and is actionable.

an organization measured against outage avoidance. An IT organization that prides itself on being proactive but measures itself against MTTR or MTBF is not fully proactive. The speed at which IT operations responds to and fixes an issue is not a good measure of proactive efficiency without factoring in the speed at which the issue was detected in the first place. IT operations effectiveness would have greater relevance if it were tied to outage avoidance. This type of metric is not easy to capture using monitoring tool reporting (too many sources, limited business impact assessment), so it requires a way to immediately consolidate, log and track the identification-to-remediation process. The easiest way to do this is using a service desk. This information would demonstrate how IT operations provides value, while showing increases in IT operational efficiency.
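
As a rough illustration, the Python sketch below derives an outage-avoidance rate and a detection-to-remediation time from issue records. The record format, and the idea of flagging business impact per issue, are assumptions made for the example.

    from datetime import datetime

    issues = [
        {"detected": datetime(2015, 2, 1, 9, 0),
         "resolved": datetime(2015, 2, 1, 9, 40),
         "business_impact": False},   # caught and fixed before an outage
        {"detected": datetime(2015, 2, 2, 14, 0),
         "resolved": datetime(2015, 2, 2, 16, 30),
         "business_impact": True},    # became a visible outage
    ]

    # Proactive measure: how often issues were resolved before impact.
    avoided = sum(1 for i in issues if not i["business_impact"])
    print(f"outage avoidance rate: {avoided / len(issues):.0%}")

    # Detection-to-remediation, the span MTTR alone does not capture.
    mean_minutes = sum((i["resolved"] - i["detected"]).total_seconds() / 60
                       for i in issues) / len(issues)
    print(f"mean detection-to-remediation: {mean_minutes:.0f} minutes")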

information delivered in ways that the support team will take notice of. IT organizations invest a lot of time and effort trying to detect and process events, but few put the same effort into ensuring events are immediately delivered to the right IT personnel. A proactive state dictates that event data is delivered and owned as soon as it is detected. This means the mechanism chosen to deliver the data is as important as the effort associated with collecting the information in the first place. Most IT organizations still rely on event management tool consoles; however, an unwatched console will result in missed events. Sending events to mobile devices (e.g. in the form of an IM) and/or using alert notification tools can reduce the time it takes to become event-aware. Alert notification tools support a proactive objective by automating the delivery of alerts to the appropriate IT operations personnel through the most effective communications channel, in support of established escalation and outage procedures, and also provide the mechanism for an event to be delivered, acknowledged and owned.
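
The delivery-and-escalation mechanism might look like the Python sketch below, where alerts are routed by severity and escalated when unacknowledged. The channels, targets and timeouts are illustrative assumptions.

    # Ordered escalation routes per severity - invented for illustration.
    ROUTES = {
        "critical": [("im", "oncall-primary"), ("sms", "oncall-manager")],
        "warning":  [("im", "oncall-primary")],
    }
    ACK_TIMEOUT_MINUTES = {"critical": 5, "warning": 30}

    def deliver(alert, ack_received):
        severity = alert["severity"]
        for channel, target in ROUTES[severity]:
            print(f"send via {channel} to {target}: {alert['summary']}")
            if ack_received():   # ownership taken, stop escalating
                return
            print(f"no ack within {ACK_TIMEOUT_MINUTES[severity]} min, escalating")
        print("escalation exhausted, paging the duty manager")

    deliver({"severity": "critical", "summary": "rack-01 over temperature"},
            ack_received=lambda: False)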

information that means something and is actionable. If you are not actively looking for something, it’s unlikely you’ll find it. A blindingly obvious statement, but when monitors are used in IT operations they are typically used to aid root-cause analysis on known, reported issues, where support knows there’s an issue and understands the sort of thing to look for. However, when there is no obvious problem it takes skill and experience to scroll through long lists of technical event data to identify the most critical, business-impacting issues. Knowing how things relate to the bigger picture requires the skill to assess the overall impact of multiple unassociated events, and that means taking the yellow ones as seriously as the red ones. This is the new way IT support must work: looking for subtle changes and behaviors in the IT infrastructure, applications and IT consumers, analyzing potential impacts and executing a plan to remediate the issue before it affects the business. It demands dedicating support personnel to IT analysis, rather than having them turn to monitoring consoles only when they have time or are motivated to do so by complaints from IT consumers.

If you ignore the price to the business, being reactive doesn’t cost a thing.

if NASA monitored like IT operations would they have made it to the moon?

In nearly every job I’ve had, IT monitoring has been somewhere, either core to my day job or peripherally around the edge. Even though monitoring has been with us for decades, it still attracts massive amounts of attention from IT organizations, vendors and venture capital. Red, green, yellow, yellow, green, red – how hard can it be? There have been major shifts in finding new ways to understand the health of IT, including SNMP monitors in the early 1990s and, more recently, the various flavors of APM products. For a software company to make a difference and successfully sell a product in this space, it really needs to innovate and provide something better. A lot better. So I get tired when people say, “monitoring, it’s done isn’t it?”

It’s not. Not by a long long way.

Gartner published a report in May 2013 titled Market Share Analysis: IT Operations Management Software, Worldwide, 2012 (ID: G00249133). It says the 2012 application performance monitoring (APM) market is over $2 billion, growing at 6.5%, with the availability and performance monitoring market (IT infrastructure monitoring) at $2.8 billion, growing at 7.6%. Even though these IT monitoring areas are considered separate market spaces, the ideal is to combine them, allowing IT organizations to understand the impact the IT infrastructure has on the applications and vice versa. Combined, the two areas become the largest IT management market segment, with over 25% of the $18B total market. To put this into perspective, the joint APM/availability and performance revenue (~$4.8B) is larger than configuration management, the second largest market segment, by over $1B, and configuration management is growing at a slower rate (6.3%).

Large, small, service provider, telco, SMB or enterprise – everybody has monitoring, so the fact that it remains the highest growth IT management space is amazing. And even though it’s a huge market, it is not dominated by a few vendors; it is a highly fragmented space with dozens of vendors and hundreds of tools.

Monitoring remains one of the most fragmented IT management spaces, with tools from dozens of vendors ranging from $free to $hundreds of thousands. Remaining relevant demands constant innovation, with innovation coming from many areas including event collection, event consolidation, event processing, event reporting, ease of use, low complexity, high sophistication, product delivery, and product pricing and licensing. With the need to get clarity on IT services while reducing the cost and effort to achieve it, better ways to monitor are constantly being sought.

all monitoring is not the same
When people think of monitoring, an image that comes to mind is of NASA and the way it monitors a moon launch: dozens of people intensely watching monitors, anxiously looking for irregularities and working closely with their colleagues to identify potential issues that may impact the success of the objective and the safety of the astronauts. Even though each person may have a different view of the health of the mission, collaboration between the team members ensures a holistic view is understood at all times. Throughout the mission priorities change, and so does what is monitored at each stage and how. In addition, the information displayed on the monitors is continually analyzed and correlated with other data, with the objective of seeking out potential issues that the individual monitoring displays may not make clear. NASA monitors space missions with the assumption that something will go wrong, demanding an immediate response to remediate the problem and ensure the success of the mission.

putting too much emphasis on the tools
For decades IT professionals have used products to give them visibility into the health of the IT infrastructure, which is monitored in fragmented piece parts by disparate, non-collaborative teams all providing different views on the health of IT. For many, monitoring is accomplished when resources are available and, unlike NASA, most IT organizations assume everything is fine and look to monitoring to confirm a reported outage and to aid root-cause analysis.

IT organizations depend on tools to provide an understanding of the state of IT. Unfortunately IT continues to fragment and increase in complexity, driving organizations to employ more monitoring tools in an attempt to gain clarity on overall IT health. However, instead of making things easier to understand, this creates additional challenges, with each IT support organization providing increasingly different and potentially conflicting views on the health of the IT infrastructure. Some organizations using dozens of monitoring tools covering every aspect of their IT environment have no ability to clearly identify issues and the impact they have on the business. With each IT support team looking through different monitoring lenses, gaining a holistic, trusted view becomes almost impossible.

avoid liability and attribute blame
When the business is impacted by an IT issue, many organizations bring together the different IT support teams to help identify what the issue was, how it was detected and how to avoid it occurring again. Even though senior IT executives do this to pacify and assure the business of IT’s competency and value, each IT support organization will use its monitoring tools as evidence to prove either that the issue was not theirs or that it was identified and resolved in line with company policy and service levels. This behavior changes monitoring from a proactive, issue-avoidance practice to one used to prove innocence and assign blame.

infrastructure availability does not equal application availability
Routinely, IT support organizations use the statistics gathered by their monitoring tools to show effectiveness, IT availability and business value. Each IT component is monitored to a set of policies primarily derived from how each IT team associates value with the components. The traditional 99.9% availability objective is still used by IT operations as a way to show IT availability. Unfortunately, the business does not equate availability with how each component is functioning. IT availability is measured by the performance and availability of the applications and the support the IT organization provides. These two viewpoints on how IT value is measured create confusion and conflict, with IT support teams unable to comprehend that the business does not care about the individual health of each IT component. A business manager will assess the value of the IT organization based on the opinions and input of the people who consumed the IT resource, not on a mountain of confusing, irrelevant technical detail that conflicts with the IT consumer experience. In some cases this situation will drive the business to seek alternative IT providers for new applications and IT services.

how much are IT service quality problems costing business?
The reality is that while monitoring is employed in nearly every business that uses IT, it is not used effectively. While monitoring tools are designed to provide proactive warnings of issues, their effectiveness can only be realized when they are used to show business impact, augmented by an organization focused on proactive monitoring practices and collaborative teamwork. Being proactive requires more than just monitoring tools; it requires:

  1. an organization that actually seeks out issues
  2. information delivery mechanisms that the support teams will take notice of
  3. information delivered in meaningful ways, preferably associated with service levels and business impact

monitoring evolved
Even though monitoring continues to be updated, it’s an evolution, not a set of dramatic changes. In the 1990s the focus was on data center elements because, for many, that is where the majority of IT resources were. Over time, the need to understand how IT resources were being provided moved monitoring from basic availability to measuring performance, supported by a set of processes and best practices to ensure specific outages and IT service degradations did not recur. More recently, monitoring has evolved in multiple directions. The dynamic nature of the IT infrastructure demands that monitoring keep up with constant change and business priorities. This demand has created a new set of monitoring tools that dynamically discover IT components, establish relationships through various communication methods and dynamically map, in real-time, how IT resources are used in support of the changing needs of the business. The highly distributed and fragmented IT infrastructure created demand for tools that can actively search and associate disparate data from disparate sources and then provide, through analysis, information on IT health that could not be achieved by more traditional monitoring approaches. And lastly, the way business consumes IT has forced many IT organizations to focus on the end-user experience. Only by focusing on how end-users consume IT resources will the IT organization be able to fully understand and support the business.

Summarizing all this…
IT and business are synonymous. Monitoring IT like it’s a network and a bunch of servers is going to result in the business demanding more relevant and accurate service measurement – specific to application availability and performance and the IT consumer experience. The critical impact IT has on business means executives continually evaluate the support and services provided by the IT organization and assess ways to improve. For the business, IT value is a very easy metric to measure: availability, performance, responsiveness, flexibility and support. In addition, IT consumers have become major influencers of how IT services are evaluated, delivered and consumed, demanding a different view to understand the health of IT services. As IT consumers use IT resources beyond the corporate data center, the value of IT is assessed as an overall experience, no matter where applications are sourced, what access methods are used or where support is located. The only way to fully understand how the business views IT services is to monitor how IT consumers use IT.

High volumes of disparate event data create confusion and conflict, demanding technology that consolidates, correlates and prioritizes issues aligned with how the business consumes IT services.
IT organizations will still use tools that monitor specific IT elements, as these give specialists the deeper understanding needed to identify a problem’s root cause; however, these tools act as event sources feeding monitoring products able to consolidate, filter, correlate and prioritize issues in line with IT service delivery (see the sketch below). Achieving this demands technology that can easily integrate and associate data into information relevant to both the IT organization and the business.
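
A minimal Python sketch of such a consolidation layer follows: duplicate element events are dropped, each event is mapped to a business service and the result is ordered by service priority. The service map and priorities are invented for illustration.

    # Which business service each monitored element supports (assumption).
    SERVICE_MAP = {"db-02": "order-processing", "web-07": "order-processing",
                   "test-04": "qa-lab"}
    SERVICE_PRIORITY = {"order-processing": 1, "qa-lab": 3}  # 1 = most critical

    def consolidate(raw_events):
        seen, out = set(), []
        for event in raw_events:
            key = (event["source"], event["check"])
            if key in seen:              # drop duplicate element events
                continue
            seen.add(key)
            service = SERVICE_MAP.get(event["source"], "unmapped")
            out.append({**event, "service": service,
                        "priority": SERVICE_PRIORITY.get(service, 5)})
        return sorted(out, key=lambda e: e["priority"])

    events = [{"source": "test-04", "check": "cpu"},
              {"source": "db-02", "check": "latency"},
              {"source": "db-02", "check": "latency"}]   # duplicate
    print(consolidate(events))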

a path to improving end user experience

I don’t believe anyone can dispute the growing influence end users have on how IT services are chosen, sourced and evaluated. This does not mean IT operations organizations are ready to fully embrace the end user as a specific focus. Many assume application transaction monitoring and mobile device software update support are enough – at least for the time being. The reality is that it isn’t enough, and treating the end user like peripheral hardware is not to their benefit. This is managing the situation – not enabling the end user.

Improving end user experience is not about keeping an eye on users or trying to support their mobile devices; it’s about removing IT barriers, reducing complexity and making them more self-sufficient and productive. This objective is best broken down into logical areas:

  1. Support
  2. Social Enablement
  3. Security & Resilience
  4. Productivity

Each area has a set of activities and objectives:

  • Support: Identify, address and report common/local issues, pre-emptive problem management and real-time end user IT status specific to their individual needs and priorities.
  • Social Enablement:  Social, communication and collaboration tools to foster and enable information flow between different users with common interests, goals and objectives.
  • Security & Resilience: End user and device authentication, content protection and data protection and recovery.
  • Productivity: BYOD enablement allows business to be conducted from any device and location. Users can download and are given access to applications, local resources and information on company facilities based on their specific needs and within company policy.

It is unrealistic to think the objectives for each activity can be accomplished all at once. They are only achievable if each activity has a path containing logical, measurable steps. This is also needed because each activity can have ties to others (e.g. delivering a level of support requires a level of security and resilience).

In the paper Path to Improving the End-User Experience the activities are explained and broken down into the five levels (undefined, reactive, proactive, service and business) providing objectives to assess the current end user environment and improve upon it.

A barrier to success is IT operations’ tendency to enable users from the datacenter perspective. If the end user is the focus then the starting point is the end user (do IT users care about the datacenter?). However, to show value a plan must have two perspectives: one for IT operations and the other for the end user. In the paper, each level describes the activity and its value to both IT operations and the end user. This allows IT operations to associate effort and investment directly with end user productivity.

Improving end-user experience and satisfaction and making users more productive increases a company’s effectiveness and makes it more competitive. It’s a no-brainer.

IT Infrastructure monitoring. Red, green, yellow is no longer enough.

A view of the health of the IT infrastructure is accomplished using monitoring tools – lots of them. This has been the approach for decades, with the differences revolving around how the data is collected (the age-old agent vs. agentless argument), the integration provided, how data is processed, how the tools are purchased, and increasingly creative ways to display red, green and yellow. However, it doesn’t matter whether you use a high-cost, low-cost or no-cost monitoring tool; the objective remains the same – get a view of the health of IT.

IT infrastructure monitoring is not glamorous, but it is required; how else is IT operations going to confirm an issue reported through the service desk? However, the way monitoring is used today is not suitable for many of its requirements moving forward.

Monitoring is splitting into two distinct approaches, and what you need from monitoring determines how and which tools are used. The first approach is the traditional one: monitoring IT health. The second is using monitoring to enable an action, which means collecting and analyzing specific information and using it to support an automation procedure or run an action.

An example of the second approach is monitoring the performance of a cloud IT infrastructure stack where the objective is not simply to understand the health of the cloud environment but to enable capacity to be dynamically allocated and changed in line with usage or need (aka cloud elasticity). Add to this the fact that cloud environments are moving from server and storage capacity to application services, and the ability to make changes in the cloud becomes far more complex; e.g. making a change to a cloud database may make sense for one application but have a detrimental impact on others.
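
A bare-bones version of that elasticity decision is sketched below in Python; the utilization bounds and single-instance scaling step are illustrative assumptions, not a recommended policy.

    SCALE_OUT_AT, SCALE_IN_AT = 0.80, 0.30   # assumed utilization policy bounds

    def elasticity_decision(utilization, instances):
        if utilization > SCALE_OUT_AT:
            return instances + 1             # add capacity in line with usage
        if utilization < SCALE_IN_AT and instances > 1:
            return instances - 1             # reclaim idle capacity
        return instances

    print(elasticity_decision(0.86, instances=4))  # 5: scale out
    print(elasticity_decision(0.22, instances=4))  # 3: scale in

The point is that the monitoring output feeds a decision, not just a dashboard; the hard part, as the following paragraphs explain, is trusting the measurement enough to act on it.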

Even though traditional monitoring performance tools are being used to provide a view on cloud health their ability to support decision making is problematic (see diagram 1).

1. Performance policy is defined within each monitoring tool focused on specific element and element type thresholds – not on overall cloud service performance.
2. Performance monitoring data does not show how one element’s performance impacts another (e.g. how changing a server or network configuration impacts multiple applications) – creating an inability to make trusted changes.
3. Challenges in pulling together (in real-time) multiple performance feeds into a coherent service or application view – creating a ‘lag’ in making changes and the need for multiple tools and teams to be involved.

Integrating multiple performance feeds to assess overall application/service impact requires a highly sophisticated performance consolidation tool that normalizes, consolidates, filters data and provides an accurate service impact that can be used to support or trigger an action.  This tool does not exist.

However, there are sophisticated capacity tools able to take performance data from multiple performance sources and optimize IT resources as a service (e.g. BMC’s BCO product). The best results are achieved when the data received from the supporting performance tools focuses specifically on the environment being provisioned/updated. This enables services to be changed (e.g. orchestrated through a service governor) with greater accuracy (e.g. supporting service placement or making decisions on requested changes in the context of impact).

The future of IT infrastructure monitoring includes processing specific information to make trusted decisions. For example, cloud monitoring will have policy derived from the cloud blueprint (the cloud service component architecture) with possible input from other sources (e.g. a service catalog to guide service levels) – see diagram 2. This will result in one set of policy aimed specifically at the IT components supporting the cloud services, used both to assess cloud health and to provide the actionable information needed to make safe changes. This differs from the traditional monitoring mentioned previously, which collects data from everything and then tries to apply filters and rules to reduce the content to a true view of IT health.
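
The blueprint-derived policy idea can be sketched as follows in Python; the blueprint structure, component types and thresholds are all assumptions made for illustration.

    # A toy cloud blueprint: the components that make up one service.
    BLUEPRINT = {
        "service": "order-api",
        "components": [{"name": "web-07", "type": "web"},
                       {"name": "db-02", "type": "database"}],
    }

    # Monitoring thresholds per component type (assumed values).
    TYPE_THRESHOLDS = {"web": {"cpu_pct": 70},
                       "database": {"cpu_pct": 60, "iops": 5000}}

    def policy_from_blueprint(blueprint):
        # One policy set scoped to the blueprint's components,
        # not to everything in the estate.
        return {c["name"]: TYPE_THRESHOLDS[c["type"]]
                for c in blueprint["components"]}

    print(policy_from_blueprint(BLUEPRINT))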

Focusing on an outcome by monitoring a specific set of components allows the capacity tool to provide accurate placement decisions that can be executed through the governor and the provisioning and configuration tools.

Automated cloud decision making is just one example of the way monitoring is evolving. The same value could be attributed to any IT infrastructure automation initiative including agile development practices (e.g. DevOps).

Today we are at a crossroads. IT operations tools developed to monitor IT infrastructure health are increasingly being considered to provide highly accurate information to support automated decision-making. Even though this re-purposing can be achieved, the effort, cost and complexity are going to be prohibitive, and it reminds me of a line from an old Irish joke: “sir, to get there, I wouldn’t have started off from here”.

It’s not as if a totally new set of tools is required, although for the cloud, IT decisions may be provided by monitoring embedded in cloud management solutions.


Diagram 3

Diagram 3 describes five areas of differentiation between tools used for monitoring health and ones designed/used for aiding decisions. The most important differentiation is the objective, which dictates the policy, the environment monitored and the integrations required. If you want to make decisions, you set policy based on the decision being made; if you want to check infrastructure health, you set policy based on component thresholds.

Just when you thought monitoring was already complex. It’s about to get more interesting.