Observability Shifts Right

By William Cappelli

Observability first emerged as a focal point of interest in the DevOps community around 2017. Aware that business was demanding highly adaptable digital environments, DevOps professionals realised that high adaptability required a new approach to IT architecture. Whereas digital stacks had historically been monolithic or, at best, coarsely grained, the new stacks would have to be highly modular, dynamic, ephemeral at the component level, and spread over multiple cloud-based services. This new approach to architecture, in turn, forced a new approach to monitoring, one based on the real-time discovery of patterns in streams of highly granular telemetry rather than the comparison of sampled event records against a pre-defined model of the system being observed.

This interest ‘shifted right’ over the next three years as first application operations managers and then IT infrastructure operations managers came to recognise the need for a new approach to monitoring, even in technology domains less directly impacted by the ‘adaptability revolution’.

There were two reasons for this:

First, the principles governing the new system architecture introduced by the DevOps community guided the ongoing modernisation and refactoring of legacy applications and infrastructure.
Second, the newer systems inevitably became intertwined, from a performance and availability perspective, with the stacks already in place.

Biases

Each community, of course, brought with it a distinctive bias in its understanding of what made observability different from traditional monitoring and, to a certain extent, this bias was stoked by the sources from which the community traditionally acquired its software. DevOps professionals looked primarily to Open Source and to start-up vendors building new products targeted at observability from the ground up. Application operations managers took their cues from the APM vendors, who attempted to frame observability as a relatively minor modification of application performance and availability monitoring. Finally, IT infrastructure operations managers, with their long history of working with framework and event management vendors, tended to view observability as a variation on AIOps or next-generation event management. Despite these biases, enough of the vision was shared to make observability technology a coherent market category, with a combined emphasis on telemetry and automated real-time pattern discovery acting as the core organising principle.

Over the last six months, however, a fourth community has emerged to express interest in and influence buying decisions around observability, the community of Service Management professionals. Service Management as a community and as a collection of functions and processes has, in most enterprises, maintained a somewhat remote relationship with the worlds of Application Operations Management and IT Infrastructure Operations Management. In theory, the signals and alerts picked up by application operations managers and IT infrastructure managers ought to be directly routed to the service desk so that any end user or customer issues with the digital environment can be anticipated and even resolved before a user or customer sends a notification. In practice, however, very little communication occurs. Close to 90% of the notifications to the service desk do come from users or customers while the application and infrastructure managers only pass on about 30% of the signals they are receiving. In other words, the channel meant to link the monitoring function to the problem and the incident remediation function is broken.

This has, in turn, meant that the Service Management community has been somewhat insulated from the changes going on elsewhere in IT. Nonetheless, it too has eventually come to face an increase in the number of incidents reported, growth in the amount of noise that needs filtering, and increasing difficulty in diagnosing root causes. Much like their IT Operations colleagues, Service Management professionals have looked primarily to AI and Machine Learning to meet the new challenges they face, with some acknowledgement of the need to expand telemetry beyond event records and logs in the direction of metrics and traces. Even so, their history, culture, and the vendors with whom they typically interact have given the Service Management interpretation a distinctive colour.

CMDBs - The Once and Future Project

For more than two decades, the ideal of a Configuration Management Database (CMDB) has been central to the Service Management view of the world. The CMDB was prescribed as a best practice artefact in ITIL 2.0, and attempts to construct a ‘single source of truth’, an accurate description of what is in an enterprise’s IT estate and the topology that structures that estate, have been ongoing since the early 1990s. (It should be noted that ITIL itself in later versions replaced the CMDB with more loosely defined constructs, but none of these has achieved canonical status.) Indeed, the idea is intuitive and attractive. If there were a single, detailed, truthful record of how the IT estate was laid out, then incident and problem management tasks would be far easier to execute.

It is this vision that the Service Management community has brought with it as it begins to consider and evaluate observability technologies. Right away, alarm bells should be ringing, since it is hard to avoid seeing parallels between the concept of a CMDB and the concept of a predefined data model of the environment, whose centrality to traditional monitoring technologies was perhaps the main reason for their failure. And, indeed, in discussions about observability with Service Management representatives, the idea of a predefined data model reemerges, now in the form of a link to the CMDB, which will act both as a prescriptive standard against which to compare signals coming from the observability platform and as a source of rules and topologies for AI-enabled correlation and causal analysis.

Unfortunately, most CMDB projects have ended in failure. If truth be told, the IT estate of large enterprises has always been too complex and volatile to lend itself to representation by a big all-seeing, all-knowing database that sits in the middle of everything. In the past, the main blocker has been the limitations of human knowledge. There was simply too much happening from an IT perspective for any team to know enough to put together and maintain such a database. With modern levels of digitalisation, that is less of an issue. One can track and trace most elements that constitute a digital environment in an automated way. The big problem, now, is the scale required to ingest all of that information, the accuracy and latency of automated discovery, and, finally, the ability of human beings to interpret what has been discovered and what is being presented. Even with the high levels of automation now possible, the rate at which modern environments evolve outpaces the rate at which information can be accurately captured and the resulting ‘digital twin’ of the environment be maintained.

Hence, as Service Management professionals enter the conversation about what is required for observability, efforts must be made to ensure that the concept of a pre-defined governing data model does not reemerge as a requirement. If linkage to an enterprise CMDB effort comes to be seen as critical, then that linkage should be a loose one. Information drawn from the CMDB could certainly act as another stream of input to whatever AI or ML algorithms are being used to surface patterns in the telemetry streams. The observability technology itself may, if queried on a regular basis by the Service Management function, provide some correctives to the inevitable inaccuracies of whatever automated discovery system is in place. Finally, alerts, anomalies, and causal analyses generated by the enterprise observability function should be made to feed directly into the incident and problem management processes as a supplement to the notifications normally received by the service desk. In other words, there is no question that Service Management has a lot to give to and a lot to get from observability. It is important, however, to make sure that the influence of Service Management does not inadvertently reintroduce the legacy monitoring design plan.
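
To make the idea of a loose linkage concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any vendor's product: the names `Anomaly`, `cmdb_drift`, and `to_incident`, the shape of the CMDB record (an `owner` and a `depends_on` list), and the incident fields are all hypothetical. The point it tries to show is that the CMDB is consulted as optional context, that a missing or stale record never blocks an anomaly, and that dependencies observed in telemetry but absent from the CMDB are surfaced as suggested corrections.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Anomaly:
    """A pattern surfaced from telemetry (hypothetical shape)."""
    service: str
    signal: str
    # Dependencies actually seen in traces, regardless of what the CMDB says.
    observed_dependencies: Set[str] = field(default_factory=set)


def cmdb_drift(anomaly: Anomaly, cmdb: Dict[str, dict]) -> List[str]:
    """Return dependencies observed in telemetry but missing from the CMDB."""
    record = cmdb.get(anomaly.service) or {}
    recorded = set(record.get("depends_on", []))
    return sorted(anomaly.observed_dependencies - recorded)


def to_incident(anomaly: Anomaly, cmdb: Dict[str, dict]) -> dict:
    """Turn an anomaly into an incident record for the service desk.

    The CMDB only enriches the record. If no entry exists, the incident
    is still raised, with the owner left unassigned.
    """
    record = cmdb.get(anomaly.service) or {}
    return {
        "source": "observability",
        "service": anomaly.service,
        "summary": anomaly.signal,
        "owner": record.get("owner", "unassigned"),
        "cmdb_corrections": cmdb_drift(anomaly, cmdb),
    }
```

In this sketch the telemetry, not the CMDB, decides whether an incident exists. The CMDB contributes ownership and topology hints where it has them, and the `cmdb_corrections` field gives the Service Management function a way to feed observed reality back into its own records.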

The Next Five Years

Ever-deepening and expanding digitalisation, coupled with a growing role to be played by AI, points to a future in which the end-to-end flow from digital environment distress signal to response will become automated. Optional and, at times, mandated human intervention will be engineered into that flow but actual instances of intervention will be increasingly few and far between. In such a scenario, the hard boundaries and biases that currently separate the four IT functional communities will blur and maybe, in some cases, disappear completely. At that point, of course, the various understandings of observability are likely to converge and most of the issues discussed here will become historical footnotes. Nonetheless, the high levels of automation I am anticipating are still at least five to seven years away so enterprises should prepare for a lot of robust discussion and artful compromise when it comes to prioritising observability functionality.

Market consolidation will continue on the vendor side of the equation. As observability comes to be seen as a set of functionalities relevant to all IT functional communities (even those, like the Security Management community, beyond the ones discussed in this paper), enterprises will increasingly demand tight integration between observability systems and the rest of the digital infrastructure. Furthermore, and perhaps ironically, the growing centrality of OpenTelemetry to observability implementation efforts will dampen the fears of potential vendor lock-in that usually cause enterprises to hesitate when selecting a solution that is tightly integrated with a broader portfolio of tools. Hence, over the next five years, on the path to maximal automation, we expect vendors like Splunk that are able to embed observability in a broader matrix of IT functionality to dominate product choices.
