Why ITOps Need to Rage WITH the Machine (Learning)

Historically, IT professionals have been sceptical of the role of machine learning in IT Operations. Many feel that the industry has been talking about it for almost 10 years, yet they still find themselves doing things manually, as they have always done.

In our everyday lives, we cannot pretend that machine learning doesn’t exist; whole business models are being based on it. For example, some retailers utilise it to proactively send customers clothing in anticipation that they will like and purchase the items; it’s likely that you unlock your phone thanks to advancements in machine learning; and it even enables Google to sort through, tag and classify your photos automatically, based on the objects detected (caravans, carnivals, cats…).

In terms of value, machine learning allows IT Operations to focus its time on the more enriching or higher-value tasks. Machine learning shouldn’t be different or difficult for these teams - in fact, I would say that it’s absolutely essential when faced with today’s IT challenges.

Applications are now loosely coupled with infrastructure: fragmented across physical, virtual and platform-as-a-service tiers, running within containers and based on microservices. Agile development, automation and infrastructure-as-code are accelerating this at ever-increasing speed. A modern application delivery toolchain is as complex and automated as a next-generation production line. Meanwhile, the reputation of a brand, which in some cases has taken decades to build, can be damaged in moments by an outage or security vulnerability.

Closing the feedback loop with real-time insights is required at all stages of the application lifecycle to ensure continuous improvement. But as this is based on such large and complex data sets, how do humans keep up? The answer is that they simply can’t. Organisations must detect poor service quality immediately. Mean time to detect (MTTD) must be seconds, if not a negative value - detecting an incident before it actually happens.

To gain the confidence of IT professionals, the industry needs to follow in the footsteps of organisations such as Google, which have already been successful in developing packaged machine learning for very discrete but high-value use cases. Event Management (EM) in IT Operations is a great example. An efficient event management process is essential for any organisation that needs to deliver top-quality services.

However, IT teams are struggling to manage the volume, complexity and unstructured nature of events generated by monitoring solutions. These events are essential to detect and triage incidents, but there are a number of challenges in managing them, including:

  • Teams spending more time managing events, creating rules and configuring thresholds than actually fixing the problem.
  • Increasing number of events facing IT Ops teams, leading to high MTTD of incidents and poor incident management.
  • False positives leading to alert fatigue, which has a greater impact later on when a genuine problem arises.
  • Alert storms leading to duplication of incidents, a lack of correct prioritisation and high MTTD.

Just like Google Photos, imagine if all your different alerts and events could be categorised and classified through unsupervised machine learning. Thresholds could be set automatically based on baselined behaviour, with alerts raised instantly when there is an anomaly. This would mean less time spent managing events, fewer events to analyse overall, and more time focused on what’s most important - improving your services.
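To make the idea of baselined thresholds a little more concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the function name, the window size, the multiplier k and the sample latency figures are all assumptions made up for this example. Each new value is compared against the mean and standard deviation of the values that came just before it, so nobody has to configure a static threshold by hand.

```python
from statistics import mean, stdev


def dynamic_threshold_alerts(values, window=10, k=3.0):
    """Return the indices of points that fall outside mean +/- k * stdev
    of the preceding `window` points (a simple rolling baseline).

    `window` and `k` are illustrative defaults, not recommended settings.
    """
    alerts = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a perfectly flat baseline (sigma == 0).
        band = k * max(sigma, 1e-9)
        if abs(values[i] - mu) > band:
            alerts.append(i)
    return alerts


# Hypothetical response times in milliseconds, with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 250, 100, 102]
print(dynamic_threshold_alerts(latencies))  # → [11]
```

Only the spike at index 11 is flagged; the normal jitter around 100 ms raises nothing. A real system would need to go much further, for example by accounting for seasonality and by stopping anomalies from polluting the baseline that follows them.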

Splunk is focused on packaging machine learning for discrete, high-value use cases within IT operations. Check out our whitepaper “Make Your IT Events Less Eventful” or try out the free ITSI Event Analytics sandbox.



Posted by Guillaume Ayme