Why ITOps Need to Rage WITH the Machine (Learning)

By Splunk

Historically, IT professionals have been sceptical of the role of machine learning with IT Operations. Many feel that the industry has been talking about it for almost 10 years, yet they still find themselves doing things manually, as they have always done.

In our everyday lives, we cannot pretend that machine learning doesn’t exist; whole business models are being based on it. For example; some retailers utilise it to proactively send customers clothing in the anticipation they will like and purchase items; it’s likely that you unlock your phone thanks to advancements in machine learning; and it even enables Google to sort through, tag, and classify your photos automatically, based on the objects detected (caravans, carnivals, cats…)

In terms of value, machine learning allows IT Operations to focus its time on the more enriching or higher-value tasks. Machine learning shouldn’t be different or difficult for these teams - in fact, I would say that it’s absolutely essential when faced with today’s IT challenges.

Applications are now loosely coupled with infrastructure; fragmented between physical, virtual and platform-as-a-service tiers, running within containers and based on microservices. Agile development, automation and infrastructure-as-code is accelerating this at ever increasing speeds. A modern application delivery toolchain is as complex and automated as a next-generation production line. Meanwhile, the reputation of a brand that in some cases has taken decades to build, can be damaged in moments, thanks to an outage or security vulnerability.

Closing the feedback loop with real-time insights is required at all stages of the application lifecycle to ensure continuous improvement. But as this is based on such large and complex data sets - how do humans keep up? The answer is that they simply can’t. Organisations must detect poor service quality immediately. Mean time to detect (MTTD) must be seconds, if not negative value - detecting an incident before it actually happens.

To gain the confidence of IT professionals, footsteps need to follow where organisations such as the likes of Google, have already been successful; developing packaged machine learning for very discrete, but high-value use cases. Event Management (EM) from the perspective of IT Operations would be a great example, for example. An efficient process of managing events is essential for any organisation that requires delivering top-quality services.

However, IT teams are struggling to manage the volume, complexity and unstructured nature of events generated by monitoring solutions. These events are essential to detect and triage incidents, but there are a number of challenges in managing, including:

Teams spending more time managing events, creating rules and configuring thresholds than actually fixing the problem.
Increasing number of events facing IT Ops teams, leading to high MTTD of incidents and poor incident management.
False positives leading to alert fatigue, having a higher impact later down the line when facing a genuine problem.
Alert storms leading to duplication of incidents, a lack of correct prioritisation and high MTTD.

diagram service issue noise Just like Google Photos, imagine if all your different alerts and events could be categorised and classified through unsupervised machine learning. Thresholds could be automatically set based on baselined behaviour and alerting instantly when there is an anomaly. This would mean less time spent managing the events, less events to analyse overall, and more time focused on what’s most important - improving your services.

Splunk is focused on packaging machine learning for discrete, high-value use cases within IT operations. Check out our whitepaper “Make Your IT Events Less Eventful” or try out the free ITSI Event Analytics sandbox.

----------------------------------------------------
Thanks!
Guillaume Ayme

Three Ways to Get Uptime in No Time

3 ways you can maximize your uptime—helping you protect your business and satisfy your customers.

IT 1 Min Read

A Preview of Splunk for Hyperledger Fabric

Splunk provides operational insights into a complex enterprise blockchain like Hyperledger Fabric.

IT 8 Min Read

Introducing Pulsar IO

Jerry provides an overview of Pulsar IO, a framework for moving data into and out of Apache Pulsar, explaining how it works and how to build connectors using Pulsar IO.

About Splunk

The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.

Founded in 2003, Splunk is a global company — with over 7,500 employees, Splunkers have received over 1,020 patents to date and availability in 21 regions around the world — and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.

Learn more about Splunk

Why ITOps Need to Rage WITH the Machine (Learning)

Related Articles

Three Ways to Get Uptime in No Time

A Preview of Splunk for Hyperledger Fabric

Introducing Pulsar IO

About Splunk

Subscribe to our blog

Connect with Splunk on X

Connect with Splunk on Instagram