February 04, 2021

Smarter Root Cause Analysis: Determining Causality from your ITSI KPIs

By Greg Ainslie-Malik

Root cause analysis can be difficult when you are troubleshooting complex IT systems. In this blog, we are going to take you through how you can perform root cause analysis on your IT Service Intelligence (ITSI) episodes using machine learning or, more specifically, causal inference.

The approach shown here is included in the Smart ITSI Insights app for Splunk, and this blog largely details how to use the app's ITSI Episode Analysis dashboard. Before we get going, it is worth mentioning that the capabilities shown here depend on having version 3.4 of the Deep Learning Toolkit installed and operational.

Episode Analysis

To begin with, we’re going to take a look at all of our episodes in ITSI using the ITSI Episode Analysis dashboard in the app. We can choose to view these by criticality or over a specific time window.

A few basic reports are displayed about the episodes: trend lines by service over time and a breakdown of the affected services, so you can see at a glance whether a particular service appears problematic. Beneath these reports is a table listing all of the episodes, detailing the time each was raised, its title, the affected service and its severity.

Figure: ITSI Episode Analysis dashboard

Causal Analysis

If you click on any of the episodes in the table, some dashboard panels will start to populate below. These panels present the causal relationships between the KPIs that the affected service relies on, showing which KPIs are affecting each other.

The calculations are performed over a 4-hour window immediately prior to the episode being generated, so we can quickly assess what these relationships look like ahead of an episode being raised.

The table displays all of the KPIs that appear to have a direct impact on the health score of the affected service – in other words, these are the likely culprits behind the episode being raised. Beneath the table, you will also be able to see a chart that highlights all of the relationships between the KPIs for the affected service. You can hover over this chart to see the relationships for a given KPI.
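The post does not say which causal inference algorithm the app runs, so the following is only a rough sketch of the general idea, with lagged correlation swapped in as a crude stand-in: score each KPI by how strongly its earlier values track the later health score, then rank the KPIs. Lagged correlation is not a causal test, and the function names, KPI names and sample values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlation(cause, effect, lag=1):
    """Correlation between cause[t - lag] and effect[t]."""
    if len(cause) != len(effect):
        raise ValueError("series must have the same length")
    if lag < 1 or lag >= len(cause) - 1:
        raise ValueError("lag must leave at least two paired samples")
    return pearson(cause[:-lag], effect[lag:])


def rank_kpis(kpis, health, lag=1):
    """Rank KPI names by the absolute lagged correlation with the health score."""
    scores = {
        name: abs(lagged_correlation(series, health, lag))
        for name, series in kpis.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical samples: the health score drops one step after disk usage rises.
    disk_used = [0.50, 0.55, 0.70, 0.90, 0.95, 0.97]
    cpu_load = [0.30, 0.20, 0.35, 0.25, 0.30, 0.20]
    health = [1.00, 0.95, 0.90, 0.75, 0.55, 0.50]
    for name, score in rank_kpis({"disk_used": disk_used, "cpu_load": cpu_load}, health):
        print(name, round(score, 3))
```

In practice the input series would be the KPI values from the analysis window before the episode. A KPI near the top of this ranking is a candidate to inspect, not a proven cause.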

Root Cause Analysis

If you click on the table that shows the service linked to the root_cause_kpis, you will be taken to the ITSI deep dive dashboard, with a swim lane for each KPI in the table. The data on display covers the window from 45 minutes before the episode was generated to 15 minutes after it, so an hour around the episode.
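As a small illustration of that window arithmetic, the sketch below derives the start and end of the deep dive range from an episode timestamp. The function name and the choice to return epoch seconds are assumptions made for the example, not something the app documents.

```python
from datetime import datetime, timedelta, timezone

# Offsets taken from the text: 45 minutes before and 15 minutes after the episode.
BEFORE_EPISODE = timedelta(minutes=45)
AFTER_EPISODE = timedelta(minutes=15)


def deep_dive_window(episode_time):
    """Return (earliest, latest) as epoch seconds around a timezone-aware episode time."""
    earliest = episode_time - BEFORE_EPISODE
    latest = episode_time + AFTER_EPISODE
    return int(earliest.timestamp()), int(latest.timestamp())


if __name__ == "__main__":
    # Hypothetical episode time, used only to exercise the function.
    episode = datetime(2021, 2, 4, 12, 0, tzinfo=timezone.utc)
    earliest, latest = deep_dive_window(episode)
    print(earliest, latest, latest - earliest)
```

The difference between the two values is always 3600 seconds, which matches the one-hour view described above.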

Figure: Root Cause Analysis deep dive

In this example, you can see that the likely cause of the episode being generated is that the disk space used was running very high.

Hopefully, this blog has shown how you can determine root cause from your episodes using machine learning, so that you can more easily identify the source of problems across your environment.

Happy Splunking!

Greg Ainslie-Malik

Greg is a recovering mathematician and part of the technical advisory team at Splunk, specialising in how to get value from machine learning and advanced analytics. Previously the product manager for Splunk’s Machine Learning Toolkit (MLTK), he helped set the strategy for machine learning in the core Splunk platform. A particular career highlight was partnering with the World Economic Forum to provide subject matter expertise on the AI Procurement in a Box project.

Before working at Splunk, he spent a number of years with Deloitte and, prior to that, BAE Systems Detica, working as a data scientist. Ahead of getting a proper job, he spent way too long at university collecting degrees in maths, including a PhD on “Mathematical Analysis of PWM Processes”.

When he is not at work he is usually herding his three young lads around while thinking that work is significantly more relaxing than being at home…
