Custom Anomaly Detection with Splunk IT Service Intelligence and Machine Learning Toolkit v3.2 - Part 1

Iman Makaremi & Andrew Stein recently worked with a customer participating in the Machine Learning Customer Advisory Program to customize the customer's Splunk IT Service Intelligence (ITSI) machine learning workflow. This is a detailed, technical walkthrough of what they operationalized using the Splunk platform.

Catching anomalies in data is like fly fishing—there are a lot of different fish you can catch with a woolly bugger or a zonker or a royal wulff, and that’s why nearly every fisherman carries them stuck to their silly-looking-but-useful fishing hat. Some rivers, however, require a customized fly to lure the big ones out of the deep. If you’ve already been using Splunk ITSI, you’re already catching a lot of anomalous fish with its broadly-useful algorithms, and that’s why it's on the tip of so many fishing lines in data-driven organizations around the world.

Today we’re going to discuss tying a custom fly for your river to lure from the deep an additional, specific type of anomaly that you’ve been chasing using data in Splunk ITSI.

We’re going to leverage the latest version of the Splunk Machine Learning Toolkit (MLTK 3.2) to tie your custom fly. If you need a refresher on the free Splunk MLTK app and how it works with the ITSI premium Splunk app, take a detour and review that before continuing. Once you feel ready, don your rubbery fishing pants and let’s wade into the river of data.

Packaged Anomaly Detection in Splunk ITSI

Today, there are three kinds of time series anomaly detection algorithms packaged inside Splunk ITSI that do not require a data scientist to use—they work out of the box: Adaptive Thresholding, Trending, and Cohesive.

The Adaptive Thresholding system baselines from your data and creates meaningful thresholds automatically for your KPIs based on the time policies that you define.

Splunk ITSI also has proprietary algorithms for finding meaningful anomalies that thresholds would never find. The Trending algorithm monitors a single KPI by comparing its current behavior to its past. A rolling window of the KPI is monitored using a scoring function that is based on nonparametric statistics. If the score grows anomalously large, the algorithm raises an alert.

The Cohesive algorithm instead watches a group of KPIs that are expected to behave similarly and raises an alert when the behavior of one or more of them changes. Like the Trending algorithm, it calculates scores on a rolling window of the KPIs and reports on those with significantly high scores.
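To make the rolling-window scoring idea concrete, here is a minimal sketch in Python. It is not the proprietary ITSI algorithm: it swaps in a simple nonparametric score (distance from the rolling median, measured in median absolute deviations), and the function name and window size are assumptions made for illustration.

```python
from statistics import median

def rolling_scores(values, window=8):
    """Score each point against the `window` points that precede it.

    Stand-in scoring function (an assumption, not ITSI's):
    |value - rolling median| / rolling median absolute deviation (MAD).
    """
    scores = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = median(history)
        mad = median([abs(v - center) for v in history])
        deviation = abs(values[i] - center)
        if mad > 0:
            scores.append(deviation / mad)
        else:
            # A flat history has no spread, so any deviation is unbounded.
            scores.append(float("inf") if deviation > 0 else 0.0)
    return scores

# Hypothetical KPI values, one per calculation window.
kpi = [10, 11, 10, 12, 11, 10, 11, 12, 11, 30]
scores = rolling_scores(kpi)
print(scores)
```

The final value (30) scores far higher than the point before it, which is the kind of jump an alert would key on.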

A Different Kind of Anomaly Detection

The packaged machine learning anomaly detection algorithms in Splunk ITSI have been shown to cover a large subset of time series anomaly detection use cases in IT operations analytics (ITOA). But if you are an ITSI master, you may want to expand your anomaly detection coverage with custom machine learning. Let's go through the steps.

In many large environments, the behavior of a subset of KPIs can impact the behavior of other ones. How some KPIs affect other ones depends on the behaviors of the environment. The behavior of an environment is usually expected not to change much, otherwise it could mean something went wrong and one of the other ITSI anomaly detection methods would have found the problem. But you want more. You want to link KPIs together ad hoc without changing your ITSI production instance to find new types of anomalies.

For example, the IT admin of a website expects the ratio of the count of 500s to that of 400s to stay more or less the same. If this ratio deviates a lot, it means that something went wrong and they have to find the root cause, which could be a new buggy release.
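A hand-written check for this one behavior could look like the sketch below. The counts are made up, and the `tolerance` band (how far the ratio may drift from its median before we call it a deviation) is a hypothetical parameter, not something ITSI defines.

```python
from statistics import median

def ratio_alerts(counts_5xx, counts_4xx, tolerance=0.5):
    """Flag windows whose 5xx/4xx ratio strays from the median ratio.

    tolerance is a hypothetical relative band: 0.5 means +/-50% of baseline.
    """
    ratios = [s / c for s, c in zip(counts_5xx, counts_4xx) if c > 0]
    baseline = median(ratios)
    flagged = []
    for i, (s, c) in enumerate(zip(counts_5xx, counts_4xx)):
        if c == 0:
            continue  # the ratio is undefined for this window
        if abs(s / c - baseline) > tolerance * baseline:
            flagged.append(i)
    return baseline, flagged

# Made-up counts per time window; window 4 simulates a buggy release.
counts_5xx = [10, 12, 9, 11, 40, 10]
counts_4xx = [100, 120, 90, 110, 100, 100]
baseline, flagged = ratio_alerts(counts_5xx, counts_4xx)
print(baseline, flagged)
```

This works for one relationship that we already know about. It does not scale to relationships we have not spotted yet, which is the motivation for learning them from data.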

Some of these behaviors are easy to discover and infer, but most are not. And as environments get larger, the behaviors become more complex and dynamic. This is where machine learning saves the day.

Once we learn a behavior, the anomaly detection is the easy part. We just need to monitor the difference between the actual value of the KPI of interest versus what our machine learning model predicts and watch for significant deviations. Our MLTK shows several examples of this difference between an expected value and an observed value—check out the Predict Numeric Fields Assistant and look for the panel called Residuals Line Chart.
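As a rough sketch of that monitoring step: learn how large the residuals (actual minus predicted) normally are on historical data, then flag any new window whose residual falls outside that band. The three-standard-deviation band and all the numbers below are assumptions for illustration. This is not how the MLTK assistant computes anything.

```python
from statistics import mean, stdev

def fit_residual_band(actual, predicted, k=3.0):
    """Return (center, half_width) of the normal residual range.

    k is a hypothetical multiplier on the standard deviation.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    return mean(residuals), k * stdev(residuals)

def is_anomalous(actual_value, predicted_value, center, half_width):
    """True when the residual falls outside the learned band."""
    residual = actual_value - predicted_value
    return abs(residual - center) > half_width

# Made-up history: observed KPI values and the model's predictions.
hist_actual = [5, 7, 6, 8, 5, 6, 7, 6]
hist_pred = [5.5, 6.5, 6.0, 7.5, 5.5, 6.0, 6.5, 6.5]
center, half_width = fit_residual_band(hist_actual, hist_pred)

print(is_anomalous(6, 6.4, center, half_width))  # small residual
print(is_anomalous(1, 6.2, center, half_width))  # large residual
```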

Too many words, no pretty graphs? Let's see in practice how we can use MLTK to create predictive KPIs and monitor services and entities in a new way inside ITSI.

Our Test Dataset

For this experiment, we use data from a customer call center. The call center logs 10 kinds of events. From these events, we are interested in the following KPIs:

  1. Number of calls received (CR),

  2. Number of handled calls (CH),

  3. Number of logged in agents (ALI),

  4. Number of logged out agents (ALO),

  5. Time a caller spends until they drop the call (Queue Time), and

  6. Time a caller spends talking to an agent (TalkTime).

We already have these KPIs in ITSI, as you can see above. The calculation window of each KPI is 15 minutes, and the searches run every 15 minutes as well.

We believe that the number of handled calls depends on the number of received calls, the number of available agents, and the amount of time callers talk to agents or wait before hanging up without talking to any agents. We could be using any KPIs or even any data in Splunk Enterprise here; this is just an example.

We need to get this data into MLTK to build a model of the behavior we want to learn.

Data Preparation

To get the KPIs that we care about, we use the following search, which might look a little intimidating but is very simple to reuse. In this search, we set service_name to Call Center, which is the name of our service. We pick the fields that we care about on the last line of the search and that’s it.

index=itsi_summary
| join kpiid
[| inputlookup service_kpi_lookup
| rename _key as serviceid title as service_name
| eval kpi_info = mvzip('kpis._key', 'kpis.title', "==@@==")
| fields kpi_info service_name serviceid
| mvexpand kpi_info
| rex field=kpi_info "(?<kpiid>.+)==@@==(?<kpi_name>.+)"
| fields - kpi_info]
| search service_name="Call Center"
| timechart span=15m avg(alert_value) AS avg_value BY kpi_name limit=0
| fields _time, CH, CR, *Time, ALO, ALI
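If the search still looks intimidating, the sketch below restates its logic in plain Python. It only illustrates the data flow and is not something you would run in Splunk. `summary_events` stands in for the KPI events the search reads, and `service_lookup` for the rows of service_kpi_lookup. The sample records are invented.

```python
from collections import defaultdict

def kpi_table(summary_events, service_lookup, service_name, span=900):
    """Average alert_value per KPI name in fixed time buckets (900 s = 15m)."""
    # mvzip + mvexpand + rex: one kpiid -> kpi_name pair per KPI of the service
    kpi_names = {}
    for service in service_lookup:
        if service["title"] != service_name:
            continue  # same effect as: search service_name="Call Center"
        for kpiid, title in zip(service["kpis._key"], service["kpis.title"]):
            kpi_names[kpiid] = title

    # join kpiid + timechart span=15m avg(alert_value) BY kpi_name
    buckets = defaultdict(lambda: defaultdict(list))
    for event in summary_events:
        name = kpi_names.get(event["kpiid"])
        if name is None:
            continue  # the join drops events with no matching KPI
        bucket_start = event["_time"] - event["_time"] % span
        buckets[bucket_start][name].append(event["alert_value"])

    return {
        start: {name: sum(v) / len(v) for name, v in by_name.items()}
        for start, by_name in sorted(buckets.items())
    }

# Invented sample data (epoch seconds for _time).
service_lookup = [
    {"title": "Call Center", "kpis._key": ["k1", "k2"], "kpis.title": ["CH", "CR"]},
    {"title": "Other", "kpis._key": ["k3"], "kpis.title": ["X"]},
]
summary_events = [
    {"_time": 0, "kpiid": "k1", "alert_value": 4.0},
    {"_time": 600, "kpiid": "k1", "alert_value": 6.0},
    {"_time": 900, "kpiid": "k2", "alert_value": 10.0},
    {"_time": 100, "kpiid": "k3", "alert_value": 1.0},
]
table = kpi_table(summary_events, service_lookup, "Call Center")
print(table)
```

The result is one row per 15-minute bucket with one column per KPI name, which is the shape the model needs.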

Pro tip: Does your data come from multiple ITSI services or even from outside of ITSI? No problem. You can modify the above search to include all the data you need. After all, it’s a Splunk search. The sky is the limit.


Let's take this search and go to the magic land, erm, MLTK.

Building a Predictive Model

Prediction is all about estimating an unknown event. In this case, we are not estimating the future; we are estimating the present to uncover hidden anomalies. The workflow for predicting the future or the present is very similar, so those of you who have read the previous ITSI and MLTK blog posts will see a lot of similarities (check out Part 1, Part 2, and Part 3).

We want to build a model that predicts the number of handled calls, CH, which is a numeric field. Therefore, we choose the Predict Numeric Fields Assistant in the Machine Learning Toolkit and paste the above search into the search bar.


We use a month of historical data to train our model. We will use the fitted model as our model of this behavior of the call center.

Once we hit the search button, the field menus are populated with the fields in the search result. We choose CH as the field to predict (also known as the target field) and the rest as the predictor fields. We choose the Random Forest regressor as the algorithm and give the model a name that we can remember: CHRFModel, a Random Forest model for CH.

Leaving everything else at its default, we click Fit to get a sense of how good the model can be. The assistant splits the data into training data and test data. It uses the training data to fit a model and the test data to measure the accuracy of the model. This is a machine learning best practice. You always want to test your model with data that it didn’t see during fitting, to check whether it really generalizes well or is just memorizing what it has seen.
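The idea behind the split can be sketched in a few lines of Python. This is only an illustration of holding data back for testing. The `test_fraction` and `seed` values are assumptions and do not describe the assistant's defaults or its implementation.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows and hold back a fraction that the model never trains on."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = rows[:]         # copy, leave the caller's list untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-ins for the 15-minute rows of KPI values
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only, and every accuracy number is then computed on `test`.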

After fitting is finished, the assistant returns six different ways of evaluating the model:

The top two charts provide two ways to compare the actual values of the target field with the model's predictions. The chart on the left is an overlay of the two. This chart can be sorted in three ways: in the original order of the data, by the actual values, or by the predicted values.

The scatter chart on the right is a quick way to understand how far this model is from perfection. The horizontal and vertical axes show the actual and predicted values of the target field. In a perfect scenario, all of the blue dots would fall on the yellow line, which is where the predicted values are equal to the actual values. The predicted values are scattered around the yellow line fairly well, except at its two ends. When the actual value is zero, the model’s prediction can be as high as 7. Also, for actual values above 8, the model’s prediction flattens out. Not good.

Let’s look at the other charts. The next two charts show the residuals between the actual and predicted values. The chart on the left shows the residuals in their original order, and the one on the right is their histogram. The residuals of a perfect model are all equal to zero. So, the sharper the histogram’s peak and the closer the peak is to zero, the better the model. The residuals chart shows a lot of sudden spikes, which could mean the model isn't able to capture some of the behavior of the target field. A closer look at these spikes shows that they are cyclical. Maybe we can capture some of that in the model by including time as a model input too. We will try this in the next section.
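One way to picture "time as a model input" is to derive cyclical fields from each row's timestamp, so the model can learn patterns by hour or weekday. The sketch below assumes `_time` holds epoch seconds and uses UTC for simplicity. The field names `hour_of_day` and `day_of_week` are hypothetical.

```python
from datetime import datetime, timezone

def add_time_features(row):
    """Return a copy of the row with hour-of-day and day-of-week added."""
    ts = datetime.fromtimestamp(row["_time"], tz=timezone.utc)
    return {
        **row,
        "hour_of_day": ts.hour,       # 0..23
        "day_of_week": ts.weekday(),  # Monday=0 .. Sunday=6
    }

# Invented row: 15:00 UTC on 1 January 1970 (a Thursday).
row = {"_time": 54000, "CH": 6, "CR": 9}
enriched = add_time_features(row)
print(enriched)
```

A real call center would more likely use its local time zone, since staffing and call volume follow local business hours.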

The bottom left panel shows two important metrics about the accuracy of the model. R2 is a statistic that measures how much of the variation in the target field the model explains. The R2 of a perfect model is 1. An R2 of 0.91 is a good start. Actually, in many applications we would be happy with an R2 smaller than this. Keep in mind that there is no universal rule on what a good R2 is: depending on the application and how much precision is needed, a given value of R2 can be either good or bad. Root Mean Square Error (RMSE) is another measure of model accuracy. A perfect model has an RMSE equal to 0. As with R2, there is no golden value for RMSE, because RMSE depends on the scale of the target field. For example, if we multiply the target field by 10, the RMSE is also multiplied by 10.
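Both metrics are easy to compute by hand, which also makes the scale effect visible. The sketch below uses the standard definitions (R2 as one minus the ratio of the residual sum of squares to the total sum of squares) on made-up numbers.

```python
from math import sqrt

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

# Made-up target values and predictions.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]

# The same data with the target field multiplied by 10.
actual_x10 = [a * 10 for a in actual]
predicted_x10 = [p * 10 for p in predicted]

print(r_squared(actual, predicted), rmse(actual, predicted))
print(r_squared(actual_x10, predicted_x10), rmse(actual_x10, predicted_x10))
```

R2 is the same for both versions of the data, while RMSE grows tenfold. That is why an RMSE value only means something next to the scale of the field it was computed on.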

Last but not least, the bottom right panel shows the summary of the model. What this panel reports depends on the selected algorithm.

In the second part of this post, we’ll walk through how to improve the model, find anomalies and make the model available in other apps.

