
ITSI and Sophisticated Machine Learning

Fellow Splunker Andrew Stein and I recently had a chance to work on a new machine learning project with a customer who was deploying the Splunk IT Service Intelligence (ITSI) solution. The customer asked us: “How can we leverage machine learning in Splunk to find leading indicators that predict an outage within our critical services?”

So we began investigating whether we could successfully apply machine learning to a Splunk ITSI deployment. Spoiler alert: we were successful.

We took the following steps along our journey:

1. Explored the qualities of key performance indicators (KPIs) in ITSI.
2. Discovered the ITSI summary index.
3. Improved searches with additional statistics.
4. Fed the data set into the Splunk Machine Learning Toolkit (MLTK).
5. Built a machine learning model.
6. Created alerts.

Most machine learning processes use information collected in the past to predict possible events in the future. We needed to build a data set from the information in ITSI in a form that we could feed into a machine learning algorithm.

While exploring the ITSI data, we found that ITSI keeps a history of the KPIs and corresponding reports, which proved to be an excellent source of data for our project. Additionally, the customer had already defined KPIs for service availability and health. With access to the historical values of these KPIs, we knew we could apply machine learning to the problem.

[Screenshot: the Splunk ITSI Service Analyzer]

Next, we found that the ITSI summary index stores knowledge about each service and its corresponding KPIs. We also found that the KPI values are numeric, which was essential for the project. Most importantly, we discovered that ITSI leverages the Splunk KV store to map each KPI to a specific service, which expedites searching for and retrieving details about each service.
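Before joining in the service names, it is worth poking at the raw summary index to see what is there. Here is a minimal exploration sketch; kpiid and alert_value are fields our searches below rely on, while serviceid is an assumption whose name can vary by ITSI version:

\code
index=itsi_summary earliest=-60m
| stats count avg(alert_value) AS avg_alert_value BY serviceid kpiid
\code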

We explored the ITSI summary index through the Web Store Service. Here is the search we used:

\code
index=itsi_summary
| join kpiid
    [| inputlookup service_kpi_lookup
     | rename _key as serviceid title as service_name
     | eval kpi_info = mvzip('kpis._key', 'kpis.title', "==@@==")
     | fields kpi_info service_name serviceid
     | mvexpand kpi_info
     | rex field=kpi_info "(?<kpiid>.+)==@@==(?<kpi_name>.+)"
     | fields - kpi_info]
| search service_name="Web Store Service"
| timechart span=5m avg(alert_value) AS avg_value BY kpi_name
\code
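The subsearch's mvzip/mvexpand/rex combination is the trick that flattens the multivalued kpis._key and kpis.title arrays from the KV store lookup into one row per KPI. Here is a minimal synthetic sketch of that pattern with made-up values, so you can see each step in isolation:

\code
| makeresults
| eval kpi_key = split("k1,k2,k3", ","), kpi_title = split("CPU,Memory,Disk", ",")
| eval kpi_info = mvzip(kpi_key, kpi_title, "==@@==")
| mvexpand kpi_info
| rex field=kpi_info "(?<kpiid>.+)==@@==(?<kpi_name>.+)"
| table kpiid kpi_name
\code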


We were thrilled with the results because the search surfaced field-level details stored in the data set (index=itsi_summary) and the lookup (service_kpi_lookup). If you run this search in ITSI, you will see similar granularity in your deployment:

Service
Entity
KPI Name
Alert Value

The search results also included ServiceHealthScore, an additional KPI that ITSI creates whenever you define a new service. With these results, we had enough information to apply machine learning to predict future values of ServiceHealthScore.

While our first search was a great start, we needed additional statistics before we could apply a machine learning algorithm. We added min, max, and median statistics, plus string fields for the day of the week and the hour of the day, to create a richer data set to feed into the machine learning algorithm. (Note the trailing underscore appended to the day and hour values: it turns them into categorical strings rather than plain numbers.)

\code
index=itsi_summary
| join kpiid
    [| inputlookup service_kpi_lookup
     | rename _key as serviceid title as service_name
     | eval kpi_info = mvzip('kpis._key', 'kpis.title', "==@@==")
     | fields kpi_info service_name serviceid
     | mvexpand kpi_info
     | rex field=kpi_info "(?<kpiid>.+)==@@==(?<kpi_name>.+)"
     | fields - kpi_info]
| search service_name="Web Store Service"
| timechart span=5m max(alert_value) AS max_value min(alert_value) AS min_value avg(alert_value) AS avg_value median(alert_value) AS mean_value BY kpi_name
| eval this_date_hour = strftime(_time, "%H")
| eval this_date_day = strftime(_time, "%w")
| eval this_date_day = this_date_day."_"
| eval this_date_hour = this_date_hour."_"
\code
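If you want to see these time features on their own, here is a tiny self-contained sketch using makeresults, with no ITSI data required:

\code
| makeresults
| eval this_date_hour = strftime(_time, "%H")."_"
| eval this_date_day = strftime(_time, "%w")."_"
| table _time this_date_day this_date_hour
\code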

Now we had a comprehensive data set to feed into the machine learning algorithm to predict ServiceHealthScore. However, we still needed a target field that carries a future value of the health score on each row of the data set. How could we do that?

The streamstats command was the best option, specifically | streamstats window=6. Our timechart used 5-minute intervals, so a window of six spans let us look 30 minutes into the future. We created the new field ServiceHealthScoreFromFuture by reversing the event order with the reverse command, taking the first value in each window, and then reversing again, which copies each row's future health score back onto the current row:

\code
index=itsi_summary
| join kpiid
    [| inputlookup service_kpi_lookup
     | rename _key as serviceid title as service_name
     | eval kpi_info = mvzip('kpis._key', 'kpis.title', "==@@==")
     | fields kpi_info service_name serviceid
     | mvexpand kpi_info
     | rex field=kpi_info "(?<kpiid>.+)==@@==(?<kpi_name>.+)"
     | fields - kpi_info]
| search service_name="Web Store Service"
| timechart span=5m max(alert_value) AS max_value min(alert_value) AS min_value avg(alert_value) AS avg_value median(alert_value) AS mean_value BY kpi_name
| eval this_date_hour = strftime(_time, "%H")
| eval this_date_day = strftime(_time, "%w")
| eval this_date_day = this_date_day."_"
| eval this_date_hour = this_date_hour."_"
| reverse
| streamstats window=6 current=f first("max_value: ServiceHealthScore") as ServiceHealthScoreFromFuture
| reverse
\code
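To convince yourself that the reverse/streamstats/reverse pattern really does copy future values onto current rows, here is a synthetic sketch you can run anywhere, using a window of 2 instead of 6 and a simple counter instead of a health score:

\code
| makeresults count=6
| streamstats count AS step
| eval value = step * 10
| reverse
| streamstats window=2 current=f first(value) AS value_from_future
| reverse
| table step value value_from_future
\code

Each row's value_from_future holds the value from two steps ahead (the final rows see partial or empty windows because no future data exists for them), which is exactly what our window=6 search does 30 minutes ahead.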

Now that our data set was ready, we leveraged the Splunk Machine Learning Toolkit to configure the machine learning algorithm with a few simple clicks. We chose the MLTK assistant Predict Numeric Fields and the LinearRegression algorithm.

The results looked great, but we needed to normalize the data, so we used the MLTK preprocessing steps.

We applied the StandardScaler preprocessing step, which rescales each numeric field by its mean and standard deviation and prefixes the output fields with SS_. We then selected the field we wanted to predict (ServiceHealthScoreFromFuture), used the normalized fields (those starting with SS_, plus _time, this_date_day, and this_date_hour) as predictors, and named the model my_ITSI_WebService_Test.
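For reference, the assistant writes out SPL roughly like the sketch below. This is our hand-written approximation, not the assistant's exact output: we abbreviated the StandardScaler field list to two fields for readability, whereas the assistant enumerates every field you select.

\code
index=itsi_summary
| join kpiid
    [| inputlookup service_kpi_lookup
     | rename _key as serviceid title as service_name
     | eval kpi_info = mvzip('kpis._key', 'kpis.title', "==@@==")
     | fields kpi_info service_name serviceid
     | mvexpand kpi_info
     | rex field=kpi_info "(?<kpiid>.+)==@@==(?<kpi_name>.+)"
     | fields - kpi_info]
| search service_name="Web Store Service"
| timechart span=5m max(alert_value) AS max_value min(alert_value) AS min_value avg(alert_value) AS avg_value median(alert_value) AS mean_value BY kpi_name
| eval this_date_hour = strftime(_time, "%H")."_"
| eval this_date_day = strftime(_time, "%w")."_"
| reverse
| streamstats window=6 current=f first("max_value: ServiceHealthScore") as ServiceHealthScoreFromFuture
| reverse
| fit StandardScaler "max_value: ServiceHealthScore" "avg_value: ServiceHealthScore" into my_ITSI_WebService_Test_StandardScaler_0
| fit LinearRegression ServiceHealthScoreFromFuture from SS_* this_date_day this_date_hour into my_ITSI_WebService_Test
\code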


After fitting our model, the results appeared comprehensive:

1. The prediction was in line with the actual data.
2. The residuals histogram looked relatively consistent on both sides of the split. There were some values that we needed to investigate to confirm that they were genuinely anomalous.
3. The R² statistic was very high. R² is the explained variation divided by the total variation for the fitted model: a fraction between zero and one that measures how well the model accounts for the variation in the data. The closer the fraction is to 1, the better the model fits the data.

Next, we used the alert function in MLTK to schedule an alert for when the predicted value of ServiceHealthScoreFromFuture drops below 70. We used 70 as an example threshold in this scenario: it alerts us when the predicted Web Store Service health score is 10 points outside our known-good, performant range of 80-100.

We scheduled our alert to run every 30 minutes; however, you can run it every 5 minutes or on whatever interval makes sense for your needs. Use the following search to run an alert on different time intervals against the applied model:

\code
index=itsi_summary
| join kpiid
    [| inputlookup service_kpi_lookup
     | rename _key as serviceid title as service_name
     | eval kpi_info = mvzip('kpis._key', 'kpis.title', "==@@==")
     | fields kpi_info service_name serviceid
     | mvexpand kpi_info
     | rex field=kpi_info "(?<kpiid>.+)==@@==(?<kpi_name>.+)"
     | fields - kpi_info]
| search service_name="Web Store Service"
| timechart span=5m max(alert_value) AS max_value min(alert_value) AS min_value avg(alert_value) AS avg_value median(alert_value) AS mean_value BY kpi_name
| eval this_date_hour = strftime(_time, "%H")
| eval this_date_day = strftime(_time, "%w")
| eval this_date_day = this_date_day."_"
| eval this_date_hour = this_date_hour."_"
| reverse
| streamstats window=6 current=f first("max_value: ServiceHealthScore") as ServiceHealthScoreFromFuture
| reverse
| apply my_ITSI_WebService_Test_StandardScaler_0
| apply my_ITSI_WebService_Test
| eval residual = 'ServiceHealthScoreFromFuture' - 'predicted(ServiceHealthScoreFromFuture)'
| table ServiceHealthScoreFromFuture "predicted(ServiceHealthScoreFromFuture)" residual _time
    "SS_avg_value: 4xx Errors" "SS_avg_value: 5xx Errors" "SS_avg_value: Application Response Times"
    "SS_avg_value: CPU Load: %" "SS_avg_value: Corporate Website Requests" "SS_avg_value: End User Response Times"
    "SS_avg_value: ServiceHealthScore" "SS_avg_value: Web Store Checkout Page Event Status"
    "SS_avg_value: Web Store Event Count" "SS_avg_value: Web Store Login Page Event Status"
    "SS_max_value: 4xx Errors" "SS_max_value: 5xx Errors" "SS_max_value: Application Response Times"
    "SS_max_value: CPU Load: %" "SS_max_value: Corporate Website Requests" "SS_max_value: End User Response Times"
    "SS_max_value: ServiceHealthScore" "SS_max_value: Web Store Checkout Page Event Status"
    "SS_max_value: Web Store Event Count" "SS_max_value: Web Store Login Page Event Status"
    "SS_mean_value: 4xx Errors" "SS_mean_value: 5xx Errors" "SS_mean_value: Application Response Times"
    "SS_mean_value: CPU Load: %" "SS_mean_value: Corporate Website Requests" "SS_mean_value: End User Response Times"
    "SS_mean_value: ServiceHealthScore" "SS_mean_value: Web Store Checkout Page Event Status"
    "SS_mean_value: Web Store Event Count" "SS_mean_value: Web Store Login Page Event Status"
    "SS_min_value: 4xx Errors" "SS_min_value: 5xx Errors" "SS_min_value: Application Response Times"
    "SS_min_value: CPU Load: %" "SS_min_value: Corporate Website Requests" "SS_min_value: End User Response Times"
    "SS_min_value: ServiceHealthScore" "SS_min_value: Web Store Checkout Page Event Status"
    "SS_min_value: Web Store Event Count" "SS_min_value: Web Store Login Page Event Status"
    this_date_day this_date_hour
| where 'predicted(ServiceHealthScoreFromFuture)' < 70
\code
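If you manage alerts as configuration rather than through the UI, the same schedule can be expressed in savedsearches.conf. Here is a minimal sketch; the stanza name and the four-hour dispatch window are our own choices, and the elided search stands in for the full search above:

\code
[Predict Web Store Service HealthScore]
# Run every 30 minutes over the last four hours of summary data
enableSched = 1
cron_schedule = */30 * * * *
dispatch.earliest_time = -4h@m
dispatch.latest_time = now
# Trigger whenever the search returns at least one row
counttype = number of events
relation = greater than
quantity = 0
search = index=itsi_summary ... | where 'predicted(ServiceHealthScoreFromFuture)' < 70
\code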

You can even add this alert to ITSI Notable Events and ITSI Analytics to notify your team of leading indicators that could impact critical services such as the Web Store Service.

The customer was amazed that we could leverage the KPIs in ITSI to create a predictive model in MLTK, and then use the dual alerting of MLTK and the ITSI Notable Event framework to give their team leading indicators of possible service degradation.

We hope this story helps you incorporate new methods into your own IT Operations Analytics journey.

Happy Splunking!

Posted by Nate Smalley