Monitoring Model Drift in ITSI

I’m sure many of you will have tried out the predictive features in ITSI, and you may even have a model or two running in production to predict potential outages before they occur. While we present a lot of useful metrics about the models’ performance at the time of training, how can you make sure that they are still generating accurate predictions?

It is natural for models to become less accurate as the underlying data or systems change over time. We usually recommend that you retrain the predictive models in ITSI on a regular basis to make sure they are current.

In this blog we will talk about some strategies for monitoring your models in ITSI for model drift. This is the idea that the predictive models will become less accurate over time as the rules that were generated originally no longer match the data they are applied to.

Identifying Your Models

ITSI has a number of commands and macros to help you keep on top of your predictive models. The first port of call for identifying the services that have predictive enabled is the | getservice command as shown below:

| getservice 
| search algorithms=*itsi_predict_*
| table serviceid identifying_name algorithms

Here we are filtering on the results to return only the services that have active predictive models. Note that the predictive models are stored in the lookups folder of ITSI, so you may also need to check that they are still there…

Now because the algorithms field is pretty dense I thought I would break out some of the key bits of information below:

Note that kpiModelsCreatedAt is in epoch time including the milliseconds.
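
Because the timestamp includes milliseconds, it needs dividing by 1,000 before Splunk's time functions can read it. As a minimal sketch, the search below uses a made-up value for kpiModelsCreatedAt (not one taken from a real model) to show the conversion and a rough model age in days:

```spl
| makeresults
| eval kpiModelsCreatedAt=1609459200000
| eval created_at=strftime(kpiModelsCreatedAt/1000, "%Y-%m-%d %H:%M:%S")
| eval model_age_days=round((now()-kpiModelsCreatedAt/1000)/86400, 1)
| table kpiModelsCreatedAt created_at model_age_days
```

In practice you would replace the hard-coded value with the one extracted from the algorithms field for your own service.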

Using the model name from the JSON array will help us find the actual predictive models that ITSI is using under the hood! ITSI trains a host of models when you save a predictive model for any given service:

  • A standard scaler model to ensure that all of your KPIs are counted equally
  • A predictive model for the average future health score
  • And a predictive model for the worst future health score

All three models are an extended version of the model ID from the JSON: for example, the average health score predictor is the model ID appended with _avg. Each model can be examined using the | summary command from the MLTK as below.
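
As a sketch, with <model_id> standing in for the model name taken from your own algorithms field (it is a placeholder, not a real model name), the average health score predictor could be inspected like this:

```spl
| summary <model_id>_avg
```

Depending on the app context you are searching from, you may need to qualify the model name with the app it lives in, so treat the exact name as something to confirm in your own environment.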

Another important point, especially because we have scaled the input KPIs with a standard scaler, is how to read the model summary. For each of the gradient boosting, random forest and linear regression algorithms, the larger the coefficient or importance reported by the model, the greater the impact of that feature on the model’s prediction (for linear regression it is the magnitude of the coefficient that matters, so a large negative value also counts). In other words, if a KPI has a big value in the model summary then it has a big impact on the future health score.

Show Me The Predictions!

Now that we have found our ITSI models it’s time to check if they are still operating well. Thankfully this is all pretty easy using the apply_model macro that ships with ITSI. Using the service ID and the model ID we can run this macro against our data in ITSI to generate some predictions as below:

| table _time next30m_avg_hs predicted(next30m_avg_hs)

This returns our actual values and the predictions to make sure everything is working as hoped:

Validate The Model

Next up we could either eyeball the data that gets returned to us to figure out if the model is still accurate, or instead we can calculate some statistics that should quantify the accuracy for us.

First up we’ll have a look at the R squared statistic, which is essentially a measure of accuracy. You can roughly read this measure as a percentage, with 1 being a perfect fit (negative values mean truly awful predictions!). Calculating this value can be done with the score command from the MLTK as below:

| table _time next30m_avg_hs predicted(next30m_avg_hs)
| score r2_score next30m_avg_hs against predicted(next30m_avg_hs)
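
For reference, R squared has a standard definition, written here with $y_i$ for the actual values of next30m_avg_hs, $\hat{y}_i$ for the predictions and $\bar{y}$ for the mean of the actual values (the symbols are labels chosen for this note, not MLTK field names):

```latex
R^2 = 1 - \frac{\sum_{i} \left( y_i - \hat{y}_i \right)^2}{\sum_{i} \left( y_i - \bar{y} \right)^2}
```

A model that always predicted the mean would score 0, which is why a negative value means the predictions are doing worse than a flat line through the data.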

In our case here we are still hitting pretty good accuracy, which is probably down to my test instance being populated by cyclical dummy data feeds… You could at this point set up an alert against this statistic to notify someone if the accuracy drops below a certain value, maybe suggesting that they go and re-train the predictive model. For me a good rule of thumb is that an accuracy above 0.7 is good enough to run in production, but this very much depends on the importance of the service and the risk appetite for poor predictions.
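
A minimal sketch of such an alert search is below. It assumes the score command returns its result in a field called r2_score (check the field name in your own output) and takes 0.7 purely from the rule of thumb above. As with the other searches here, it expects the output of the apply_model macro as its input:

```spl
| table _time next30m_avg_hs predicted(next30m_avg_hs)
| score r2_score next30m_avg_hs against predicted(next30m_avg_hs)
| where r2_score < 0.7
```

Saved as an alert that triggers when the number of results is greater than zero, this would only fire once the accuracy has dropped below the threshold.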

Another useful view on the predictions is to see how far out they are compared to the actual values we get in the data. Borrowing from the cyclical statistical forecasts and anomalies blog series, we can use some simple statistics to see if any predictions are unusually far out compared to what we have seen historically:

| table _time next30m_avg_hs predicted(next30m_avg_hs)
| eval residual=next30m_avg_hs-'predicted(next30m_avg_hs)'
| table _time residual
| eventstats avg(residual) as avg stdev(residual) as stdev
| eval lower_bound=avg-3*stdev, upper_bound=avg+3*stdev
| table _time residual lower_bound upper_bound

This will help you identify the points where your predictive model was less accurate than expected, either missing a service degradation or predicting degradation when in fact none occurred.
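
To pull out just those points, one option is to keep only the rows that fall outside the bounds. This is a sketch that extends the search above: the direction field is a label added here for illustration, and the search again expects the output of the apply_model macro as its input:

```spl
| table _time next30m_avg_hs predicted(next30m_avg_hs)
| eval residual=next30m_avg_hs-'predicted(next30m_avg_hs)'
| eventstats avg(residual) as avg stdev(residual) as stdev
| eval lower_bound=avg-3*stdev, upper_bound=avg+3*stdev
| where residual<lower_bound OR residual>upper_bound
| eval direction=if(residual<lower_bound, "predicted higher than actual", "predicted lower than actual")
| table _time residual direction
```

Because the residual is the actual value minus the prediction, a row marked "predicted higher than actual" is one where the model was too optimistic about the health score, which is the case most likely to correspond to a missed degradation.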

Bringing It All Together

Now that we’ve explored some of the ways you can check on the accuracy of your production models in ITSI it should be fairly straightforward to put your searches into a dashboard to display some current metrics about the models’ accuracy. 

Although we have focussed entirely on ITSI in this blog these techniques could easily be applied to any model in Splunk that uses supervised learning. I’d encourage you to read more here about some other approaches to monitoring model drift in Splunk or to check out some of our other content about using machine learning to augment your ITSI instance.

Happy Splunking!

Greg is a recovering mathematician and part of the technical advisory team at Splunk, specialising in how to get value from machine learning and advanced analytics. Previously the product manager for Splunk’s Machine Learning Toolkit (MLTK), he helped set the strategy for machine learning in the core Splunk platform. A particular career highlight was partnering with the World Economic Forum to provide subject matter expertise on the AI Procurement in a Box project.

Before working at Splunk he spent a number of years with Deloitte and prior to that BAE Systems Detica working as a data scientist. Ahead of getting a proper job he spent way too long at university collecting degrees in maths including a PhD on “Mathematical Analysis of PWM Processes”.

When he is not at work he is usually herding his three young lads around while thinking that work is significantly more relaxing than being at home…
