Cyclical Statistical Forecasts and Anomalies – Part 4

Remember when you wanted great alerts, so you read our previous blogs about cyclical statistical forecasts and anomalies?

Hopefully, the techniques in those blogs gave you some great results. Here we’re going to show you another way of finding anomalies in your data using a slightly different technique.

Hang on, why do I need another technique?

The previous blog posts (part 1, part 2 and part 3) presented a method that didn’t actually need the Machine Learning Toolkit (MLTK) – it was all core SPL. More importantly, however, those techniques had a few drawbacks:

  • They need to run on all of your data all of the time, or you have to force the persistence of a ‘model’ (as showcased in the blogs).
  • They struggle with data sources that have multiple variables.
  • They can be sensitive to local variance.

So the great folks in our MLTK product team decided that what we needed (drum roll) was the Probability Density Function!

The Probability Density Function is essentially a function that determines the probability of a value being in a certain range based on past information. In other words, it generates a baseline for your data. This makes it awesome for finding anomalies as it allows you to quickly determine if data sits in an expected range or not. Don’t worry too much about the mathematics behind the Probability Density Function (though if you want to know more you can read an introduction to the technique here).
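To make the idea of a baseline concrete, here is a minimal Python sketch rather than anything the MLTK actually runs. It assumes the past data is roughly normally distributed, and the `history` values, the function names and the 0.01 default are all made up for illustration.

```python
from statistics import NormalDist

# Hypothetical past observations (e.g. logons per 15 minutes); not from the blog's dataset.
history = [98, 102, 101, 97, 103, 99, 100, 104, 96, 100]

# Assumption: the data is roughly normal, so a fitted normal distribution acts as the baseline.
baseline = NormalDist.from_samples(history)


def probability_in_range(low, high):
    """Probability that a new value falls between low and high under the baseline."""
    return baseline.cdf(high) - baseline.cdf(low)


def is_unlikely(value, threshold=0.01):
    """Flag a value whose two-sided tail probability is below the threshold."""
    tail = 2 * min(baseline.cdf(value), 1 - baseline.cdf(value))
    return tail < threshold
```

With this baseline, a value of 100 sits right in the expected range, while a value of 120 is many standard deviations out and would be flagged.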

Now it’s time to go and download the MLTK and get started finding anomalies in data.

So how do I use it?

Once you’ve downloaded and installed the MLTK, let’s grab some data. In this blog we’ve selected a dataset that ships with the MLTK:

| inputlookup cyclical_business_process.csv

As the name of the lookup suggests, the data is nicely cyclical and therefore well suited to anomaly detection.

[Figure: cyclical data]

Now that we’ve found some suitable data, we’re going to train a model using the fit command from the MLTK and the DensityFunction algorithm. Note that we could use the MLTK’s Smart Outlier Detection Assistant, but here we’re going to take you through how to do this using SPL.

As we are dealing with cyclical data, we are going to train our model on the logons variable and split it by the hour of day and the day of week so that daily and hourly variance is accounted for. Once we have defined the hour of day and day of week variables, we are going to train a density function model and save it under the name df_cyclical_business_processes so that it can be referred to in future searches. The Splunk search for this is below.

| inputlookup cyclical_business_process.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval DayOfWeek=strftime(_time, "%A")
| fit DensityFunction logons by "HourOfDay,DayOfWeek" into df_cyclical_business_processes

This fit command will also spit out a few extra values: IsOutlier(logons) and BoundaryRanges. Don’t worry too much about BoundaryRanges (that’s to do with the ranges used to find outliers), but the IsOutlier value will tell you if the data is an outlier or not.

[Figure: cyclical data set]

Now it just looks like we have a load of outliers? Don’t worry, we’ll show you how to tune the volume of outliers, but before we look at refining the number of outliers let’s have a quick look at what the model has produced. 

| summary df_cyclical_business_processes

[Figure: cyclical business processes]

This summary can be really useful for putting the outliers in context, and we’ll show you how to do this after the next search. For the time being note that for each day and hour span we have a mean, standard deviation (std) and also cardinality. Mean and standard deviation tell us a little bit about the distribution of the data at that time of day, and the cardinality tells us how many data points we used to train that bit of the model – the higher the value the better the model is likely to be.
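If it helps to see where numbers like these come from, the following Python sketch computes a mean, standard deviation and cardinality for each hour-and-day group. It is only an illustration: the `rows` data and the `summarise` helper are hypothetical, and the use of the sample standard deviation is an assumption, so the MLTK's own figures may be calculated differently.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev

# Hypothetical (timestamp, logons) rows; the real lookup has many more.
rows = [
    ("2024-01-01T09:00:00", 120),
    ("2024-01-01T09:15:00", 130),
    ("2024-01-08T09:00:00", 125),
    ("2024-01-08T09:30:00", 135),
    ("2024-01-02T09:00:00", 80),
    ("2024-01-09T09:00:00", 90),
]


def summarise(rows):
    """Group logons by (HourOfDay, DayOfWeek) and describe each group."""
    groups = defaultdict(list)
    for stamp, logons in rows:
        t = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
        groups[(t.strftime("%H"), t.strftime("%A"))].append(logons)
    return {
        key: {
            "cardinality": len(values),
            "mean": mean(values),
            # Sample standard deviation needs at least two points.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
        for key, values in groups.items()
    }


summary = summarise(rows)
```

Here the Monday 09:00 group is built from four points and the Tuesday 09:00 group from only two, which is exactly the kind of difference the cardinality column lets you spot.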

So now that we’ve trained our model (and hopefully understood it a little bit) let’s apply it to some data.

| inputlookup cyclical_business_process.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval DayOfWeek=strftime(_time, "%A")
| apply df_cyclical_business_processes threshold=0.001
| table _time logons IsOutlier(logons)

[Figure: cyclical data]

Note that we have used the threshold argument here and have significantly fewer outliers than before. That is because the threshold tells Splunk how much of the distribution to treat as outlying – 0.001 means look for anything in the least likely 0.1% of the data.
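As a rough illustration of why a smaller threshold means fewer outliers, the Python sketch below turns a threshold into a boundary range. It assumes a normal distribution with the threshold area split evenly between the two tails, which is a simplification; the function names and the mean of 100 and standard deviation of 10 are made up, and the MLTK's real boundary calculation may differ.

```python
from statistics import NormalDist


def boundary_range(mean, std, threshold):
    """Return (lower, upper) bounds that leave `threshold` total area in the tails."""
    dist = NormalDist(mean, std)
    lower = dist.inv_cdf(threshold / 2)
    upper = dist.inv_cdf(1 - threshold / 2)
    return lower, upper


def is_outlier(value, mean, std, threshold):
    """A value is an outlier when it falls outside the boundary range."""
    lower, upper = boundary_range(mean, std, threshold)
    return value < lower or value > upper
```

Under these assumptions, a value of 130 against a mean of 100 and standard deviation of 10 is an outlier at a threshold of 0.01 but not at 0.001, because the tighter threshold pushes the boundaries further from the mean.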

Finally, let’s bring it all together and use the model summary to enrich the results with added context using the search below.

| inputlookup cyclical_business_process.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval DayOfWeek=strftime(_time, "%A")
| apply df_cyclical_business_processes threshold=0.001 show_density=true
| where 'IsOutlier(logons)'>0
| eval HourAndDay=HourOfDay." ".DayOfWeek
| join HourAndDay [| summary df_cyclical_business_processes | eval HourAndDay=HourOfDay." ".DayOfWeek | table HourAndDay cardinality mean std]
| table _time logons ProbabilityDensity(logons) cardinality mean std
| eval distance_from_mean=abs(logons-mean), deviations_from_mean=abs(logons-mean)/std

[Figure: MLTK logons]

With this search we have filtered our results to only show the outliers and added in the mean, standard deviation and cardinality data from our model summary. Additionally, we have calculated how far each number of logons is from the mean and how many standard deviations that distance represents – this tells us a bit more about how extreme the outlier is. We’ve also used the show_density=true option to add in the probability density associated with the logons value. All of this data can be used to perform additional filtering or tuning so that you are only alerting on the most extreme outliers.
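One way that extra filtering might look is sketched below in Python. The records, the `enrich` helper and the two cut-offs (`MIN_DEVIATIONS` and `MIN_CARDINALITY`) are hypothetical values chosen for illustration, not recommendations.

```python
def enrich(outlier):
    """Add distance and deviation fields to one outlier record (a dict)."""
    distance = abs(outlier["logons"] - outlier["mean"])
    return {
        **outlier,
        "distance_from_mean": distance,
        "deviations_from_mean": distance / outlier["std"],
    }


# Hypothetical outlier rows, already joined with the model summary.
outliers = [
    {"logons": 950, "mean": 600.0, "std": 50.0, "cardinality": 120},
    {"logons": 430, "mean": 600.0, "std": 50.0, "cardinality": 120},
    {"logons": 40, "mean": 10.0, "std": 5.0, "cardinality": 4},
]

# Assumed cut-offs: only alert on extreme values from well-trained groups.
MIN_DEVIATIONS = 4
MIN_CARDINALITY = 30

alerts = [
    row
    for row in map(enrich, outliers)
    if row["deviations_from_mean"] >= MIN_DEVIATIONS
    and row["cardinality"] >= MIN_CARDINALITY
]
```

Only the first record survives: the second is not extreme enough, and the third comes from a group trained on too few data points to be trusted.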

What if I have multiple variables?

Now that you’ve seen how to find anomalies on a single variable, it’s fairly easy to find them in datasets that have multiple variables – all you need to do is fit and apply more DensityFunctions!

This time we’re going to pick a different dataset from the MLTK that has a few variables: app_usage.csv. Because this dataset isn’t as densely populated as the cyclical_business_process.csv data, we’re going to break out our anomaly detection by weekdays and weekends rather than hour of day.

| inputlookup app_usage.csv
| eval _time=strptime(_time, "%Y-%m-%d")
| eval DayOfWeek=strftime(_time, "%A")
| eval Weekend=if(DayOfWeek="Saturday" OR DayOfWeek="Sunday","Yes","No")
| fit DensityFunction ERP by "Weekend" into df_erp as outlier_erp
| fit DensityFunction Expenses by "Weekend" into df_expenses as outlier_expenses
| fit DensityFunction HR1 by "Weekend" into df_hr1 as outlier_hr1
| fit DensityFunction HR2 by "Weekend" into df_hr2 as outlier_hr2
| fit DensityFunction ITOps by "Weekend" into df_itops as outlier_ITOps
| fit DensityFunction OTHER by "Weekend" into df_other as outlier_other
| fit DensityFunction Recruiting by "Weekend" into df_recruiting as outlier_recruiting
| fit DensityFunction RemoteAccess by "Weekend" into df_remote_access as outlier_remote_access
| fit DensityFunction Webmail by "Weekend" into df_webmail as outlier_webmail
| eval anomaly_score=0
| foreach outlier_* [eval anomaly_score=anomaly_score+<<FIELD>>]

[Figure: Cyclical density function]

You can see that we have applied a DensityFunction algorithm to the app use metrics for a range of different applications and are then summing the outlier values for each. Alternatively, you could enrich your results and generate an anomaly score from another combination of metrics, such as probability density, distance from mean or deviations from the mean. 
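The summing step is simple enough to sketch in a few lines of Python. This mirrors the idea of adding up every outlier_* field, but the `row` values and the `anomaly_score` helper are hypothetical and only there to show the arithmetic.

```python
def anomaly_score(row):
    """Sum every field whose name starts with 'outlier_' (each flag is 0 or 1)."""
    return sum(value for key, value in row.items() if key.startswith("outlier_"))


# Hypothetical result row: two of the three apps shown were flagged.
row = {"ERP": 412, "outlier_erp": 1, "outlier_expenses": 0, "outlier_webmail": 1}

score = anomaly_score(row)
```

A score of 2 here means two applications looked anomalous at the same time, which is usually more interesting than a single flag on its own.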

If you’ve been following along with the blog, you may notice that your results differ a little from the chart here, so a challenge task is to modify the thresholds of your searches to see if you can replicate the chart. As a clue, check out the apps that have the highest variance and consider increasing the threshold to find anomalies against the number of logons. Finally, as another challenge: do you think it’s possible to find all the anomalies using a single fit command? Hint: the untable command might be useful for this…

Hopefully, now you’ve got a good idea of how to use the DensityFunction algorithm and can get started hunting down anomalies in your own data.

Until next time and happy Splunking,

Greg

Greg is a Machine Learning Architect at Splunk where he helps customers deliver advanced analytics and uncover new ways of insight from their data. Prior to working at Splunk he spent a number of years with Deloitte and before that BAE Systems Detica working as a data scientist. Before getting a proper job he spent way too long at university collecting degrees in maths including a PhD on “Mathematical Analysis of PWM Processes”. When he is not at work he is usually herding his three young lads around while thinking that work is significantly more relaxing than being at home…