At .conf19, we released the fifth major version of the Splunk Machine Learning Toolkit. This release was all about improving and enhancing the toolkit's ability to provide insights into your data, and includes a brand-new outlier detection assistant, an update to our Machine Learning examples showcase page, an upgrade from Python 2.x to Python 3.x, and a new System Identification algorithm.
New Content Highlights
Smart Outlier Detection Assistant
Outlier detection is by far the most popular use case in the industry. We constantly seek ways to offer a simple yet rich and accurate way of helping you find outliers in your data, evaluate them, and deploy the detection in your Splunk environment.
At .conf, we introduced a brand new assistant, the Smart Outlier Detection Assistant. It is not only smart, making no assumptions about your data's statistical characteristics, but also charming, with a new set of custom visualizations.
Goodbye Python 2.7, Hello Python 3.7!
With Python 2.7 reaching its end of life, Splunk 8.0 is migrating to Python 3.7, and so is the Splunk Machine Learning Toolkit. Below, we cover a few things to consider to make this transition as smooth as possible.
Revamped MLTK Showcase UI
MLTK 5.0 is now easier to navigate with a new, modern showcase layout. You can easily filter the examples by industry and machine learning operation, or drill down to the examples for your specific use case.
System Identification Algorithm
Of course, we have a new algorithm for you to sink your teeth into in this major release. Sooner or later, we all notice how everything is dynamic and how the past stays influential in the present. Here's a neural network-based algorithm that'll make your life easier.
A Tour of the New Smart Outlier Detection Assistant
The Smart Outlier Detection Assistant will help you find numerical anomalies in your data with more accuracy than ever before. The assistant uses our popular DensityFunction algorithm in the background and crafts SPL for you behind the scenes as you work with it. Custom visualizations and review steps give you more direct insight into your data and its meaningful anomalies.
This is our second assistant in the Smart Assistant series. Let's walk through how this new layout builds an ML pipeline for you to detect outliers in your data.
In the first stage, we select our data using either the Search or the Dataset option. For our example, we are loading the supermarket.csv dataset, which comprises 6 fields.
In the Learn stage, we set the parameters for the DensityFunction algorithm and run it, as you can see in the screenshot below:
- We are analyzing the quantity field from the supermarket dataset.
- We also want to analyze this field by shop_id. It's important to do that because we know shops are not all equal.
- Next, we leave Distribution Type set to "Auto" to have the algorithm automatically find the best distribution type and model for our data. If you know the distribution type of your data, this is where you can choose it manually.
- Lastly, Outlier Tolerance Threshold defines how many outliers you want to find in your metric: increase the threshold to get more outliers. Because the values selected for the threshold are usually smaller than 0.01, if not much smaller, we designed the slider logarithmically so that choosing very small values is easy.
After clicking Detect Outliers, we get the results on the right-hand side with the total number of outliers and a customizable visualization showing the top three distributions with outliers.
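Behind the scenes, the assistant crafts SPL along these lines. This is a hedged sketch of the kind of search it generates; the model name and threshold value here are illustrative, not taken from the assistant:

```
| inputlookup supermarket.csv
| fit DensityFunction quantity by "shop_id" dist=auto threshold=0.005 into "app:supermarket_outliers"
```

The `by` clause builds one density model per shop_id segment, `dist=auto` mirrors the "Auto" Distribution Type setting, and `threshold` corresponds to the Outlier Tolerance Threshold slider.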
Once we have our results, we can probe further into them for relevant insights using the review panel below:
- The Model Summary tab shows us the summary of the model for each segment (shop_id), created in the previous stage.
- The Cardinality Histogram gives us a holistic view of the number of data points per segment.
- The Distribution Properties tab provides histograms of the mean and standard deviation of the segments that share the same distribution type.
- Lastly, the Outlier Analysis tab shows us the total number of outliers for quantity across different shops. Based on the screenshot below, it looks like shops s1 and s2 are experiencing a greater than usual number of outliers.
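Outside the assistant, you can inspect the same per-segment model details with the MLTK `summary` command. A minimal sketch, assuming the illustrative model name from earlier:

```
| summary "app:supermarket_outliers"
```

For a DensityFunction model, this returns one row per segment with details such as the fitted distribution type and its parameters, which is what the Model Summary tab surfaces in the UI.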
Finally, we have the operationalization stage, where we can set alerts, schedule ML fit and apply jobs, or create dashboards out of the results. Click Done to move to the Experiments listing page.
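Operationalizing typically means scheduling a search that applies the saved model to fresh data and alerts when outliers appear. A hedged sketch, reusing the illustrative model and field names from the example above; the exact name of the output field produced by `apply` may differ in your version, so check the results before alerting on it:

```
| inputlookup supermarket.csv
| apply "app:supermarket_outliers"
| where 'IsOutlier(quantity)'=1
```

Saving a search like this with a cron schedule and an alert action on non-zero results is the pattern the assistant automates for you in this stage.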
The latest releases of the Machine Learning Toolkit and Python for Scientific Computing (PSC) apps have been migrated to Python 3. The upgraded PSC app also comes with a new feature that lets you add your favorite Python 3 libraries to PSC 2.0 with little effort, so you can customize your algorithms down to the library level.
Keep in mind the following important items before upgrading to the new version of MLTK and PSC:
- MLTK 5.0 is only available on Splunk 8.0.
- MLTK 5.0 depends on PSC 2.0.
- Upgrading to Splunk 8.0 and MLTK 5.0 requires retraining of old models.
- The file extension for model files has been changed from .csv to .mlmodel.
- Users on 7.x can still use 4.4.1 for their existing use cases.
MLTK Showcase Refresh
We are excited to announce the new MLTK 5.0 showcase. It is now easier to navigate with a modern and intuitive UI. The new showcase page gives you two new filters, ML Operation and Industry, with featured examples for each, so you can quickly drill down to the examples for your specific use case. We labeled each of the examples to save you time in identifying its purpose, and you can now also search the examples with plain text.
System Identification Algorithm
Do you ever come across a use case where you need to predict a target metric using other metrics in your data, but the target does not change simultaneously with the other metrics; it responds after a delay? The new System Identification algorithm helps in exactly this situation, allowing you to predict a target metric using its own past values along with the current and past values of the other metrics provided as predictors. This algorithm uses neural networks under the hood.
For example, imagine you are trying to predict the "output drum pressure" metric for a steam generator using the "input air," "input disturbance," "input fuel" and "input level ref" variables. In the screenshot below, let's see how this works if we try using Linear Regression to predict the "output drum pressure".
Although the green line shows that the output of the linear regression model can follow the general shape of the "output drum pressure," its prediction is not ideal. The reason a regression algorithm doesn't fit a good model here lies in the nature of the use case: we are predicting a numeric metric, but past values of the target and predictor metrics influence the current value of the output.
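For reference, the linear regression attempt above corresponds to SPL along these lines; the dataset lookup name is illustrative, and the quoted field names assume the metrics are stored with spaces as shown:

```
| inputlookup steam_generator.csv
| fit LinearRegression "output drum pressure" from "input air" "input disturbance" "input fuel" "input level ref"
```

Because LinearRegression only sees each row in isolation, it has no way to account for the delayed influence of past values, which is exactly the gap System Identification fills.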
In comparison, the screenshot below shows what happens if we use the new System Identification algorithm for the same prediction. With System Identification we can easily see an improvement in accuracy. The algorithm predicts the "output drum pressure" metric using its past values along with the current and past values of the other four variables, which can be seen in the screenshot as dynamic=10-10-10-10-10. Here, 10 is the number of past time slots the algorithm considers for each variable.
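The equivalent search looks roughly like the following. Treat the exact syntax as an assumption: the parameter spelling is taken from the screenshot, the lookup and model names are illustrative, and you should confirm the argument format against the MLTK documentation for your version:

```
| inputlookup steam_generator.csv
| fit SystemIdentification "output drum pressure" from "input air" "input disturbance" "input fuel" "input level ref" dynamic=10-10-10-10-10 into "app:steam_generator_model"
```

Each number in the dash-separated list sets how many past time slots to consider for the corresponding variable, so you can give slow-reacting inputs a longer memory than fast ones.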
We are thrilled to share with you these new improvements and as always keep an eye out for more coming soon. Read on below for additional resources on using and getting up to speed on the Splunk Machine Learning Toolkit.
For an in-depth look at how to use the MLTK, check out these webinars:
- Getting Started with Machine Learning
- Splunk's Machine Learning Toolkit: Technical Deep Dive and Demo Part 1
- Splunk's Machine Learning Toolkit: Technical Deep Dive and Demo Part 2
- Machine Learning in Action: Stop IT Events Before They Become Outages
Interested in trying out the Machine Learning Toolkit at your organization? Splunk offers FREE data science resources to help you get it up and running. Learn more about the Machine Learning Customer Advisory Program.
Catch up on all the conversations from #splunkconf19!