
What’s New in the Splunk Machine Learning Toolkit 4.0 (Part 1)

It’s .conf18 season and we're launching the Splunk Machine Learning Toolkit (MLTK) Version 4.0 with lots of new updates. This release is focused on three core tenets:

  1. A first-class machine learning experience for Splunk customers; 
  2. An open platform for sharing algorithms; and
  3. An extensible platform that integrates with popular machine learning runtimes and libraries.

The release includes new features for managing and running experiments, new algorithms from updated libraries, an open source community, new add-ons for 3rd party solutions, and more!

Let’s have a quick look at these new features:

  • Showcase Refresh: New Examples for MLTK
  • Experiment Management Framework: History
  • New ML-SPL Command: Score
  • New Cross-validation Option: K-fold
  • New Algorithms: LocalOutlierFactor and MLPClassifier with partial fit
  • New User Interface for MLTK Configuration
  • Splunk’s Open Source Community for MLTK Algorithms on GitHub

We have also launched new add-ons for MLTK 4.0:

  • Splunk MLTK Connector for Apache Spark™ (Limited Availability Release)
  • Splunk MLTK Container for TensorFlow™ (via Splunk’s Professional Services)

Let’s go through these updates in more detail. We also have all the updates from this 4.0 release summarized in a quick video, "What's New in Splunk Machine Learning Toolkit Version 4.0."

Showcase, Refreshed

Today, the MLTK Showcase has more than 30 demos and video tutorials for users to quickly find examples of common machine learning use cases in Splunk, and we just added more examples and demos for you! More is always better!

What’s so special about these demos?

In this refresh, we added examples derived from machine data that we have used in our machine learning blog posts and in other ML projects we’ve been involved in. The demos are based on real use cases that we regularly hear about from our customers.

Monitor Your Experiments – Automatically

Remember that one time you found the best parameters and fields for your awesome predictive model, but you kept experimenting and forgot your golden setup? We took care of that in MLTK 3.2. We made it possible to track a wealth of useful information about your experiments, such as the algorithm, its parameters, notes on each experiment, and information on the model’s accuracy.

What’s New in the Experiment History?

As a new feature, we now automatically monitor not only the accuracy of your model at the time the experiment was run, but also the accuracy of its scheduled retrainings, all under the same Experiment History tab.



Score!

We are the creators of our models and we should have the right to judge them too, right? To help you be judgemental, we are introducing a new ML-SPL command: | score.

The score command provides a full set of statistical tests for validating your model predictions, using the familiar ML-SPL grammar. Forget about writing sophisticated SPL snippets and chunky macros for scoring your models. With the score command, you just need to “put that in your |” to easily validate your models.
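To make that concrete, here is a rough sketch of what a scoring search could look like. The lookup file (churn_sample.csv), model name (churn_model), and field names are hypothetical placeholders, and accuracy_score is only one of the available scoring methods, so swap in your own data, model, and metric:

  | inputlookup churn_sample.csv
  | apply churn_model
  | rename "predicted(Churn)" as predicted_churn
  | score accuracy_score Churn against predicted_churn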

"The latest release of Splunk Machine Learning Toolkit makes it significantly easier to process large amounts of data and find patterns to see what's right or wrong. Splunk’s continued evolution of the Experiment Management Framework, including new tools to help validate our machine learning models, streamlines the complicated process of operationalizing machine learning." - Sundaresh Ramanathan, Director, IT Operations Analytics, Kinney Group, Inc.


A Different Type of Scoring: Cross-Validation

One of the trickier parts of building a machine learning model is choosing training and test datasets. You always wonder whether the model you create is overfitted, biased, or of low quality because of how you selected your data.

In the workflows for the Predict Numeric Fields and Predict Categorical Fields Assistants, you have always had the option of randomly splitting your data into training and testing subsets. Every time you click the fit button, you get a different train/test split than the previous run, which already provides some peace of mind.

In MLTK 4.0, we added a new feature so that you can cross-validate your models right from the search queries that train them. Simply specify the number of cross-validation folds you want by setting the fit command’s new parameter, kfold_cv.
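For example, a five-fold cross-validated training search might look like the sketch below, which uses the iris sample lookup bundled with the toolkit (adjust the algorithm and field names for your own data):

  | inputlookup iris.csv
  | fit LogisticRegression species from petal_length petal_width sepal_length sepal_width kfold_cv=5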

For classification modeling, you will get the weighted confusion matrix on each fold:

For regression modeling, you will get the negative mean squared error and the R2 value on each fold:



New Algorithms? Ta-da!

Are you the type who’s always looking for a needle in a haystack? You’ll find a new method to do so in MLTK 4.0.

Local Outlier Factor is an unsupervised anomaly detection algorithm that highlights anomalies with respect to their neighboring data points (remember to scale your data first!). With this algorithm, you can find data points that are isolated from the rest. This could mean finding a server that performs differently from the others, tagging a website visitor who shows strange behavior, or triggering timely maintenance of a wind turbine that generates unknown vibration frequencies.
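Here’s a rough sketch of how you might put it to work. The lookup and field names are hypothetical, n_neighbors is the usual scikit-learn knob, and we’re assuming anomalies are marked with -1 in an is_outlier output field (check the docs for the exact output field in your version):

  | inputlookup server_metrics.csv
  | fit StandardScaler cpu_util mem_util response_time
  | fit LocalOutlierFactor SS_cpu_util SS_mem_util SS_response_time n_neighbors=20
  | search is_outlier=-1

The StandardScaler step handles the scaling mentioned above; its output fields are prefixed with SS_ by default.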

MLPClassifier with Partial Fit

One of the great things about doing machine learning in Splunk is the power of operationalization that the platform provides. Today, we can easily retrain our models using Splunk scheduled searches. Beyond that, SGDClassifier and SGDRegressor are two linear algorithms that also support partial fit, which makes it possible to learn incrementally from incoming data instead of retraining on the entire dataset, a process that can be computationally expensive and time-consuming.

MLTK 4.0 also makes incremental learning of a nonlinear model possible by supporting partial fit for a neural network algorithm, MLPClassifier. Did you hear that? Neural network! Yes, we finally have NNs in the MLTK!
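As a sketch of what that could look like in a scheduled retraining search (the lookup, field, and model names here are hypothetical placeholders):

  | inputlookup latest_training_batch.csv
  | fit MLPClassifier outcome from feature_1 feature_2 feature_3 partial_fit=true into incremental_nn_model

With partial_fit=true, each scheduled run updates the existing model with the new batch of data rather than training from scratch.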

But that's not all. Looking for more on what's new in the Splunk Machine Learning Toolkit 4.0? Check out part two of this blog series.  


Follow all the conversations coming out of #splunkconf18!

Posted by Manish Sainani

Manish is responsible for the Advanced Analytics & Machine Learning portfolio at Splunk across our Enterprise, Cloud and Solutions products, including Splunk Enterprise, Splunk Cloud, the Machine Learning Toolkit (MLTK) and IT Service Intelligence.

 
