What’s New in the Splunk Machine Learning Toolkit 4.1

Another great release of the Splunk Machine Learning Toolkit (MLTK) is hot off the press and ready for you to download! As we start our new product cycle on the path to .conf19, we continue to focus on you, the user, and on making machine learning-driven outcomes accessible, usable, and valuable to you. This release is all about addressing the feedback you gave us through our Machine Learning Beta Program and the many .conf18 customer meetings.

If you are one of the many customers new to Splunk or the MLTK, we're invested in your success with our new product documentation, starting with 4.1 and continuing through the year. Check back often!

  1. Welcome to the MLTK: A new introduction to get you up and running with the MLTK.
  2. MLTK FAQ: A one-stop shop for common MLTK questions and best practices.
  3. New to Splunk: A quick review of Splunk concepts for new users, not a substitute for our amazing EDU classes!


For our more mature MLTK users, thank you for your feedback and your amazing .conf18 presentations! This release includes many of your most frequently requested items for making machine learning more usable in Splunk, and you can expect more as we continue on this journey with our customers and partners.

  1. A new Imputer algorithm to speed your journey from messy data with missing values to a machine learning model driving business outcomes.
  2. HashingVectorizer as an alternative to TFIDF that helps you quickly convert large text features to numerical values for machine learning.
  3. Numerous usability enhancements, such as:
    • PCA can auto-tune the number of components (k) to reduce your data to.
    • Local Outlier Factor can optionally report the unsupervised anomaly strength of each point.
    • A BoxPlot macro that reduces the SPL you need to write to use this fabulous data visualization.


Splunk Admins, you rock! We are investing in making our products delightful and insightful for you too.

  1. ML-SPL Performance App has been updated, so you can estimate the impact of machine learning on your Splunk infrastructure. 
  2. Numerous bug fixes prioritized by customer feedback; thank you!
     


A Deeper Dive Into Some of the New Features of MLTK 4.1

Imputer

Do you have messy machine data? Of course you do! One common mess is missing data values. All you want to do is make a great graph or build an amazing machine learning model, but |fit or |apply has dropped rows of your data during its internal automated data prep stages, and you have less data than you thought. How can you quickly get back to solving the problem at hand and spend less time dealing with missing values?

The Imputer algorithm is a preprocessing step for replacing missing data with substitute values, either estimated or based on other statistics or values in the dataset.

Historical search to learn:

host="web_server"
| eval bytes_missing=if(random() % 3 = 0, null(), bytes)
| table _time, file, status, bytes_missing
| fit Imputer bytes_missing strategy="most_frequent" into bytes_missing_frequent
| eval imputed=if(isnull(bytes_missing), 1, 0)
| fields - bytes_missing
| rename Imputed_bytes_missing as bytes_imputed

Real-time search to apply:

host="web_server"
| eval bytes_missing=if(random() % 3 = 0, null(), bytes)
| apply bytes_missing_frequent
| eval imputed=if(isnull(bytes_missing), 1, 0)
| fields - bytes_missing
| rename Imputed_bytes_missing as bytes_imputed
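
To make the fit/apply split in the two searches concrete, here is a minimal Python sketch of the "most_frequent" idea: learn one fill value from historical data, then reuse it on new events. This is an illustration only, not the MLTK implementation; the function names `fit_most_frequent` and `apply_imputer` and the sample byte counts are hypothetical.

```python
from collections import Counter

def fit_most_frequent(values):
    """Learn the fill value: the most common non-missing entry (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # most_common(1) returns a one-element list of (value, count)
    return Counter(observed).most_common(1)[0][0]

def apply_imputer(values, fill_value):
    """Fill missing entries and flag which rows were filled (hypothetical helper)."""
    bytes_imputed = [fill_value if v is None else v for v in values]
    imputed = [1 if v is None else 0 for v in values]
    return bytes_imputed, imputed

# "Historical search": learn the fill value once (made-up sample data)
history = [512, None, 2048, 512, None, 1024]
fill = fit_most_frequent(history)

# "Real-time search": reuse the learned value on new events
new_events = [None, 4096, None]
bytes_imputed, imputed = apply_imputer(new_events, fill)
print(fill, bytes_imputed, imputed)
```

The flag list plays the same role as the `imputed` field in the searches above: it records which values were substituted so you can still tell real measurements from filled-in ones.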


HashingVectorizer

Want to build awesome machine learning models using large text fields like Error_Description or Ticker_Content_From_A_Human on big data, but running into memory issues with TFIDF?

TFIDF uses a lot of memory to create a large dictionary of all terms (ngrams and words) and expands the Splunk search events with hundreds of additional fields per event. If you've already hit memory limits, you should meet your new best friend: HashingVectorizer.

A quick word to the wise: there is a trade-off. TFIDF gives you an interpretable term dictionary held in memory, and it works with both fit and apply. HashingVectorizer gives you a non-interpretable, user-specified number of columns in your Splunk search events (fit only!). That saves you a ton of memory if you don't need to know term frequencies and just want outcomes.
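
The memory saving comes from the general "hashing trick", which can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not the MLTK's code: the function name `hash_vectorize`, the choice of SHA-256 as the hash, the whitespace tokenizer, and the sample messages are all hypothetical, and real implementations differ in the details.

```python
import hashlib

def hash_vectorize(text, n_columns=8):
    """Map text to a fixed-length count vector without storing a vocabulary."""
    vector = [0] * n_columns
    for token in text.lower().split():
        # A stable hash of the token picks the column; no term dictionary is kept,
        # so a column index cannot be mapped back to a word.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        column = int.from_bytes(digest[:8], "big") % n_columns
        vector[column] += 1
    return vector

# Made-up sample messages
a = hash_vectorize("disk full on host web01")
b = hash_vectorize("connection timeout connection reset")
print(a)
print(b)
```

The output width is fixed by `n_columns` no matter how many distinct terms appear in the data, which is where the memory saving comes from. The cost is that unrelated terms can collide in the same column, so the columns are not interpretable.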

For an in-depth look at how to use the MLTK, check out these webinars:


Learn how Splunk customers, including Hyatt, the University of Nevada, Las Vegas (UNLV), and TransUnion, are using the Machine Learning Toolkit to generate benefits for their organizations.

Interested in trying out the Machine Learning Toolkit at your organization? Splunk offers FREE data science resources to help you get it up and running. Learn more about the Machine Learning Customer Advisory Program.

Posted by Manish Sainani

Manish is responsible for the Advanced Analytics & Machine Learning portfolio at Splunk across our Enterprise, Cloud and Solutions products, including Splunk Enterprise, Cloud, Machine Learning Toolkit (MLTK) and IT Service Intelligence.

 
