We’ve just rolled out version 3.3 of the Splunk Machine Learning Toolkit, which builds on the capabilities we’ve delivered in the toolkit since the 3.2 release in May 2018. In this release, we’ve added new functionality that helps with unstructured text data analysis and provides more options for standardizing data.
Version 3.3 of the Machine Learning Toolkit adds two key features that I’d like to quickly review:
- TF-IDF as a pre-processing option
- Robust Scaler as a new algorithm
For Your Unstructured Text Data
If you have unstructured text data in Splunk that you’d like to analyze and pre-process before using it for modeling, then have a look at TF-IDF as a pre-processing option.
What is it?
TF-IDF (Term Frequency-Inverse Document Frequency) is a text mining technique for finding the words that are most relevant to a document. In information retrieval, TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
What does it do?
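TF-IDF breaks down a list of documents into words or characters. It then scores the importance of each word or character sequence in each document: a term scores high when it appears often in that document but rarely across the rest of the collection. The highest-scoring terms characterize a document in much the same way that tags do.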
When using machine learning in Splunk to find anomalies, predict or estimate a system’s response, or cluster to discover behavior, you often have some set of fields that are non-numeric: ServiceNow ticket information written by a human (“My service is slow!”), the country code from an iplookup, or even the search field from index=_audit. TF-IDF converts such fields into numeric features that machine learning can “learn” from.
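To make the idea concrete, here is a minimal Python sketch of how TF-IDF turns short text fields into numeric scores. It uses the textbook formula (term frequency times the log of inverse document frequency) and made-up ticket text. It is an illustration of the concept only, not the toolkit's implementation, which may tokenize, smooth, and normalize differently.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: score} dict per document.

    Textbook form: tf * log(N / df). Real libraries usually add smoothing
    and normalization, so their numbers will differ from these.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical ticket descriptions, invented for this example.
tickets = [
    "my service is slow",
    "my login is broken",
    "my service is down",
]

for ticket, row in zip(tickets, tfidf(tickets)):
    top = max(row, key=row.get)
    print(f"{ticket!r}: most distinctive term = {top!r}")
```

Words that appear in every ticket, such as "my" and "is", score 0 here, while a word unique to one ticket, such as "slow", scores highest for that ticket. Those per-term scores are the kind of numeric fields a clustering or classification algorithm can work with.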
Check out the behaviors of searches being run on Splunk! For example, run the following over the last 7 days, appended to a base search that returns the search field from the audit index:
| fit TFIDF search into searchTFIDF
| fit DBSCAN search_tf*
| fit LogisticRegression cluster from search_tf* into PersistingSearchBehavior
| stats count max(total_run_time) min(total_run_time) by cluster
Do you have interesting clusters of searches taking an unusual amount of run time? Do you have a cluster labeled -1? That is the label DBSCAN gives to points that fit no cluster, which here means the anomalous searches. You can quickly investigate these clusters by using the following:
| apply searchTFIDF
| apply PersistingSearchBehavior as cluster
| where cluster=-1
| table cluster, total_run_time, search, status, provenance
For live examples of TF-IDF, check out the DGA App from Splunkbase, and watch the DGA walkthrough video on our YouTube Channel!
More Standardizing Options
What is it, and what is its use case?
One of the best practices in building machine learning models is to standardize the data before any modeling. For example, if you have two fields to be used for model training, temperature with a scale of 21 to 40 and memory size with a scale of 10,000 to 1,000,000 bytes, then they need to be standardized to a comparable scale. Using Robust Scaler, one can standardize multiple fields at the same time.
How is it different from Standard Scaler?
Standard Scaler, which also ships with the Machine Learning Toolkit, helps with standardization, but it can be sensitive to outliers. Robust Scaler is not, because it centers each field on its median and scales it by its interquartile range, so that the median maps to 0 and the interquartile range to 1. Standard Scaler does the same job using the mean and standard deviation, both of which a single extreme value can skew.
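The difference is easiest to see on a small example. The Python sketch below applies both kinds of scaling to two made-up fields, one of which contains a single outlier. The field names, the values, and the helper functions standard_scale and robust_scale are all hypothetical and written just for this illustration. They show the arithmetic, not the toolkit's own code.

```python
import statistics

def standard_scale(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

# Hypothetical sample data: memory_bytes contains one extreme outlier.
fields = {
    "temperature": [21, 25, 30, 35, 40],
    "memory_bytes": [10_000, 20_000, 30_000, 40_000, 1_000_000],
}

for name, values in fields.items():
    print(name)
    print("  standard:", [round(x, 3) for x in standard_scale(values)])
    print("  robust:  ", [round(x, 3) for x in robust_scale(values)])
```

With standard scaling, the single 1,000,000 value drags the mean and standard deviation upward, so the four ordinary memory readings end up squeezed into a very narrow band. With robust scaling, those same four readings keep a spread comparable to the temperature field, and the outlier simply stands out as a large value.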
Learn how Splunk customers are using the Machine Learning Toolkit to generate benefits for their organizations, including Hyatt and the University of Nevada, Las Vegas (UNLV).
Interested in trying out the Machine Learning Toolkit at your organization? Splunk offers FREE data science resources to help you get it up and running. Learn more about the Machine Learning Customer Advisory Program.