A Smarter Way to Preprocess Your Data

In May we released the Splunk Machine Learning Toolkit (MLTK) version 5.2.  We’ve loved telling you about some of the great new features, including the most recent blog on DensityFunction. However, we know that before you can start experimenting with model-building algorithms such as DensityFunction, your data needs to be prepared for machine learning. Machine learning operates best when you provide clean data as the foundation for building your models. This is a common pain point for customers, which is why the Machine Learning Toolkit enables both self-guided and assisted data preprocessing options. In this blog, we’ll highlight the data preprocessing options available within the guided workflows of the MLTK Smart Assistants (and a selection of Experiment Assistants).

Preprocessing can transform your data into fields that give you better data experimentation results, higher quality models, and more usable visualizations. The Smart Assistants in the MLTK offer different preprocessing options within the Learn stage of their step-by-step workflows. Preprocessing steps include algorithms that reduce the number of fields, produce numeric fields from unstructured text, join or extract fields, or re-scale numeric fields. As with other aspects of MLTK guided modeling Assistants, any preprocessing steps taken also generate Splunk Search Processing Language (SPL) for you that can be viewed using the SPL button within each Assistant.

The Smart Clustering Assistant offers the option to use the StandardScaler algorithm to standardize the data fields by scaling their mean and standard deviation to 0 and 1, respectively. This standardization helps to avoid dominance of one or more fields over others in subsequent machine learning algorithms and is useful when the fields have very different scales.

The Smart Prediction Assistant offers the option to use FieldSelector to select the best predictor fields based on univariate statistical tests. Users can select modes including Percentile, K-best, False positivity rate, False discovery rate, and Family-wise error rate.

Both the Smart Clustering Assistant and Smart Prediction Assistant offer PCA and Kernel PCA preprocessing algorithms options. Use these preprocessing algorithms to reduce the number of fields by extracting new, uncorrelated features out of the data. PCA and KernelPCA can also be used to reduce the number of dimensions for visualization purposes, for example, to display a scatterplot chart.

To learn more about these algorithms and other preprocessing options with the MLTK Assistants, check out our User Guide.

Ready to get started? Download the free Machine Learning Toolkit app today and see how you can leverage machine learning with your Splunk data!

This blog was co-authored by Kristal Curtis, Senior Software Engineer

Mohan Rajagopalan

Posted by