Splunk recently released the 4.2 version of the Machine Learning Toolkit (MLTK), featuring a new algorithm—the probability density function. This algorithm is used to determine where values of a data set are expected to fall, based on historical values. It can help you identify anomalous values for a particular data set. The implementation of this algorithm in the MLTK means that we can now leverage machine learning (ML) techniques for identifying outliers in security-related data.
There are many cases in which identifying these anomalies is useful in a security context. Splunk Enterprise Security Content Update (ESCU) contains several searches that look for spikes in various data that may be indicative of malicious activity in your environment. These searches currently use Splunk's computational capabilities to calculate the standard deviation for a set of data points and then look for values that exceed some multiple of that number. While this technique is useful and often sufficient, leveraging the new DensityFunction algorithm in the MLTK provides several advantages (take a look at this blog post on the Splunk Machine Learning Toolkit 4.2 for a deeper dive).
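To make the standard-deviation technique described above concrete, here is a minimal Python sketch. It is not the SPL that ESCU ships; the function name, the multiplier of 3, and the sample counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def stdev_outliers(history, new_values, multiplier=3.0):
    """Return the new values that exceed mean + multiplier * stdev of history."""
    # The multiplier is a tunable assumption, not a value taken from ESCU.
    threshold = mean(history) + multiplier * stdev(history)
    return [v for v in new_values if v > threshold]

# Hypothetical hourly SMB connection counts for a single host.
hourly_smb_counts = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

# Only the large spike is reported; 14 and 17 stay under the threshold.
print(stdev_outliers(hourly_smb_counts, [14, 17, 95]))
```

This approach treats the data as roughly symmetric around its mean, which is one reason a fitted density can do better on skewed data.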
In release 1.0.38 of ESCU, we introduced MLTK versions of three of these searches. They are designed to look for spikes in SMB connections, unusually long command lines on your endpoints, and unusually long DNS queries (which could be attributed to activity with machine-generated domain names or misuse of the DNS protocol for nefarious purposes). Because we need two searches for each MLTK detection, one to build the model and the other to leverage it for detection, this resulted in a total of six new searches.
How the New ML Content Works
The new ML-related content in ESCU takes the form of six searches—three support searches that are used to create the ML models and three detection searches that use the models built by the support searches to look at new data and identify the outliers, relative to historical norms. The new searches are:
- Baseline of DNS Query Length - MLTK
- Baseline of SMB Traffic - MLTK
- Baseline of Command Line Length - MLTK
- DNS Query Length Outliers - MLTK
- SMB Traffic Spike - MLTK
- Unusually Long Command Line - MLTK
The first three searches use the MLTK “fit” command to build a model based on existing data. Each of these baseline searches must be run before its corresponding detection search, because the detection searches will fail if the models are not available. Once the models have been built, the detection searches use the “apply” command to compare incoming data against the model and generate a notable event when an outlier is identified.
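The two-phase pattern described above (build a model once, then apply it to new data) can be sketched in Python as follows. This is an illustration of the idea only, not the MLTK implementation: it assumes a normal distribution as a stand-in for the fitted density, and the function names, the 0.01 threshold, and the sample query lengths are all hypothetical.

```python
import json
from statistics import NormalDist

def fit_model(values):
    """Baseline step: summarize historical values as a distribution."""
    dist = NormalDist.from_samples(values)
    return {"mu": dist.mean, "sigma": dist.stdev}

def apply_model(model, values, threshold=0.01):
    """Detection step: flag values in the least likely `threshold` fraction."""
    dist = NormalDist(model["mu"], model["sigma"])
    lower = dist.inv_cdf(threshold / 2)
    upper = dist.inv_cdf(1 - threshold / 2)
    return [v for v in values if v < lower or v > upper]

# Hypothetical DNS query lengths seen during the baseline period.
history = [18, 22, 20, 25, 19, 21, 23, 20, 24, 18, 22, 21]

# The model is saved by the baseline step and loaded later by the detection step.
saved = json.dumps(fit_model(history))
model = json.loads(saved)

# Hypothetical new query lengths; only the very long one is flagged.
print(apply_model(model, [20, 24, 63]))
```

Separating the two steps is what makes the ordering requirement matter: the detection step has nothing to load until the baseline step has run at least once.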
Getting It All Up and Running
If you’ve never used ESCU before, go ahead and pull it down from Splunkbase and give it a try. This free subscription service provides you with Analytic Stories—themed security guides loaded with searches designed to help you secure your environment and investigate suspicious activity. It’s a simple install that will give you an interface to explore the content we provide. It’s designed to work with Splunk Enterprise Security, but you can explore the provided searches without it as well.
To use the new searches that leverage the DensityFunction algorithm in MLTK, you’ll need to make sure you have version 4.2 or greater of the MLTK installed on your search heads, in addition to version 1.4 or greater of Python for Scientific Computing (a required dependency). In addition, to use the MLTK commands "fit" and "apply" in ES, you’ll need to visit the “App Imports” configuration and follow the steps outlined here.
That’s it! You can now use the DensityFunction algorithm to hunt for the more subtle "tells" of malicious activity that are otherwise difficult to see.
Finding and Using the ML Content
The new ML-related searches in ESCU are peppered throughout various Analytic Stories and appear next to their original non-MLTK versions. If you’re running ES, the detection searches will show up in Content Management. You can quickly find them by navigating there and typing MLTK in the filter. If needed, you can narrow the results further by filtering on the ES Content Update app. From here, you can modify and execute the baseline searches. You can also enable the associated detection searches, which are, by default, scheduled to run every hour.
The searches building the models are set to run over your last 30 days' worth of data. You can edit the search to look over a larger period of time, which, generally speaking, will use more data in the construction of your model and give better results. However, this is not always the case, especially when the data is trending over the long term, in which case data from a shorter time frame may be more reflective of today’s “normal.” Similarly, you will probably want to periodically rebuild the model to make sure it accurately reflects your current environment. Because these models are built on your data and everyone’s data is a little different, there is no one right answer for how much data to use, or how often to rebuild the model. However, 30 days of data is likely a good starting point, and you can adjust based on your results. If you plan to really dive in and leverage the new algorithm in MLTK, read more on how to use it in Splunk Docs.
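The trade-off between a longer and a shorter training window can be illustrated with a small Python sketch. The data here is invented (a value that rises by one each day), the reference date is arbitrary, and a normal distribution again stands in for the fitted model, so treat this as a rough picture rather than a tuning guide.

```python
from datetime import datetime, timedelta
from statistics import NormalDist

def baseline(events, now, days=30):
    """Fit a distribution to the values observed in the last `days` days."""
    cutoff = now - timedelta(days=days)
    recent = [value for ts, value in events if ts >= cutoff]
    return NormalDist.from_samples(recent)

now = datetime(2019, 6, 1)  # arbitrary reference date for the example

# Hypothetical daily values that drift upward by 1 per day over 90 days.
events = [(now - timedelta(days=d), 100 - d) for d in range(90)]

short_window = baseline(events, now, days=30)
long_window = baseline(events, now, days=90)

# The shorter window centers much closer to the most recent values.
print(short_window.mean, long_window.mean)
```

With drifting data like this, the 90-day baseline would treat today's ordinary values as unusually high, which is the situation where a shorter window or more frequent rebuilds may serve better.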
Written by: Rico Valdez, Principal Security Researcher