Hunting for Detections in Attack Data with Machine Learning

As a (fairly) new member of Splunk’s Threat Research Team (STRT), I found a unique opportunity to train machine learning models in a more impactful way. I focus on applying natural language processing and deep learning to build security analytics. I am surrounded by fellow data scientists, blue teamers, reverse engineers, and former SOC analysts with a shared passion and vision to push the state of the art in cyber defense. STRT has collected real-world and simulated attack data that allows me not only to use machine learning to discover attack activity but also to identify how to transform those insights into detections for the benefit of our customers.

A recent exercise using machine learning (ML) to hunt threats in Windows audit logs containing traces of post-exploitation kits illustrates that even small amounts of attack data can create new analytic opportunities. I wanted to explore which dual-use utilities (i.e., living-off-the-land tools) were leveraged by the exploit kits Meterpreter and Koadic to perform discovery, lateral movement, and other actions. Rod Soto, a prolific threat engineer and Splunker, curated a dataset of Windows logs that captured artifacts of these tools (e.g., process creations and logons).

As a data scientist, it is tempting to create a model with minimal guidance and see what anomalies appear, since some types of models (like deep neural networks) perform a degree of feature engineering on their own and may not need expert intuition to hone the model. This, however, is a recipe for fruitless investigations, especially for end users. There is no shortage of unusual sequences of events in the normal course of machine operation. Therefore, we need to give the model some guidance without being so prescriptive that we miss detection opportunities.

I leveraged Rod’s guidance on which logs and actions to analyze (logons and process creation) to focus the model on the subset of data with the highest return on investment. I would call this a known-unknown search: we have an idea of when the attack occurred and which logs likely contain artifacts of the attack, but not necessarily how the attack will appear (if at all). I trained a deep learning anomaly detector to find unusual collections of process creations from the Windows system folder (e.g. C:\Windows\). The specific model I used is called an autoencoder, which learns to compress the input data, process creation counts in this case, into a low-dimensional space and then recreate the original input so that the reconstruction is as close to the original as possible. Time windows that the model reconstructs poorly are the anomalies. Unusual collections of process counts in these periods, such as excessive icacls.exe invocations, may be best explained by the operation of an attack and not normal OS activity. My primary tools for this task are TensorFlow, Jupyter, and Python. An example notebook is available on our security content repository. Below is a screenshot of a notebook that contained my investigation.

Snapshot of a Jupyter notebook. The above cell finds the most anomalous windows of activities. In this hour, we observe many processes launched with executables from C:\Windows. Note that we see many executables that are leveraged by attackers (msiexec, net, icacls, rundll, etc.). 
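To make the autoencoder idea concrete, here is a minimal sketch in plain Python. It is not the notebook's TensorFlow model: it swaps in a tiny linear autoencoder with a single hidden value and tied encoder/decoder weights, and it assumes a set of baseline windows is available for training. The executables, counts, scaling constant, and hyperparameters are all made up for illustration.

```python
# Hypothetical process-creation counts per one-hour window.
# Columns: svchost.exe, conhost.exe, icacls.exe (chosen for illustration).
baseline_windows = [
    [5, 3, 0], [6, 3, 0], [4, 2, 0], [5, 4, 0], [6, 4, 0], [4, 3, 0],
]
attack_window = [5, 3, 40]  # excessive icacls.exe invocations

SCALE = 6.0  # largest baseline count; keeps the inputs close to 1.0


def scale(window):
    return [count / SCALE for count in window]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reconstruction_error(weights, x):
    """Squared error after compressing x to one number and decoding it."""
    code = dot(weights, x)                # encoder: 3 counts -> 1 value
    x_hat = [code * w for w in weights]   # decoder: 1 value -> 3 counts
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))


def train(windows, epochs=500, learning_rate=0.05):
    """Fit the tied weights by batch gradient descent on the squared error."""
    weights = [0.5] * len(windows[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x in windows:
            code = dot(weights, x)
            residual = [xi - code * w for xi, w in zip(x, weights)]
            residual_dot_w = dot(residual, weights)
            for j in range(len(weights)):
                # Partial derivative of ||x - (w.x) w||^2 with respect to w[j].
                grad[j] -= 2.0 * (x[j] * residual_dot_w + code * residual[j])
        weights = [w - learning_rate * g / len(windows)
                   for w, g in zip(weights, grad)]
    return weights


weights = train([scale(w) for w in baseline_windows])
baseline_errors = [reconstruction_error(weights, scale(w))
                   for w in baseline_windows]
attack_error = reconstruction_error(weights, scale(attack_window))

print("largest baseline error:", round(max(baseline_errors), 4))
print("attack window error:   ", round(attack_error, 4))
```

The model only learns the mix of processes seen in the baseline, so a window dominated by icacls.exe cannot be squeezed through the one-value bottleneck and comes back with a large reconstruction error. A deep autoencoder applies the same principle with more layers and nonlinearities.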

The model quickly identified two very unusual behaviors of these tools concerning process creation. First, we found that the tools created and executed an excessive number of processes from Windows Temp. Second, since Meterpreter and Koadic reside in memory, many of their actions require launching taskhost. We confirmed that both the number of processes created from Windows Temp and the number of taskhost and taskhostex invocations by the exploit kits were significantly higher than what we observed in our research of normal Windows activity. Detections based on this investigation are now part of our security content:

| Name | Technique ID | Tactic | Description |
| --- | --- | --- | --- |
| Excessive number of taskhost processes | T1059 | Execution | This detection targets behaviors observed in post-exploitation kits like Meterpreter and Koadic that are run in memory. We have observed that these tools must invoke an excessive number of taskhost.exe and taskhostex.exe processes to complete various actions (discovery, lateral movement, etc.). |
| Excessive number of distinct processes created in Windows Temp folder | T1059 | Execution | This analytic will identify suspicious series of process executions. We have observed that post-exploitation framework tools like Koadic and Meterpreter will launch an excessive number of processes with distinct file paths from Windows\Temp to execute actions on objective. This behavior is extremely anomalous compared to typical application behaviors that use Windows\Temp. |

Examples of processes launched from C:\Windows\Temp. In this twenty-minute block, we see 55 distinct process paths. Unusual for sure.
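The logic behind the Windows Temp detection can be sketched in a few lines of Python. This is an illustration of the idea, not the shipped analytic: the event tuples, host names, file names, and the threshold of 3 are hypothetical, and a real threshold would need to be tuned against your own baseline.

```python
from collections import defaultdict
from datetime import datetime

TEMP_ROOT = "c:\\windows\\temp\\"
BLOCK_MINUTES = 20
THRESHOLD = 3  # hypothetical; tune against normal activity in your environment

# Hypothetical process-creation events: (timestamp, host, process path).
events = [
    ("2020-06-01T10:01:00", "win-host-1", r"C:\Windows\Temp\a.exe"),
    ("2020-06-01T10:03:00", "win-host-1", r"C:\Windows\Temp\b.exe"),
    ("2020-06-01T10:05:00", "win-host-1", r"C:\Windows\Temp\c.exe"),
    ("2020-06-01T10:06:00", "win-host-1", r"C:\Windows\Temp\c.exe"),  # repeat
    ("2020-06-01T10:12:00", "win-host-1", r"C:\Windows\Temp\d.exe"),
    ("2020-06-01T10:19:00", "win-host-1", r"C:\Windows\Temp\e.exe"),
    ("2020-06-01T10:25:00", "win-host-1", r"C:\Windows\Temp\f.exe"),
    ("2020-06-01T10:02:00", "win-host-2", r"C:\Windows\Temp\setup.exe"),
]


def block_start(ts):
    """Round a timestamp down to the start of its 20-minute block."""
    minute = ts.minute - ts.minute % BLOCK_MINUTES
    return ts.replace(minute=minute, second=0, microsecond=0)


def excessive_temp_blocks(events, threshold=THRESHOLD):
    """Return {(host, block start): distinct path count} above the threshold."""
    paths_by_block = defaultdict(set)
    for timestamp, host, path in events:
        normalized = path.lower()  # Windows paths are case-insensitive
        if normalized.startswith(TEMP_ROOT):
            key = (host, block_start(datetime.fromisoformat(timestamp)))
            paths_by_block[key].add(normalized)
    return {key: len(paths)
            for key, paths in paths_by_block.items()
            if len(paths) > threshold}


flagged = excessive_temp_blocks(events)
for (host, start), count in flagged.items():
    print(host, start.isoformat(), count)
```

Counting distinct paths rather than raw events means an application that relaunches the same temporary binary does not trip the detection. The same bucketing could count taskhost.exe and taskhostex.exe invocations per block for the other detection.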

I took away some key lessons from this exercise. Unsupervised machine learning tasks like anomaly detection for security can be both powerful and efficient if the dataset is focused on the attack window and you have a general, but not exact, idea of what to look for. Data scientists constantly struggle with noise, and applying these techniques with little supervision across large numbers of machines without a known attack will surely produce a lot of false positives, because anomalies happen. It may be more effective to take attack data as a starting point and then generalize to find novel threats. That way, we do not send SOC analysts on wild goose chases but instead focus them on investigating real threats. We will continue to leverage our attack datasets for this ML-based hunting and periodically post interesting findings on our blogs. See you then!

Posted by Michael Hart

Michael Hart is a security researcher focused on the intersection of artificial intelligence, natural language processing, and security. He has developed solutions using machine learning for data loss prevention, data governance, endpoint and network monitoring, and alert triage. Before joining Splunk, Michael worked in research and engineering at Symantec, ad quality at Google, and security analytics at Aetna. He holds a Ph.D. in computer science from Stony Brook University.
