Hunting for Detections in Attack Data with Machine Learning
A recent exercise using machine learning (ML) to hunt for threats in Windows audit logs containing traces of post-exploitation kits illustrates that even small amounts of attack data can create new analytic opportunities. I wanted to explore which dual-use utilities (i.e., living-off-the-land tools) were leveraged by the exploit kits Meterpreter and Koadic to perform discovery, lateral movement, and other actions. Rod Soto, a prolific threat engineer and Splunker, curated a dataset of Windows logs that captured artifacts (e.g., process creations, logons) of these tools.
As a data scientist, it is tempting to create a model with minimal guidance and see what anomalies appear, since some types of models (such as deep neural networks) perform a degree of automatic feature engineering that may reduce the need for expert intuition in honing the model. This, however, is a recipe for fruitless investigations, especially for end users: there is no shortage of unusual sequences of events in the normal course of machine operation. We therefore need to provide some guidance to the model without being so prescriptive that we miss detection opportunities.
I leveraged Rod’s guidance on which logs and actions to analyze (logons and process creations) to focus the model on the subset of data with the highest return on investment. I would call this a known-unknown search: we have an idea of when the attack occurred and which logs likely contain artifacts of it, but not necessarily how the attack will appear (if at all). I trained a deep learning anomaly detector to find unusual collections of process creations from the Windows system folder (e.g. C:\Windows\). The specific model I used is an autoencoder, which learns to compress its input, process creation counts in this case, into a low-dimensional space and then reconstruct the original input as closely as possible; inputs that reconstruct poorly are the anomalies. Unusual collections of process counts in these periods, such as excessive icacls.exe invocations, may be best explained by the operation of an attack rather than normal OS activity. My primary tools for this task were TensorFlow, Jupyter, and Python. An example notebook is available in our security content repository. Below is a screenshot of the notebook that contained my investigation.
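The approach can be sketched in a few lines of TensorFlow. This is a minimal illustration rather than the notebook's actual code: the synthetic count matrix, the layer sizes, and the training settings are all assumptions, with each row standing in for one time window of per-executable process creation counts.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the real features: rows are time windows,
# columns are counts of process creations per executable (hypothetical data).
rng = np.random.default_rng(0)
X = np.log1p(rng.poisson(lam=3.0, size=(500, 32)).astype("float32"))  # tame heavy-tailed counts

# A small autoencoder: compress 32 count features into 8 dimensions, then reconstruct.
inputs = tf.keras.Input(shape=(32,))
encoded = tf.keras.layers.Dense(8, activation="relu")(inputs)
decoded = tf.keras.layers.Dense(32, activation="linear")(encoded)
autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

# Anomaly score = per-window reconstruction error; windows the model
# cannot reconstruct well are the candidates worth investigating.
recon = autoencoder.predict(X, verbose=0)
scores = np.mean((X - recon) ** 2, axis=1)
top = np.argsort(scores)[::-1][:5]  # indices of the most anomalous windows
```

Ranking by reconstruction error is what surfaces windows dominated by unusual mixes of executables, such as the icacls.exe-heavy periods mentioned above.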
Snapshot of a Jupyter notebook. The cell above finds the most anomalous windows of activity. In this hour, we observe many processes launched with executables from C:\Windows. Note that many of these executables (msiexec, net, icacls, rundll, etc.) are commonly leveraged by attackers.
The model quickly identified two very unusual process-creation behaviors of these tools. First, the tools created and executed an excessive number of processes from Windows Temp. Second, since Meterpreter and Koadic reside in memory, many of their actions require launching taskhost. We confirmed that both the number of processes created from Windows Temp and the number of taskhost and taskhostex invocations by the exploit kits were significantly higher than what we observed in our research of normal Windows activity. Detections based on this investigation are now part of our security content.
Examples of processes launched from C:\Windows\Temp. In this twenty-minute block, we see 55 distinct process paths. Unusual for sure.
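The distinct-path counting behind a view like this can be sketched with pandas. The event frame and field names here are hypothetical stand-ins for parsed process-creation logs (e.g., Windows Event ID 4688), not the actual dataset:

```python
import pandas as pd

# Hypothetical process-creation events; in practice these would be parsed
# from Windows audit logs.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-05-01 10:01", "2021-05-01 10:05", "2021-05-01 10:12",
        "2021-05-01 10:31", "2021-05-01 10:33",
    ]),
    "process_path": [
        r"C:\Windows\Temp\a.exe", r"C:\Windows\Temp\b.exe",
        r"C:\Windows\Temp\a.exe", r"C:\Windows\System32\net.exe",
        r"C:\Windows\Temp\c.exe",
    ],
})

# Count distinct process paths per twenty-minute block.
distinct = (
    events.set_index("timestamp")
          .resample("20min")["process_path"]
          .nunique()
)
print(distinct)
```

Blocks with an outsized distinct-path count, like the 55-path window above, are exactly the ones the anomaly scores push to the top of the queue.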
I took away some key learnings from this exercise. Unsupervised machine learning tasks like anomaly detection for security can be both powerful and efficient if the dataset is focused on the attack window and you have a general, but not exact, idea of what to look for. Data scientists constantly struggle with noise; applying these techniques with little supervision across large numbers of machines without a known attack will surely produce a flood of false positives, because anomalies happen. It may be more effective to take attack data as a starting point and then generalize to find novel threats. That way, we are not sending SOC analysts on wild goose chases, but rather focusing them on investigating real threats. We will continue to leverage our attack datasets for this ML-based hunting and periodically post interesting findings on our blogs. See you then!
----------------------------------------------------
Thanks!
Michael Hart