Defending at Machine-Speed: Accelerated Threat Hunting with Open Weight LLM Models

Artificial intelligence is accelerating the pace of cyber operations through the improved convenience, scale, and automation of tasks. To keep parity with efficiencies attackers will gain, defenders must employ AI to equal effect. The core of this challenge in a Security Operations Center (SOC) is applying human analysts to the tasks where their attention is most needed. With their ability to quickly and accurately process and interpret human language, Large Language Models (LLMs) are well-suited to address some of these topics, and reduce the overall cognitive burden of analysts when applied strategically. By identifying high-value use cases and engineering effective solutions, we can streamline defensive operations to keep pace with the accelerating threat landscape.

With Splunk DSDL 5.2, the engineering task is now much easier. We can test and deploy locally-hosted LLMs alongside Splunk to directly support inference on our data, without the security or privacy concerns of internal data leaving the network boundary. To identify a high-value use case, we want to determine the bottlenecks that currently require in-depth analyst attention.

With the rise of living-off-the-land attack techniques, attackers perform actions using tools that are allowed, and already present, inside the environment. PowerShell, for example, is often as legitimately useful for administrators as it is for ill-intentioned attackers, and its use requires an analyst to investigate context before making a final determination on the nature of the code’s intent. This challenge also aligns well with one of the strengths of LLMs derived from understanding human language – code interpretability. By deploying multiple local LLMs alongside a Splunk environment, we will test their ability to classify the intent of PowerShell scripts. We can use this as a pre-screening step by applying a prediction label of ‘malicious’ or ‘benign’ on each event. By injecting a language model into later stages of a threat hunting workflow, we can reduce the initial classification time of an event by up to 99%[1]!

The Setup

The first step in evaluating the feasibility of this idea is to assess the out-of-the-box ability of Language Models to accurately classify the intention of PowerShell scripts. After installing DSDL 5.2, we created a deduplicated, labeled dataset with balanced classes, compiled from samples of prior research. We used a constant seed value to create a reproducible, pseudo-random sample from the dataset, to ensure a comparable evaluation of each model.

To get a baseline we tested three medium-sized, open-weight models (Llama3-8B, Gemma-7B, Mistral-7B), with the prompt: “You are a cybersecurity expert. Classify the intent of the PowerShell script. Choose from 'malicious' or 'benign'. Output only the category name in your answer.” This can be done via the LLM-RAG Assistant dashboards, or directly in SPL using the fit command.

To classify a large sample, you may need to increase the number of seconds for the fit algorithm in MLTK, by increasing the value of the max_fit_time setting.

The Result

On an identical random sample of 2,000 PowerShell Scripts (1k malicious / 1k benign), these off-the-shelf models performed quite well. High precision (few false positives), very high recall (very rare false negatives), and an average response time of <2 seconds per event are all promising metrics for implementing a local-model as an assistant to pre-classify interesting PowerShell events for our analysts.

Llama3 - 8B
Gemma 1.1 - 7B
Mistral 0.3 - 7B
Model
Precision
Recall
Accuracy
F1
Avg Response
Unclassified*
Llama3 - 7B
0.80
1.00
0.87
0.89
0.71 seconds
2.4%
Gemma 1.1 - 8B
0.70
1.00
0.78
0.82
1.99 seconds
3.5%
Mistral 0.3 - 7B
0.79
0.995
0.86
0.88
1.74 seconds
5.7%

*This percentage represents dropped samples that could not be definitively labeled based on the output

Key Findings

Model-in-the-Loop Hunting

To demonstrate the model-in-the-loop concept, we will create a set of ‘grey-area’ results that would require threat hunting, i.e., PowerShell activity that requires human-in-the-loop interpretation. We accelerate the threat hunting process by tagging events via LLM classification prior to analyst review. High model accuracy gives us confidence to expect limited false positives, and near-perfect recall leads us to expect few, to no, false-negatives.

To start we will adapt a PowerShell 4104 Hunting query from Splunk’s Threat Research Team, modifying the end of the query to select the field we want to submit for inference, and including the model prompt right in our SPL!:

index="powershell_class"

| eval DoIt = if(match(Content,"(?i)(\$doit)"), "4", 0)

| eval enccom=if(match(Content,"[A-Za-z0-9+\/]{44,}([A-Za-z0-9+\/]{4}

|[A-Za-z0-9+\/]{3}=

|[A-Za-z0-9+\/]{2}==)") OR match(Content, "(?i)[-]e(nc*o*d*e*d*c*o*m*m*a*n*d*)*\s+[^-]"),4,0)


| stats sum(Score) AS hunt_total by Filename, DoIt, enccom, compressed, downgrade, iex, mimikatz, rundll32, empire, webclient, syswow64, httplocal, reflection, invokewmi, invokecmd, base64, get, suspcmdlet, suspkeywrd, Content

| search hunt_total > 10

| sort -hunt_total

| rename Content as text

| fit MLTKContainer algo=ollama_text_processing model="llama3" prompt="You are a cybersecurity expert. Classify the intent of the PowerShell script. Choose from 'malicious' or 'benign'. Only output the category name in your answer." text into app:ollama_text_processing as LLM
| table Filename, hunt_total, text, LLM_Duration, LLM_Result

The result of this extra processing inserts a field into each event called LLM_Result, with a classification prediction label, derived from the model’s interpretation of the PowerShell code:

To put this into context, imagine you have 1,000 unique powershell execution alerts in your review queue. For an advanced, well-caffeinated analyst, this would take 83 hours to review at a pace of 5 minutes per alert with no breaks! With a model-in-the-loop, we could pre-screen this data and label every event (averaging 2 seconds per event), prior to analyst review – in 33 minutes!

Final Thoughts

The potential here is transformative. Not light speed, but 250x faster than a human analyst! By offloading the dense, detail-oriented, human language interpretation of code analysis to a fast model that never gets tired or mis-reads a line – threat hunting with a model-in-the-loop will improve accuracy, increase speed, and dramatically reduce the analyst’s workload.

Happy Hunting!

As always, security at Splunk is a team effort. Credit to authors and collaborators: Huaibo Zhao, Philipp Drieger

[1] Assuming a conservative average analysis time of 8 minutes/alert, improving to an initial LLM classification at <3 seconds.

Related Articles

Predicting Cyber Fraud Through Real-World Events: Insights from Domain Registration Trends
Security
12 Minute Read

Predicting Cyber Fraud Through Real-World Events: Insights from Domain Registration Trends

By analyzing new domain registrations around major real-world events, researchers show how fraud campaigns take shape early, helping defenders spot threats before scams surface.
When Your Fraud Detection Tool Doubles as a Wellness Check: The Unexpected Intersection of Security and HR
Security
4 Minute Read

When Your Fraud Detection Tool Doubles as a Wellness Check: The Unexpected Intersection of Security and HR

Behavioral analytics can spot fraud and burnout. With UEBA built into Splunk ES Premier, one data set helps security and HR reduce risk, retain talent, faster.
Splunk Security Content for Threat Detection & Response: November Recap
Security
1 Minute Read

Splunk Security Content for Threat Detection & Response: November Recap

Discover Splunk's November security content updates, featuring enhanced Castle RAT threat detection, UAC bypass analytics, and deeper insights for validating detections on research.splunk.com.
Security Staff Picks To Read This Month, Handpicked by Splunk Experts
Security
2 Minute Read

Security Staff Picks To Read This Month, Handpicked by Splunk Experts

Our Splunk security experts share their favorite reads of the month so you can follow the most interesting, news-worthy, and innovative stories coming from the wide world of cybersecurity.
Behind the Walls: Techniques and Tactics in Castle RAT Client Malware
Security
10 Minute Read

Behind the Walls: Techniques and Tactics in Castle RAT Client Malware

Uncover CastleRAT malware's techniques (TTPs) and learn how to build Splunk detections using MITRE ATT&CK. Protect your network from this advanced RAT.
AI for Humans: A Beginner’s Field Guide
Security
12 Minute Read

AI for Humans: A Beginner’s Field Guide

Unlock AI with the our beginner's field guide. Demystify LLMs, Generative AI, and Agentic AI, exploring their evolution and critical cybersecurity applications.
Splunk Security Content for Threat Detection & Response: November 2025 Update
Security
5 Minute Read

Splunk Security Content for Threat Detection & Response: November 2025 Update

Learn about the latest security content from Splunk.
Operation Defend the North: What High-Pressure Cyber Exercises Teach Us About Resilience and How OneCisco Elevates It
Security
3 Minute Read

Operation Defend the North: What High-Pressure Cyber Exercises Teach Us About Resilience and How OneCisco Elevates It

The OneCisco approach is not about any single platform or toolset; it's about fusing visibility, analytics, and automation into a shared source of operational truth so that teams can act decisively, even in the fog of crisis.
Data Fit for a Sovereign: How to Consider Sovereignty in Your Digital Resilience Strategy
Security
5 Minute Read

Data Fit for a Sovereign: How to Consider Sovereignty in Your Digital Resilience Strategy

Explore how digital sovereignty shapes resilient strategies for European organisations. Learn how to balance control, compliance, and agility in your data infrastructure with Cisco and Splunk’s flexible, secure solutions for the AI era.