Large Language Models (LLMs) provide strong reasoning and data summarization capabilities, making them valuable proxies for a variety of cybersecurity operations tasks. However, their performance can decline when applied to highly specific or enterprise-contextual tasks, particularly if the models rely solely on public internet data.
This research explores methods to guide LLMs toward targeted security objectives, specifically by continuing our exploration of PowerShell script classification. By adapting our DSDL workflow to include few-shot learning, Retrieval Augmented Generation (RAG), and fine-tuning, we evaluate the potential performance gains and trade-offs of each technique.
Our first method builds on the prompt from our baseline test, where we asked various open-weight models to perform PowerShell classification in the Splunk platform. Few-shot learning provides the model with a handful of labeled examples to assist in a classification task. Giving the model a few labeled examples (typically one to five) can help it learn a pattern, or the attributes that differentiate the underlying classes. The number of examples you can use is constrained by the model’s context length, which must accommodate your prompt, any example inputs, instructions, and the model’s own responses. Tokens, roughly corresponding to words or word parts, are the units that fill this space.
For example, if a model has a 4,000-token context length, you can prompt it with a few paragraphs of text, several examples, and a task description, as long as the total stays under 4,000 tokens. If you exceed that limit, the model will either ignore earlier content or fail to respond correctly. With a larger context length, users can include dozens or even hundreds of examples in the prompt, a technique called many-shot learning, which often provides better performance on top of this on-the-fly flexibility.
Our model for this test, Llama3-8B, supports a context length of approximately 8,000 tokens. After calculating the average script length in our dataset, we found that six examples (three per class) could be accommodated without truncation. Will just a few examples be enough to improve PowerShell classification accuracy?
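As a rough illustration of this budgeting exercise, here is a minimal sketch of how such a calculation might be done. The ~4 characters-per-token heuristic and the hypothetical average script length are assumptions for illustration only, not the exact measurements we used; a real tokenizer gives exact counts.

```python
# Rough few-shot budget check: how many labeled examples fit in the context window
# alongside the instruction, the script being classified, and the model's reply?
# The ~4 characters-per-token heuristic and the numbers below are illustrative
# assumptions; a real tokenizer gives exact counts.

CONTEXT_LIMIT = 8000   # approximate Llama3-8B context length (tokens)

def approx_tokens(text_chars: int) -> int:
    """Very rough token estimate: ~4 characters per token of English text."""
    return max(1, text_chars // 4)

def examples_that_fit(avg_script_chars: int,
                      instruction_tokens: int = 60,
                      response_tokens: int = 20) -> int:
    per_example = approx_tokens(avg_script_chars) + 5   # script plus its label line
    target_script = approx_tokens(avg_script_chars)     # the sample being classified
    budget = CONTEXT_LIMIT - instruction_tokens - response_tokens - target_script
    return budget // per_example

# Hypothetical average script length of ~4,500 characters (~1,125 tokens)
print(examples_that_fit(avg_script_chars=4500))
```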
For this test, we manually modified DSDL’s llm_rag_ollama_text_processing notebook to append the examples directly into the prompt, then used SPL to initiate the classification and analyze the results:
index="powershell_encoded" encoded=* | eval seed="19844" | eval hash_val = tonumber(substr(md5(seed . label . Content), 1, 8), 16) | eval rand_val = hash_val / 4294967295 | eventstats count AS total_count BY label | sort label rand_val | streamstats count AS class_count BY label | where class_count <= 500 | table Content, label | rename Content as text | fit MLTKContainer algo=RF_llm_rag_ollama_text_processing_megaprompt model_name="llama3" prompt="You are a cybersecurity expert. Classify the intent of the PowerShell script. Choose from 'malicious' or 'benign'. Only output the category name in your answer." text into app:RF_llm_rag_ollama_text_processing as LLM
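Inside the notebook, the change amounts to concatenating the labeled examples onto the prompt before each request is sent to the model. The sketch below shows one way this might look, assuming the Ollama Python client and placeholder example scripts; the actual notebook code in DSDL differs in its details.

```python
# Minimal sketch of a few-shot "megaprompt": labeled examples are appended to the
# base prompt before each script is sent to a local Ollama endpoint.
# The example scripts below are placeholders, not samples from our dataset.
import ollama  # assumes the Ollama Python client and a locally pulled llama3 model

FEW_SHOT_EXAMPLES = [
    {"label": "malicious", "text": "powershell -nop -w hidden -enc SQBFAFgAIAAo..."},
    {"label": "benign",    "text": "Get-ChildItem C:\\Logs | Where-Object {$_.Length -gt 1MB}"},
    # ... three examples per class in our test
]

BASE_PROMPT = (
    "You are a cybersecurity expert. Classify the intent of the PowerShell script. "
    "Choose from 'malicious' or 'benign'. Only output the category name in your answer."
)

def build_prompt(script: str) -> str:
    """Prepend the labeled examples, then append the script to classify."""
    shots = "\n\n".join(
        f"Script:\n{ex['text']}\nLabel: {ex['label']}" for ex in FEW_SHOT_EXAMPLES
    )
    return f"{BASE_PROMPT}\n\n{shots}\n\nScript:\n{script}\nLabel:"

def classify(script: str) -> str:
    response = ollama.generate(model="llama3", prompt=build_prompt(script))
    return response["response"].strip().lower()
```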
With just a handful of examples, we observed a boost in the model’s precision and accuracy of +8% and +6%, respectively, for PowerShell classification. These gains translate to more reliable identification of malicious scripts, with no added cost in average response time.
[Figures: Baseline | Few-Shot Learning]
Method | Precision | Recall | Accuracy | F1 | Avg Response Time
---|---|---|---|---|---
Baseline | 0.78 | 1.00 | 0.86 | 0.87 | 0.74 seconds
Few-Shot Learning | 0.86 | 0.99 | 0.92 | 0.92 | 0.74 seconds
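For readers reproducing the comparison, the table metrics can be computed from a results table of predicted versus actual labels. Below is a brief sketch using scikit-learn with hypothetical column names; our actual evaluation pipeline may differ in its details.

```python
# Scoring a classification run: compare the model's predicted label against the
# ground-truth label for each sampled script. Column names here are illustrative;
# the actual output fields come from the DSDL search results.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def score(results: pd.DataFrame) -> dict:
    y_true = results["label"]          # 'malicious' or 'benign'
    y_pred = results["predicted"]      # model output, normalized to the same values
    return {
        "precision": precision_score(y_true, y_pred, pos_label="malicious"),
        "recall":    recall_score(y_true, y_pred, pos_label="malicious"),
        "accuracy":  accuracy_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred, pos_label="malicious"),
    }
```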
Here are some other tips for getting the most out of few-shot learning:
Our second method, RAG, raises the technical complexity a bit, with the aim of improving accuracy enough to make the effort worthwhile. RAG uses a vector database to store encoded data, which is retrieved to provide additional context for the prompt the user submits to the model. To use RAG with DSDL, you must run the Milvus vector database container alongside Ollama.
To support PowerShell classification, we first embed a series of malicious and benign scripts as vector data using the ‘DSDL Encoding Assistant’. Once the data is embedded (represented as numerical vectors) in the Milvus database, we use the LLM-RAG Assistant (the llm_rag_script notebook) to query the model with the new collection of data as context. The parameter top_k controls how many retrieved examples (PowerShell script samples) are selected based on their vector similarity to the input query.
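Under the hood, the encoding step amounts to embedding each script with the all-MiniLM-L6-v2 sentence transformer (384 dimensions) and writing the vectors into a Milvus collection. A minimal sketch of that flow is below, assuming the pymilvus MilvusClient API, a local Milvus instance, and placeholder sample scripts; in practice the DSDL Encoding Assistant handles this for you.

```python
# Embed labeled PowerShell scripts and store them in a Milvus collection.
# Connection details and the sample data below are illustrative assumptions.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus endpoint

if not client.has_collection("RAG_final"):
    client.create_collection(collection_name="RAG_final", dimension=384)

scripts = [
    {"text": "IEX (New-Object Net.WebClient).DownloadString('http://...')", "label": "malicious"},
    {"text": "Get-Service | Where-Object {$_.Status -eq 'Running'}",        "label": "benign"},
]

rows = [
    {"id": i, "vector": embedder.encode(s["text"]).tolist(),
     "text": s["text"], "label": s["label"]}
    for i, s in enumerate(scripts)
]
client.insert(collection_name="RAG_final", data=rows)
```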
To put all of this in context, when the model begins classifying a new sample, the script in question is first converted to vector form. Using vector distance metrics, the RAG pipeline retrieves the top_k closest stored samples and adds them to the prompt as additional context for the classification decision. Running this on our repeated random sample looks like:
index="powershell_encoded" encoded=* | eval seed="19844" | eval hash_val = tonumber(substr(md5(seed . label . Content), 1, 8), 16) | eval rand_val = hash_val / 4294967295 | eventstats count AS total_count BY label | sort label rand_val | streamstats count AS class_count BY label | where class_count <= 500 | table Content, label | rename Content as query | eval query = "You are a cybersecurity expert. Classify the intent of the PowerShell script. Choose from 'malicious' or 'benign'. Only output the category name in your answer." + query | fit MLTKContainer algo=llm_rag_script model_name=llama3 embedder_name="all-MiniLM-L6-v2" use_local=1 embedder_dimension=384 collection_name=RAG_final top_k=2 rag_type=Documents query into app:llm_rag_script
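Conceptually, the llm_rag_script notebook’s job for each row is to embed the incoming script, pull the top_k nearest stored samples from the collection, and prepend them to the prompt before querying Llama3. The sketch below illustrates that flow, reusing the same assumed components as the earlier sketches (the MiniLM embedder, a local Milvus instance, and the Ollama Python client); it is a simplification of what the notebook actually does.

```python
# RAG-style classification: retrieve the top_k most similar labeled scripts
# from Milvus and hand them to the model as context. Simplified illustration.
import ollama
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = MilvusClient(uri="http://localhost:19530")

def classify_with_rag(script: str, top_k: int = 2) -> str:
    query_vec = embedder.encode(script).tolist()
    hits = client.search(
        collection_name="RAG_final",
        data=[query_vec],
        limit=top_k,
        output_fields=["text", "label"],
    )[0]

    # Build the prompt from the retrieved neighbors plus the script to classify
    context = "\n\n".join(
        f"Similar script ({h['entity']['label']}):\n{h['entity']['text']}" for h in hits
    )
    prompt = (
        "You are a cybersecurity expert. Classify the intent of the PowerShell script. "
        "Choose from 'malicious' or 'benign'. Only output the category name in your answer.\n\n"
        f"{context}\n\nScript to classify:\n{script}"
    )
    response = ollama.generate(model="llama3", prompt=prompt)
    return response["response"].strip().lower()
```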
Using RAG, we observed our biggest improvements in precision (+19%) and accuracy (+7%) over baseline yet, with some opportunity cost in recall and response time. In tasks like PowerShell classification, where malicious samples can vary wildly (e.g., obfuscation, LOLBins, encoded strings), it’s inefficient to pack dozens or hundreds of examples into a static few-shot prompt. Instead, RAG can target only the most similar patterns using vector search, while avoiding irrelevant noise. Gathering this additional context does create a trade-off: improved accuracy comes at the cost of increased response time. Interestingly, in this case, pulling the closest four examples (top_k=4) did not improve accuracy over pulling just the closest two (top_k=2).
[Figures: Stand-Alone Inference (baseline) | Few-shot Learning (6 examples: 3 malicious / 3 benign) | LLM-RAG (top_k=2) | LLM-RAG (top_k=4)]
Method | Precision | Recall | Accuracy | F1 | Avg Response Time
---|---|---|---|---|---
Baseline | 0.78 | 1.00 | 0.86 | 0.87 | 0.74 seconds
Few-shot Learning | 0.86 | 0.99 | 0.92 | 0.92 | 0.74 seconds
LLM-RAG top_k:2 | 0.97 | 0.89 | 0.93 | 0.93 | 2.5 seconds
LLM-RAG top_k:4 | 0.96 | 0.89 | 0.93 | 0.92 | 6.3 seconds
Here are some other tips for getting the most out of RAG:
The final option for targeting a particular security problem is to fine-tune an existing pre-trained model for your specific domain or use case. Fine-tuning builds on the general language knowledge and reasoning of the base model, but slightly modifies its weights through a training process on additional data. Fine-tuning requirements vary depending on the method used, but can demand a significant investment of time, money, and compute.
While we can fine-tune Llama3, its decoder-only architecture is better suited for text generation and reasoning tasks than for classification specifically. This year, Cisco’s Foundation AI team released Foundation-Sec-8B, an open-weight model fine-tuned for cybersecurity applications and based on the Llama3-8B model. Foundation-Sec-8B is a great asset for supporting a variety of generalized security use cases.
For the PowerShell classification task at hand, an encoder-only model like RoBERTa, which is specifically designed to encode meaningful representations of its input text, is much better suited. To test this option alongside few-shot learning and RAG, we performed full fine-tuning to create a variant called neon-RoBERTa, adding a small neural network layer on top of the pretrained model to perform the binary PowerShell classification task.
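To give a sense of what this looks like in practice, the sketch below fine-tunes a RoBERTa checkpoint with a two-class classification head using the Hugging Face transformers Trainer. The dataset handling, hyperparameters, and output directory are illustrative assumptions, not our exact training configuration for neon-RoBERTa.

```python
# Full fine-tuning of roberta-base with a 2-class head for PowerShell intent.
# Data loading and hyperparameters are illustrative assumptions only.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Placeholder training data: script text plus 0/1 labels (benign/malicious)
train_ds = Dataset.from_dict({
    "text": ["Get-Process", "powershell -enc SQBFAFgA..."],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="neon-roberta",        # hypothetical checkpoint directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```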
Our fine-tuned, encoder-only model shows how well we can optimize performance and accuracy by specializing in a narrow task, but with the highest level of investment required to build the solution. Overall, this model boosts precision by 21% and accuracy by 13%, while also improving classification time by over 99% per sample. It does not, however, have the inherent natural language abilities of a decoder model like Llama3, or the security domain knowledge of Foundation-Sec-8B; it focuses solely on PowerShell classification.
[Figure: Fine-Tuning]
Method | Precision | Recall | Accuracy | F1 | Avg Response Time
---|---|---|---|---|---
Baseline | 0.78 | 1.00 | 0.86 | 0.87 | 0.74 seconds
Few-shot Learning | 0.86 | 0.99 | 0.92 | 0.92 | 0.74 seconds
LLM-RAG top_k:2 | 0.97 | 0.89 | 0.93 | 0.93 | 2.5 seconds
LLM-RAG top_k:4 | 0.96 | 0.89 | 0.93 | 0.92 | 6.3 seconds
Fine-tuning (RoBERTa) | 0.99 | 0.98 | 0.99 | 0.99 | 7.7 microseconds
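Once trained, classification is a single forward pass through the encoder, which can be batched for very fast per-sample throughput. A minimal inference sketch is below, assuming the fine-tuned checkpoint was saved locally under a hypothetical neon-roberta directory; the label mapping depends on how the model config was set during training.

```python
# Classify new scripts with the fine-tuned encoder. The checkpoint path and
# the scripts below are assumptions for illustration; labels default to
# LABEL_0 / LABEL_1 unless an id2label mapping was saved with the model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="neon-roberta",       # path to the fine-tuned checkpoint
    tokenizer="neon-roberta",
)

scripts = [
    "Get-ChildItem -Path C:\\Users -Recurse",
    "IEX (New-Object Net.WebClient).DownloadString('http://...')",
]
for script, result in zip(scripts, classifier(scripts, truncation=True)):
    print(result["label"], round(result["score"], 3), script[:60])
```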
Here are some final considerations for fine-tuning:
When applying LLMs to a security problem, there are many options beyond your base prompt that can help guide the model to a better output. We’ve explored few-shot learning and RAG with DSDL, as well as fine-tuning, and measured significant improvements from each of these methods over our baseline prompt for PowerShell classification.
Few-shot learning and RAG rely on squeezing high-quality examples and relevant information into the context length to help guide the model’s output. The larger the context length, the more information the model can use to reason effectively. When in doubt, consider adding some samples to your prompt; even a small set of examples notably boosted performance in our test case. For more complex challenges, RAG performed particularly well because it pulls only the most relevant examples into context.
Fine-tuning can tailor a model’s knowledge toward better performance in a specific domain or task. Foundation models are best suited for multi-purpose use cases, but can be guided into performing classification. However, for the best results on narrow tasks like security classification, it was simpler and faster to fine-tune an existing encoder-only model.
With the rapid pace of AI advancement, there are increasingly more analysis options and opportunities to apply them in cybersecurity. These tools and techniques can be extremely powerful at accelerating and improving detection efforts like in our model-in-the-loop threat hunting concept. We hope this research helps you better understand your options, and guides you towards getting the most out of your security data with Splunk!