Large Language Models (LLMs) provide strong reasoning and data summarization capabilities, making them valuable proxies for a variety of cybersecurity operations tasks. However, their performance can decline when applied to highly specific or enterprise-contextual tasks, particularly if the models rely solely on public internet data.
This research explores methods to guide LLMs toward targeted security objectives, specifically by continuing our exploration of PowerShell script classification. By adapting our DSDL workflow to include few-shot learning, Retrieval Augmented Generation (RAG), and fine-tuning, we evaluate the potential performance gains and trade-offs of each technique.
Our first method builds on the prompt from our baseline test, where we asked various open-weight models to perform PowerShell classification in the Splunk platform. Few-shot learning provides the model with a handful of labeled examples to assist in a classification task. Giving the model a few labeled examples (typically one to five) can help it learn a pattern, or the attributes that differentiate the underlying classes. The number of examples you can use is constrained by the model’s context length, which must accommodate your prompt, any example inputs, instructions, and the model’s own responses. Tokens, roughly corresponding to words or word parts, are the units that fill this space.
For example, if a model has a 4,000-token context length, you can prompt it with a few paragraphs of text, several examples, and a task description, as long as the total stays under 4,000 tokens. If you exceed that limit, the model will either ignore earlier content or fail to respond correctly. With a larger context length, users can include dozens or even hundreds of examples in the prompt, a technique called many-shot learning, which often provides better performance on top of this on-the-fly flexibility.
Our model for this test, Llama3-8B, supports a context length of approximately 8,000 tokens. After calculating the average script length in our dataset, we found that six examples (three per class) could be accommodated without truncation. Will just a few examples be enough to improve PowerShell classification accuracy?
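As a rough illustration of this budgeting exercise, here is a minimal sketch of how such a calculation might be done. The ~4 characters-per-token heuristic and the hypothetical average script length are assumptions for illustration only, not the exact measurements we used; a real tokenizer gives exact counts.

```python
# Rough few-shot budget check: how many labeled examples fit in the context window
# alongside the instruction, the script being classified, and the model's reply?
# The ~4 characters-per-token heuristic and the numbers below are illustrative
# assumptions; a real tokenizer gives exact counts.

CONTEXT_LIMIT = 8000   # approximate Llama3-8B context length (tokens)

def approx_tokens(text_chars: int) -> int:
    """Very rough token estimate: ~4 characters per token of English text."""
    return max(1, text_chars // 4)

def examples_that_fit(avg_script_chars: int,
                      instruction_tokens: int = 60,
                      response_tokens: int = 20) -> int:
    per_example = approx_tokens(avg_script_chars) + 5   # script plus its label line
    target_script = approx_tokens(avg_script_chars)     # the sample being classified
    budget = CONTEXT_LIMIT - instruction_tokens - response_tokens - target_script
    return budget // per_example

# Hypothetical average script length of ~4,500 characters (~1,125 tokens)
print(examples_that_fit(avg_script_chars=4500))
```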
For this test, we manually modified DSDL’s llm_rag_ollama_text_processing notebook to append the examples directly into the prompt, then used SPL to initiate the classification and analyze the results:
index="powershell_encoded" encoded=* | eval seed="19844" | eval hash_val = tonumber(substr(md5(seed . label . Content), 1, 8), 16) | eval rand_val = hash_val / 4294967295 | eventstats count AS total_count BY label | sort label rand_val | streamstats count AS class_count BY label | where class_count <= 500 | table Content, label | rename Content as text | fit MLTKContainer algo=RF_llm_rag_ollama_text_processing_megaprompt model_name="llama3" prompt="You are a cybersecurity expert. Classify the intent of the PowerShell script. Choose from 'malicious' or 'benign'. Only output the category name in your answer." text into app:RF_llm_rag_ollama_text_processing as LLM
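Inside the notebook, the change amounts to concatenating the labeled examples onto the prompt before each request is sent to the model. The sketch below shows one way this might look, assuming the Ollama Python client and placeholder example scripts; the actual notebook code in DSDL differs in its details.

```python
# Minimal sketch of a few-shot "megaprompt": labeled examples are appended to the
# base prompt before each script is sent to a local Ollama endpoint.
# The example scripts below are placeholders, not samples from our dataset.
import ollama  # assumes the Ollama Python client and a locally pulled llama3 model

FEW_SHOT_EXAMPLES = [
    {"label": "malicious", "text": "powershell -nop -w hidden -enc SQBFAFgAIAAo..."},
    {"label": "benign",    "text": "Get-ChildItem C:\\Logs | Where-Object {$_.Length -gt 1MB}"},
    # ... three examples per class in our test
]

BASE_PROMPT = (
    "You are a cybersecurity expert. Classify the intent of the PowerShell script. "
    "Choose from 'malicious' or 'benign'. Only output the category name in your answer."
)

def build_prompt(script: str) -> str:
    """Prepend the labeled examples, then append the script to classify."""
    shots = "\n\n".join(
        f"Script:\n{ex['text']}\nLabel: {ex['label']}" for ex in FEW_SHOT_EXAMPLES
    )
    return f"{BASE_PROMPT}\n\n{shots}\n\nScript:\n{script}\nLabel:"

def classify(script: str) -> str:
    response = ollama.generate(model="llama3", prompt=build_prompt(script))
    return response["response"].strip().lower()
```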
With just a handful of examples, we observed a boost in the model’s precision and accuracy of +8% and +6%, respectively, for PowerShell classification. These gains translate to more reliable identification of malicious scripts, with no added cost in average response time.
[Figures: Baseline | Few-Shot Learning]
Method | Precision | Recall | Accuracy | F1 | Avg Response Time
---|---|---|---|---|---
Baseline | 0.78 | 1.00 | 0.86 | 0.87 | 0.74 seconds
Few-Shot Learning | 0.86 | 0.99 | 0.92 | 0.92 | 0.74 seconds
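For readers reproducing the comparison, the table metrics can be computed from a results table of predicted versus actual labels. Below is a brief sketch using scikit-learn with hypothetical column names; our actual evaluation pipeline may differ in its details.

```python
# Scoring a classification run: compare the model's predicted label against the
# ground-truth label for each sampled script. Column names here are illustrative;
# the actual output fields come from the DSDL search results.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def score(results: pd.DataFrame) -> dict:
    y_true = results["label"]          # 'malicious' or 'benign'
    y_pred = results["predicted"]      # model output, normalized to the same values
    return {
        "precision": precision_score(y_true, y_pred, pos_label="malicious"),
        "recall":    recall_score(y_true, y_pred, pos_label="malicious"),
        "accuracy":  accuracy_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred, pos_label="malicious"),
    }
```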
Here are some other tips for getting the most out of few-shot learning:
Our second method, RAG, raises the technical complexity a bit, with the aim of improving accuracy enough to make the effort worthwhile. RAG uses a vector database to store encoded data, which is retrieved to provide additional context for the prompt the user submits to the model. To use RAG with DSDL, you must run the Milvus vector database container alongside Ollama.
To support PowerShell classification, we first embed a series of malicious and benign scripts as vector data using the ‘DSDL Encoding Assistant’. Once the data is embedded (represented as numerical vectors) in the Milvus database, we use the LLM-RAG Assistant (the llm_rag_script notebook) to query the model with the new collection of data as context. The parameter top_k controls how many retrieved examples (PowerShell script samples) are selected based on their vector similarity to the input query.
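Under the hood, the encoding step amounts to embedding each script with the all-MiniLM-L6-v2 sentence transformer (384 dimensions) and writing the vectors into a Milvus collection. A minimal sketch of that flow is below, assuming the pymilvus MilvusClient API, a local Milvus instance, and placeholder sample scripts; in practice the DSDL Encoding Assistant handles this for you.

```python
# Embed labeled PowerShell scripts and store them in a Milvus collection.
# Connection details and the sample data below are illustrative assumptions.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus endpoint

if not client.has_collection("RAG_final"):
    client.create_collection(collection_name="RAG_final", dimension=384)

scripts = [
    {"text": "IEX (New-Object Net.WebClient).DownloadString('http://...')", "label": "malicious"},
    {"text": "Get-Service | Where-Object {$_.Status -eq 'Running'}",        "label": "benign"},
]

rows = [
    {"id": i, "vector": embedder.encode(s["text"]).tolist(),
     "text": s["text"], "label": s["label"]}
    for i, s in enumerate(scripts)
]
client.insert(collection_name="RAG_final", data=rows)
```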
To put all of this in context, when the model begins classifying a new sample, the script in question is first converted to vector form. Using vector distance metrics, the RAG pipeline retrieves the top_k closest stored samples and adds them to the prompt as additional context for the classification decision. Running this on our repeated random sample looks like:
index="powershell_encoded" encoded=* | eval seed="19844" | eval hash_val = tonumber(substr(md5(seed . label . Content), 1, 8), 16) | eval rand_val = hash_val / 4294967295 | eventstats count AS total_count BY label | sort label rand_val | streamstats count AS class_count BY label | where class_count <= 500 | table Content, label | rename Content as query | eval query = "You are a cybersecurity expert. Classify the intent of the PowerShell script. Choose from 'malicious' or 'benign'. Only output the category name in your answer." + query | fit MLTKContainer algo=llm_rag_script model_name=llama3 embedder_name="all-MiniLM-L6-v2" use_local=1 embedder_dimension=384 collection_name=RAG_final top_k=2 rag_type=Documents query into app:llm_rag_script
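Conceptually, the llm_rag_script notebook’s job for each row is to embed the incoming script, pull the top_k nearest stored samples from the collection, and prepend them to the prompt before querying Llama3. The sketch below illustrates that flow, reusing the same assumed components as the earlier sketches (the MiniLM embedder, a local Milvus instance, and the Ollama Python client); it is a simplification of what the notebook actually does.

```python
# RAG-style classification: retrieve the top_k most similar labeled scripts
# from Milvus and hand them to the model as context. Simplified illustration.
import ollama
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = MilvusClient(uri="http://localhost:19530")

def classify_with_rag(script: str, top_k: int = 2) -> str:
    query_vec = embedder.encode(script).tolist()
    hits = client.search(
        collection_name="RAG_final",
        data=[query_vec],
        limit=top_k,
        output_fields=["text", "label"],
    )[0]

    # Build the prompt from the retrieved neighbors plus the script to classify
    context = "\n\n".join(
        f"Similar script ({h['entity']['label']}):\n{h['entity']['text']}" for h in hits
    )
    prompt = (
        "You are a cybersecurity expert. Classify the intent of the PowerShell script. "
        "Choose from 'malicious' or 'benign'. Only output the category name in your answer.\n\n"
        f"{context}\n\nScript to classify:\n{script}"
    )
    response = ollama.generate(model="llama3", prompt=prompt)
    return response["response"].strip().lower()
```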
Using RAG, we observed our biggest improvements in precision (+19%) and accuracy (+7%) over baseline yet, with some opportunity cost in recall and response time. In tasks like PowerShell classification, where malicious samples can vary wildly (e.g., obfuscation, LOLBins, encoded strings), it’s inefficient to pack dozens or hundreds of examples into a static few-shot prompt. Instead, RAG can target only the most similar patterns using vector search, while avoiding irrelevant noise. Gathering this additional context does create a trade-off: improved accuracy comes at the cost of increased response time. Interestingly, in this case, pulling the closest four examples (top_k=4) did not improve accuracy over pulling just the closest two (top_k=2).
[Figures: Stand-Alone Inference (baseline) | Few-shot Learning (6 examples: 3 malicious / 3 benign) | LLM-RAG (top_k=2) | LLM-RAG (top_k=4)]
Method | Precision | Recall | Accuracy | F1 | Avg Response Time
---|---|---|---|---|---
Baseline | 0.78 | 1.00 | 0.86 | 0.87 | 0.74 seconds
Few-shot Learning | 0.86 | 0.99 | 0.92 | 0.92 | 0.74 seconds
LLM-RAG top_k:2 | 0.97 | 0.89 | 0.93 | 0.93 | 2.5 seconds
LLM-RAG top_k:4 | 0.96 | 0.89 | 0.93 | 0.92 | 6.3 seconds
Here are some other tips for getting the most out of RAG:
The final option for targeting a particular security problem is to fine-tune an existing pre-trained model for your specific domain or use case. Fine-tuning builds on the general language knowledge and reasoning of the base model, but slightly modifies its weights through a training process on additional data. Fine-tuning requirements vary depending on the method used, but can demand a significant investment of time, money, and compute.
While we can fine-tune Llama3, its decoder-only architecture is better suited for text generation and reasoning tasks than for classification specifically. This year, Cisco’s Foundation AI team released Foundation-Sec-8B, an open-weight model fine-tuned for cybersecurity applications and based on the Llama3-8B model. Foundation-Sec-8B is a great asset for supporting a variety of generalized security use cases.
For the PowerShell classification task at hand, an encoder-only model like RoBERTa, which is specifically designed to encode meaningful representations of its input text, is much better suited. To test this option alongside few-shot learning and RAG, we performed full fine-tuning to create a variant called neon-RoBERTa, adding a small neural network layer on top of the pretrained model to perform the binary PowerShell classification task.
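To give a sense of what this looks like in practice, the sketch below fine-tunes a RoBERTa checkpoint with a two-class classification head using the Hugging Face transformers Trainer. The dataset handling, hyperparameters, and output directory are illustrative assumptions, not our exact training configuration for neon-RoBERTa.

```python
# Full fine-tuning of roberta-base with a 2-class head for PowerShell intent.
# Data loading and hyperparameters are illustrative assumptions only.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Placeholder training data: script text plus 0/1 labels (benign/malicious)
train_ds = Dataset.from_dict({
    "text": ["Get-Process", "powershell -enc SQBFAFgA..."],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="neon-roberta",        # hypothetical checkpoint directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```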
Our fine-tuned, encoder-only model shows how well we can optimize performance and accuracy by specializing in a narrow task, but with the highest level of investment required to build the solution. Overall, this model boosts precision by 21% and accuracy by 13%, while also improving classification time by over 99% per sample. It does not, however, have the inherent natural language abilities of a decoder model like Llama3, or the security domain knowledge of Foundation-Sec-8B; it focuses solely on PowerShell classification.
[Figure: Fine-Tuning]
Method | Precision | Recall | Accuracy | F1 | Avg Response Time
---|---|---|---|---|---
Baseline | 0.78 | 1.00 | 0.86 | 0.87 | 0.74 seconds
Few-shot Learning | 0.86 | 0.99 | 0.92 | 0.92 | 0.74 seconds
LLM-RAG top_k:2 | 0.97 | 0.89 | 0.93 | 0.93 | 2.5 seconds
LLM-RAG top_k:4 | 0.96 | 0.89 | 0.93 | 0.92 | 6.3 seconds
Fine-tuning (RoBERTa) | 0.99 | 0.98 | 0.99 | 0.99 | 7.7 microseconds
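Once trained, classification is a single forward pass through the encoder, which can be batched for very fast per-sample throughput. A minimal inference sketch is below, assuming the fine-tuned checkpoint was saved locally under a hypothetical neon-roberta directory; the label mapping depends on how the model config was set during training.

```python
# Classify new scripts with the fine-tuned encoder. The checkpoint path and
# the scripts below are assumptions for illustration; labels default to
# LABEL_0 / LABEL_1 unless an id2label mapping was saved with the model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="neon-roberta",       # path to the fine-tuned checkpoint
    tokenizer="neon-roberta",
)

scripts = [
    "Get-ChildItem -Path C:\\Users -Recurse",
    "IEX (New-Object Net.WebClient).DownloadString('http://...')",
]
for script, result in zip(scripts, classifier(scripts, truncation=True)):
    print(result["label"], round(result["score"], 3), script[:60])
```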
Here are some final considerations for fine-tuning:
When applying LLMs to a security problem, there are many options beyond your base prompt that can help guide the model to a better output. We’ve explored few-shot learning and RAG with DSDL, as well as fine-tuning, and measured significant improvements from each of these methods over our baseline prompt for PowerShell classification.
Few-shot learning and RAG rely on squeezing high-quality examples and relevant information into the context length to help guide the model’s output. The larger the context length, the more information the model can use to reason effectively. When in doubt, consider adding some samples to your prompt; even a small set of examples notably boosted performance in our test case. For more complex challenges, RAG performed particularly well because it pulls only the most relevant examples into context.
Fine-tuning can tailor a model’s knowledge toward better performance in a specific domain or task. Foundation models are best suited for multi-purpose use cases, but can be guided into performing classification. However, for the best results on narrow tasks like security classification, it was simpler and faster to fine-tune an existing encoder-only model.
With the rapid pace of AI advancement, there are increasingly more analysis options and opportunities to apply them in cybersecurity. These tools and techniques can be extremely powerful at accelerating and improving detection efforts like in our model-in-the-loop threat hunting concept. We hope this research helps you better understand your options, and guides you towards getting the most out of your security data with Splunk!