Using RAG, Splunk ES Content Update App (ESCU), and MLTK to Develop, Enhance, and Analyze Splunk Detections

This blog post focuses on using a local LLM (llama3:8B) served through Ollama and improving its output with RAG (Retrieval-Augmented Generation), a technique that augments a language model's responses by retrieving data from external sources before generation. The retrieved information is then used alongside Splunk MLTK machine learning functions and the latest AI prompt features to improve the precision of Splunk detection results.

The following are the components of this research:

ESCU Llama3 RAG System

Overview

This project implements a Retrieval-Augmented Generation (RAG) system designed for ESCU (Enterprise Security Content Update) data, using local Llama3:8B inference. The system provides cybersecurity analysis by combining real attack data with AI-powered responses.

Core Functionality

Data Sources

RAG Components

Data Indexer (ESCUDataIndexer)

Input: Raw ESCU directories and files
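The internals of ESCUDataIndexer are not shown in this post, so the following is only a minimal sketch of what an indexer over raw ESCU directories might look like. The class name SimpleESCUIndexer, the file extensions, and the token-based inverted index are all assumptions made for illustration, not the actual implementation.

```python
import re
from collections import defaultdict
from pathlib import Path

TOKEN_RE = re.compile(r"[a-z0-9_]+")


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class SimpleESCUIndexer:
    """Hypothetical stand-in for ESCUDataIndexer (illustration only)."""

    def __init__(self, extensions=(".yml", ".yaml", ".json", ".conf", ".csv")):
        # Assumed set of file types worth indexing; adjust to the real data.
        self.extensions = extensions
        self.documents = {}                # path -> raw text
        self.inverted = defaultdict(set)   # token -> set of paths

    def index_directory(self, root):
        """Walk root recursively; return (documents indexed so far, files skipped)."""
        skipped = 0
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file() or path.suffix.lower() not in self.extensions:
                continue
            try:
                # errors="replace" keeps one badly encoded file from aborting the run.
                text = path.read_text(encoding="utf-8", errors="replace")
            except OSError:
                skipped += 1
                continue
            key = str(path)
            self.documents[key] = text
            for token in set(tokenize(text)):
                self.inverted[token].add(key)
        return len(self.documents), skipped

    def lookup(self, token):
        """Return the sorted list of files that contain the given token."""
        return sorted(self.inverted.get(token.lower(), ()))
```

Calling SimpleESCUIndexer().index_directory("/path/to/escu") on a local copy of the content would populate the index, after which lookup("ssh") returns the files that mention that term. The path shown is a placeholder.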

Context Generator

Input: User query + indexed ESCU data
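As a rough illustration of how a user query and indexed ESCU data could be turned into a prompt, the sketch below ranks documents by simple keyword overlap and concatenates the best matches. The function names, the overlap scoring, and the prompt wording are assumptions; the actual context generator may rank and format content differently.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def build_context(query, documents, top_k=3, max_chars=500):
    """Pick the top_k documents sharing the most tokens with the query.

    documents is a dict of name -> text. Documents with no overlap are ignored.
    """
    query_tokens = tokenize(query)
    scored = []
    for name, text in documents.items():
        overlap = len(query_tokens & tokenize(text))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    parts = []
    for _, name in scored[:top_k]:
        # Truncate each document so the prompt stays within the context window.
        parts.append(f"[{name}]\n{documents[name][:max_chars]}")
    return "\n\n".join(parts)


def build_prompt(query, context):
    """Combine the retrieved context and the user question into one prompt."""
    if not context:
        context = "No matching ESCU content was found."
    return (
        "Use the following ESCU content to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Keyword overlap is the simplest possible retriever; an embedding-based similarity search could be swapped in behind the same build_context signature.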

LLaMA3 Inference Engine
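A minimal sketch of local inference, assuming Ollama is running on its default port (11434) and that the llama3:8b model has already been pulled. It uses Ollama's /api/generate endpoint with streaming disabled; the function names, temperature, and timeout value are illustrative choices rather than the settings used in this project.

```python
import json
import urllib.request

# Default endpoint of a local Ollama server (assumed to be running).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama3:8b", temperature=0.2):
    """Build the JSON body for a single, non-streaming generation request."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"temperature": temperature},
    }


def parse_response(raw_bytes):
    """Extract the generated text from the JSON reply."""
    data = json.loads(raw_bytes.decode("utf-8"))
    return data.get("response", "").strip()


def generate(prompt, model="llama3:8b", timeout=120):
    """Send the prompt to the local model and return its answer as text."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # A generous timeout matters on modest hardware, where large prompts are slow.
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return parse_response(reply.read())
```

With the server running, generate("Summarize this detection: ...") returns the model's answer as a string.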

Based on the above elements, the following improvements were sought:

ESCU-Specific Data Understanding

Robust Error Handling

Intelligent Context Generation

Local AI Integration

Performance Optimizations

Llama3:8B ESCU RAG Metrics

Data Loading Performance

Query Response Times

Accuracy Improvements

Once the process was finished, a new model file was created and saved under Ollama so that the model can be run from MLTK or from Open WebUI if desired.
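The contents of that model file are not shown here. As a hedged illustration only, an Ollama Modelfile typically layers a system prompt and parameters on top of a base model, and a small helper like the one below could render one. The helper name, the system prompt text, the parameter values, and the model name escu-llama3 are all hypothetical.

```python
def render_modelfile(base_model, system_prompt, temperature=0.2, num_ctx=4096):
    """Render the text of an Ollama Modelfile (illustrative values only)."""
    if '"""' in system_prompt:
        # Triple quotes delimit the SYSTEM block, so they cannot appear inside it.
        raise ValueError("system_prompt must not contain triple quotes")
    lines = [
        f"FROM {base_model}",
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
        f'SYSTEM """{system_prompt}"""',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical system prompt; the real one is not given in this post.
    print(render_modelfile(
        "llama3:8b",
        "You are a security analyst assistant that explains Splunk ESCU detections.",
    ))
```

Saving the output as a file named Modelfile and running `ollama create escu-llama3 -f Modelfile` would register the model under that (hypothetical) name, after which it can be selected like any other local model.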

With these pieces in place, I tested several detection techniques, some from ESCU and some based on MLTK algorithms, using the Splunk Boss of the SOC (BOTS) datasets. The following are examples of the queries performed and of the AI feature in MLTK.

Unusual SSH login - Botsv3

AWS Brute Force Botsv3

Splunk MLTK, Clustering, and Grouping Similar Attack Patterns - Botsv3

In this example, I used Cisco ASA data along with Splunk MLTK and the trained Llama3:8B (the ESCU-refined LLM). In the following detection I used KMeans, an unsupervised machine learning algorithm that groups data points into a specified number of clusters based on their similarity. In this case, we are looking for similar attack patterns.
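To make the clustering step concrete, here is a small pure-Python sketch of the KMeans idea (Lloyd's algorithm). It is not the MLTK implementation, and the feature values, which stand in for something like per-source event counts and distinct destination ports, are made up for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen points (seeded for repeatability).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_labels == labels and new_centroids == centroids:
            break  # converged: nothing changed in this iteration
        labels, centroids = new_labels, new_centroids
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical (event_count, distinct_ports) pairs per source address.
    sample = [(1, 1), (1, 2), (2, 1), (50, 60), (51, 62), (53, 59)]
    print(kmeans(sample, k=2))
```

Sources that land in the same cluster show similar behavior on the chosen features, which is the grouping of attack patterns that the detection then passes to the LLM for interpretation.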

In the following example, I used DBSCAN, an unsupervised machine learning algorithm that clusters data points based on density, identifying groups of closely packed points and marking outliers as noise. In this specific example, I used Windows logs and targeted command-line lengths.
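The density idea can be sketched in a few lines of pure Python for a single feature such as command-line length. This is an illustration of the DBSCAN algorithm, not MLTK's implementation, and the eps, min_samples, and length values are assumptions chosen for the example.

```python
def dbscan_1d(values, eps, min_samples):
    """Cluster 1-D values by density; returns one label per value (-1 = noise)."""
    n = len(values)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # not dense enough: provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


if __name__ == "__main__":
    # Hypothetical command-line lengths; one is far longer than the rest.
    lengths = [40, 42, 45, 41, 43, 900, 44, 39]
    print(dbscan_1d(lengths, eps=10, min_samples=3))
```

Values labeled -1 do not belong to any dense group. For command-line lengths, those are the unusually long (or short) commands worth a closer look.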

Final Notes

Local LLMs can be enhanced via RAG and applied to detection development and analysis. In my experience, the bigger the model, the more accurate and powerful the results. Initially, I trained Llama4 (quantized), and the results were much more effective, especially the inferences drawn from the detections.

Unfortunately, due to hardware limitations, the connection with MLTK would time out, which forced me to downgrade to Llama3:8B. If hardware is not a constraint, however, this can be done in the enterprise with local LLMs and can definitely enhance the performance and analysis of these detections. I invite you to think of the number of use cases and applications that can be developed now that we can put Splunk, MLTK, LLMs, and ESCU together.
