Advanced threats often require advanced methods, and advanced methods can require more analysis and development time. However, detecting these behaviors, such as the use of dictionary-based Domain Generation Algorithms (DGA), can result in high-impact findings like the discovery of active malware communication. In these cases, investing the resources to hunt high-impact threats is a worthwhile effort.
In a previous post, we outlined an advanced method of the PEAK Threat Hunting Framework called Model-Assisted Threat Hunting (M-ATH), as well as criteria for when to select an M-ATH approach. Because these methods are often more complex or time-consuming, they are best suited for problems that cannot be addressed more simply. A model-based approach can facilitate the hunting process directly by helping you find leads through the output of the model, or sometimes just help reduce the initial noise in a high-volume or complex data source.
M-ATH is well suited to detecting dictionary-based DGA threats, primarily because they require a more advanced method to detect. Additionally, we can produce labeled examples using the algorithms that generate these domains, which supports model-based classification. As an added benefit, the high-volume data source in which they appear (web traffic) can be deduplicated for an efficient retroactive threat hunting exercise.
In this post, we’ll dive deeper into dictionary-based DGA as a specific example to demonstrate how you can develop your own model-based hunting, or apply a pre-trained model through Jupyter Notebooks to hunt your Splunk data, all while following the M-ATH process we’ve established for the PEAK Threat Hunting Framework. Building this type of hunt yourself requires some knowledge of data science, but the pre-trained model can be deployed without any development effort. The goal of this post is to offer a blueprint of the development process, and tools available, to make the topic more accessible to those who have an interest!
PEAK Model-Assisted Threat Hunting Method
First, let's put a plan together – this phase includes developing an understanding of the problem, evaluating the successful approaches, and locating the resources available to help.
Dictionary-based Domain Generation Algorithms evade detection by combining random dictionary words to create domain names that appear legitimate, or at least not overtly suspicious. These domains are difficult to detect because they blend in using pronounceable, known words and mirror the character distribution of legitimate English-language domains. These attributes help them evade traditional DGA classification methods, e.g., features like odd character and vowel distribution:
|Traditional DGA Domains|Dictionary-based DGA Domains|
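To make the contrast concrete, here is a minimal sketch of two lexical features of the kind traditional DGA classifiers lean on: vowel ratio and character entropy. The domain strings are invented for illustration and are not output from any real malware family, and the function names are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def vowel_ratio(label: str) -> float:
    """Fraction of alphabetic characters in the label that are vowels."""
    letters = [c for c in label.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def shannon_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of the label's characters."""
    if not label:
        return 0.0
    counts = Counter(label.lower())
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical examples, invented for illustration only.
samples = {
    "random-character style": "xkqzjvbtwpfh",
    "dictionary style": "gardenwindowpaper",
    "ordinary name": "weatherreport",
}

for kind, label in samples.items():
    print(f"{kind:24s} {label:20s} "
          f"vowels={vowel_ratio(label):.2f} entropy={shannon_entropy(label):.2f}")
```

The random-character string has no vowels at all, while the dictionary-style string looks much like an ordinary name on both measures, which is why features like these struggle to separate dictionary-based DGA from benign traffic.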
Machine learning-based classifiers perform well against dictionary-based DGA in academic research, but it is hard to anticipate how well a classifier will generalize to new word lists or changes in the underlying dictionary-DGA algorithms and evolving malware landscape. Additionally, deploying a model-based detection against a high-volume logging source like web traffic can be costly and resource-intensive. These reasons, combined with the viability of modeling approaches confirmed by existing research, and the availability of open-source data to label malicious examples, make this an ideal case for developing a model for a threat hunting exercise. For this task, we will train a model and apply it on a deduplicated list of domains, enabling a quick and efficient M-ATH method for finding threats, or at least reducing our dataset for hunting.
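The deduplication step is what keeps this approach cheap: the model scores each distinct domain once, no matter how many times it appears in the logs. A minimal sketch of that idea in Python follows. The helper name `unique_domains` and the sample URLs are hypothetical, and in practice this reduction would more likely happen in the Splunk search that feeds the model.

```python
from urllib.parse import urlsplit


def unique_domains(urls):
    """Return the distinct hostnames seen in an iterable of URLs, sorted."""
    seen = set()
    for url in urls:
        # urlsplit only populates the hostname when a scheme or leading '//'
        # is present, so bare hosts like "example.com/path" get a '//' prefix.
        if "://" not in url and not url.startswith("//"):
            url = "//" + url
        host = urlsplit(url).hostname  # lowercased, with any port removed
        if host:
            seen.add(host.rstrip("."))
    return sorted(seen)


# Hypothetical web log entries.
logged = [
    "http://Example.com/index.html",
    "https://example.com:443/login",
    "example.com/about",
    "http://news.example.org/a?b=1",
]
print(unique_domains(logged))
```

Four log entries collapse to two domains here; on real web traffic the reduction is typically far larger, since a small set of popular domains accounts for most requests.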
A Recorded Future timeline on the current state of dictionary DGA-related cyber events shows a long history of activity, and helps us identify popular malware families (Nymaim, Matsnu, Ursnif) that have steadily employed this technique over the past few years:
Recorded Future Dictionary-DGA Reference Timeline
We can learn more about how this technique is used, and which adversary groups have used it, from the MITRE ATT&CK Dynamic Resolution technique.
Now that we have identified our hunt target, we can look for data from the specific malware families using this technique. Many of the underlying algorithms and word lists have been reverse-engineered and made available by some awesome research on GitHub. For this prototype, I’ve identified six dictionary-DGA algorithms found in malware families called: pizd, suppobox, nymaim, big viktor, gozi, and matsnu. These aren’t datasets, per se, but we can use the algorithms to generate a labeled dataset of our own.
Researchers have established that Long Short-Term Memory (LSTM)-based deep learning is well-suited to detecting DGA. We will follow a similar approach to create a labeled dataset, construct and train a neural network, and apply the model to a domain list in Splunk!
It’s time to put our plan into action. This includes generating and preparing data, modifying or developing new code to test an idea, and tweaking and evaluating our model. We can also consider whether the results of our model are actionable as is, or if there is any valuable context or enrichment that can make our decision-making easier.
To train a neural network, we need a labeled dataset covering both the ‘benign’ and ‘dga’ classes. For the ‘benign’ class, we can use the Alexa 1M. For the ‘dga’ class, we can use the reverse-engineered dictionary-DGA algorithms to generate labeled samples. For this exercise, I modified the Python scripts to generate 100,000 samples from each malware family (600,000 total), and balanced the dataset with the top 600,000 domains from the Alexa 1M. This helps us avoid challenges from class imbalance, which can negatively impact the model’s training and performance.
The model development step has a lot of flexibility depending on the problem at hand. In this case, we are building an approach based on established research, so we can experiment with multiple model types and deep learning architectures to evaluate differences in their accuracy and performance. In deep learning, this can include the number and types of layers, activation functions, and overall model architecture.
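Whatever architecture is chosen, a character-level sequence model needs each domain turned into a fixed-length sequence of integers first. Below is a minimal sketch of one common way to do that. The alphabet, the padding scheme, and the `MAX_LEN` cap are assumptions made for illustration, not the settings used for the model in this post.

```python
import string

# Assumed vocabulary: characters commonly seen in hostnames.
# Index 0 is reserved for padding; unknown characters share one extra index.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN_ID = len(ALPHABET) + 1
MAX_LEN = 64  # hypothetical cap on sequence length


def encode_domain(domain: str, max_len: int = MAX_LEN) -> list:
    """Map a domain to a left-padded list of character ids of length max_len."""
    ids = [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in domain.lower()[:max_len]]
    return [0] * (max_len - len(ids)) + ids


print(encode_domain("gardenwindow.com", max_len=20))
```

The resulting integer sequences are what an embedding layer followed by an LSTM layer would typically consume, with the padding index masked out.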
Apply / Refine Model
With the Splunk App for Data Science and Deep Learning (DSDL), we can directly use Python-native data science libraries and integration with Splunk to assist in our threat hunting. Using DSDL’s golden image, we can launch JupyterLab and use a notebook to directly train and test a model. Refining the model can include experimenting with different algorithms or neural network types, evaluating your performance, and optimizing your hyperparameters.
Reviewing the epoch_accuracy and epoch_loss curves helps us visualize how quickly the model is converging on a solution. We’re seeing smooth and steady improvement (accuracy increasing on the left, and loss decreasing on the right) across each training epoch (x-axis), tapering off at a strong classification accuracy (~96%). The fact that loss is decreasing in conjunction with improved accuracy suggests the model isn’t overfitting the data.
Epoch Accuracy and Loss Curves from Model Training
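A complementary check for overfitting is to compare training loss against validation loss epoch by epoch, assuming a validation split was tracked during training. The sketch below uses invented loss values purely to illustrate the check; they are not the values from this model, and the helper names are hypothetical.

```python
def best_epoch(val_loss):
    """Index of the epoch with the lowest validation loss (first one on ties)."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)


def overfitting_gap(train_loss, val_loss):
    """Per-epoch difference between validation loss and training loss."""
    return [v - t for t, v in zip(train_loss, val_loss)]


# Hypothetical per-epoch losses, invented to illustrate the check.
train_loss = [0.60, 0.35, 0.22, 0.15, 0.11, 0.08]
val_loss = [0.62, 0.38, 0.26, 0.21, 0.22, 0.25]

print("lowest validation loss at epoch index", best_epoch(val_loss))
print("gap at final epoch:", round(overfitting_gap(train_loss, val_loss)[-1], 2))
```

In this invented series, training loss keeps falling while validation loss turns upward after its minimum, and that widening gap is the pattern that would signal overfitting.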
To validate the model, I held out 100,000 domains from the training data using a stratified sample. Stratifying the sample ensures that each labeled category makes up an equal share of the test set (14,285 domains each).
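An equal-allocation holdout like this can be sketched in a few lines. The helper name and the row layout are hypothetical. As a reading of the numbers rather than a stated fact, 100,000 split evenly across seven categories gives 14,285 per category, which would be consistent with six malware families plus benign.

```python
import random
from collections import defaultdict


def equal_stratified_holdout(rows, label_of, total, seed=0):
    """Split rows into (train, test) with the same number of test rows per category."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    per_label = total // len(by_label)  # e.g. 100_000 // 7 == 14285
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        if len(group) < per_label:
            raise ValueError(f"category {label!r} has too few rows")
        test += group[:per_label]
        train += group[per_label:]
    return train, test


# Tiny illustrative dataset: (domain, category) pairs.
rows = [(f"a{i}.com", "benign") for i in range(10)] + \
       [(f"b{i}.com", "family_a") for i in range(10)]
train, test = equal_stratified_holdout(rows, label_of=lambda r: r[1], total=6)
print(len(train), len(test))
```

Holding out rows before training, rather than sampling from data the model has already seen, is what makes the resulting accuracy figures meaningful.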
Confusion Matrix measuring classification performance
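For readers who want to reproduce this kind of evaluation, the four cells of a binary confusion matrix, plus precision and recall, can be computed directly from the true and predicted labels. The labels below are invented for illustration and do not reflect the results of this model.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels where 1 = dga and 0 = benign."""
    pairs = Counter(zip(y_true, y_pred))
    return {
        "tp": pairs[(1, 1)],  # dga correctly flagged
        "fp": pairs[(0, 1)],  # benign flagged as dga
        "fn": pairs[(1, 0)],  # dga missed
        "tn": pairs[(0, 0)],  # benign correctly passed
    }


def precision_recall(c):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    flagged = c["tp"] + c["fp"]
    actual = c["tp"] + c["fn"]
    precision = c["tp"] / flagged if flagged else 0.0
    recall = c["tp"] / actual if actual else 0.0
    return precision, recall


# Hypothetical labels, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, precision_recall(counts))
```

For a hunt, false positives cost analyst time and false negatives are missed threats, so looking at both cells is more informative than accuracy alone.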
For this hunt, I’ve shared a notebook that is set to receive data pushed from Splunk, load the pre-trained LSTM model, and apply a classification score to real-world domains. The notebook also contains a configurable confidence threshold (set at >=60% confidence by default) that the analyst can tune to adjust the volume of findings.
M-ATH-based Hunting Notebook in DSDL JupyterLab
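The thresholding step amounts to a simple filter over the model's scores. The sketch below is not the notebook's code; the function name, the score format (a 0 to 1 confidence per domain), and the sample scores are assumptions for illustration.

```python
def findings_above(scores, threshold=0.60):
    """Return (domain, score) pairs at or above the threshold, highest first."""
    hits = [(domain, score) for domain, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical model output: domain -> confidence that the domain is DGA.
scores = {
    "gardenwindowpaper.example": 0.91,
    "weatherreport.example": 0.12,
    "silverbridgemarket.example": 0.64,
}

for domain, score in findings_above(scores):
    print(f"{score:.2f}  {domain}")
```

Raising the threshold trades recall for a shorter, higher-confidence list, while lowering it gives the analyst more leads to review.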
Analyze / Escalate Critical Findings
Analyzing your results should include enrichment from any resources available to your organization. This can include resources for mapping domains to Internet Protocol (IP) Address ranges, Autonomous System Numbers (ASN), registrant information, registration and expiration dates, subdomains, domain reputation, etc. – any details that will assist analysts with making a good / bad classification decision.
Sample Prediction Results
In the final phase, we act on our findings. Chase down any leads you’ve found to a resolution, and step back to evaluate your overall process. Was this a successful hunt? Can we operationalize this classifier as a precise detection? Are there pieces of our development process (feature engineering, model selection, hyperparameter optimization) that should be preserved for reuse?
Preserving your hunt can include exporting your trained models, your artifacts from feature engineering, your notebooks, your datasets, and your working notes from the development process – anything that will be useful for a hunter revisiting this topic.
If your detection quality performs well enough to build into a permanent detection, you can go down that route. However, you should consider the potential for this threat to change in the future. What will happen if, for example, an advertiser makes a change to their domain name structure and starts breaking your analytic? Don’t rush to implement a model that may not be easy to interpret, or to detect threats you don’t have confidence will continue to generalize well moving forward.
If a model-based method is performing to your satisfaction, there are a few options for operationalizing the model. With Splunk Machine Learning Toolkit (MLTK 5.4), we can import and share our trained models by directly importing them into Splunk from the Open Neural Network Exchange (ONNX) format, or host the model in DSDL and apply it in MLTK via the apply command. Alternatively, we can apply the model directly in Jupyter Notebooks using DSDL by querying the Splunk REST API, or pushing data from Splunk into the DSDL container.
In this step, wrap up your findings and communicate your results. Be sure to check out our metrics post for tips on accurately measuring the success of your hunt, and communicating the value of your hunt activities!
This post demonstrates the capability and power available when integrating data science-based methods to hunt your Splunk data. Applying M-ATH through the PEAK methodology can help you find advanced threats and put you on a path to find deeper insights in your data. I hope this work encourages you to try your own M-ATH approaches, or give this one a try! For more Splunk security research, rapid response guides, and PEAK Threat Hunting updates, subscribe to our team's SURGe alerts!