The SMLS team enables Splunk customers to find obscure and buried threats in large amounts of data through expert analytics. This work is part of a set of machine learning detections built by a specialized team of security-focused data scientists working in concert with Splunk’s threat research teams to help Splunk customers sift through vast amounts of data to identify and alert users of suspicious content.
Recent threat research indicates that a large percentage of organizations have experienced DNS attacks. Adversaries can establish Command and Control infrastructure through various mechanisms, one of which is Dynamic Resolution using Domain Generation Algorithms (DGA). As malware families evolve, it will only get more challenging for defenders to detect, block and track these threats in real time. The SMLS team has developed a detection in the Enterprise Security Content Update (ESCU) app that predicts DGA generated domains using a pre-trained Deep Learning (DL) model. The model is deployed using the Splunk App for Data Science and Deep Learning (DSDL) and further details can be found here.
Adversaries communicate with a compromised host to either send instructions or retrieve data using a tactic known as Command and Control (C2). The C2 infrastructure that adversaries control is leveraged for ransomware, data theft, lateral movement and other malicious activity.
Static domain names and IP addresses of C2 systems can quickly be blocked, and it is time consuming for adversaries to stand up new C2 infrastructure. Adversaries therefore often generate domain names dynamically using DGAs, creating C2 infrastructure that is not easily disrupted by static analysis. DGAs dynamically generate second-level domain names either on demand or periodically using a seed. The seed can be anything, such as a time factor, daily temperatures or a trending topic on social media, as long as it lets both the adversaries and the malware-infected host reproduce the same domain names. DGA generated domains can also be a concatenation of English dictionary words, which makes detection more complicated. Compromised hosts and malware backend code are synchronized on the DGA and its seed, and adversaries register a subset of DGA generated domains before the compromised hosts reach out for C2.
The first tracked use of DGA for malware dates back to 2008, when the worms Conficker.a and Conficker.b generated 250 domain names per day, while Conficker.c generated 50,000 domain names every single day. Most recently, the Orchard botnet generated DGA domains using bitcoin account transaction information instead of a time-based seed, making them extremely unpredictable.
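To make the role of the seed concrete, here is a minimal toy sketch of a time-seeded generator. It is not the algorithm of any real malware family; the function name `toy_dga`, the use of SHA-256, the label length and the `.com` suffix are all assumptions chosen for illustration. It only shows why the adversary and the infected host can independently arrive at the same list of domains.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed_date: date, count: int = 5, tld: str = ".com") -> List[str]:
    """Derive `count` pseudo-random domains from a date seed (illustrative only)."""
    domains = []
    for i in range(count):
        # Hash the seed plus a counter; anyone who knows the seed can recompute this.
        digest = hashlib.sha256(f"{seed_date.isoformat()}-{i}".encode()).hexdigest()
        # Map the first 12 hex digits to the letters a-p to build the label.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains


malware_side = toy_dga(date(2022, 8, 1))
adversary_side = toy_dga(date(2022, 8, 1))
next_day = toy_dga(date(2022, 8, 2))
print(malware_side == adversary_side)  # same seed, same domains
print(malware_side == next_day)        # a new seed yields a different list
```

Because the list changes with every seed, blocking yesterday's domains does little to stop today's C2 traffic, which is the property the detection has to work around.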
The DGA App for Splunk, developed by Philipp Drieger, detects malicious domain names using the Machine Learning Toolkit (MLTK). The DGA app generates tf-idf features from domain text and uses classic ML approaches for classifying DGA domains. Our work explored how DL approaches can be more robust and accurate in identifying complex patterns compared to classic ML approaches. Content Delivery Networks (CDNs) are known to use DGA, which triggers the detection and causes false positives. We experimented with deep learning architectures and trained our model on huge datasets comprising various DGA families and non-DGA domains to reduce the number of false positives. The next few sections discuss the model architecture and evaluation metrics in detail.
DGA Domain Detection
The complexity of DGAs has made manual reverse engineering or pattern matching efforts extremely complicated as there are potentially numerous algorithms. This is where ML can be useful.
Classic ML approaches like the Random Forest model have been quite successful in the past. These approaches memorize simple patterns, which makes them effective and interpretable, but generalizing requires more feature engineering. On the other hand, DL approaches generalize better on unseen patterns with less feature engineering but tend to over-generalize by distorting simple patterns in the data.
For the DGA detection, we use a non-sequential neural network architecture called wide-deep learning. This approach allows the neural network to learn both simple and deep patterns using wide and deep paths, thereby taking the best of both approaches. The wide path consists of a subset of features derived from the feature engineering process (e.g., features characterizing the domain names). The deep path is a typical sequential DL model where the input, the domain name, passes through several layers such as tokenization, embedding and feed-forward neural networks. Let’s take a deep dive into the wide and deep paths separately and see how we can combine them into a wide-deep architecture.
During the data exploration phase, we computed a variety of features such as the length of the domain, vowel ratio, consonant ratio, digit ratio, hexadecimal ratio and entropy, and analyzed how well they correlate with the class label (DGA or non-DGA). Other useful features such as n-gram similarity scores with English dictionary words and non-DGA domains were also considered.
We handpicked the features listed below that best discriminate between the two classes and used them as features for the wide-part of the wide-deep model.
- Length of the domain: DGA domains are usually longer than non-DGA domains.
- Shannon entropy of the domain: Entropy of character distribution in DGA domains is higher than that of non-DGA domains.
- N-gram similarity score with English dictionary words: A score was computed to indicate the degree of similarity of domain name with English dictionary words. This feature tells us if DGA domains are a concatenation of English dictionary words and appear as genuine domains.
- N-gram similarity score with non-DGA domains: In the case where domain names are not a simple concatenation of English dictionary words, we use a score to indicate the degree of similarity of domain names with non-DGA domains. This feature tells us if DGA domains appear as non-DGA domains.
- Is it present in frequently visited domains: A flag that indicates whether the domain is present in frequently visited domains. An assumption we make is that if a domain is among frequently visited domains, then it is less likely to be suspicious.
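The wide features listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the feature pipeline used for the released model: the source does not specify how the n-gram similarity score is computed, so the sketch assumes character trigrams and scores a domain by the share of its trigrams that also occur in a reference corpus. The reference lists (`DICTIONARY_WORDS`, `BENIGN_DOMAINS`, `TOP_DOMAINS`) are tiny hypothetical stand-ins.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in Counter(text).values())


def char_ngrams(text: str, n: int = 3) -> set:
    """All distinct character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(domain: str, reference_ngrams: set, n: int = 3) -> float:
    """Share of the domain's n-grams that also occur in the reference corpus."""
    grams = char_ngrams(domain, n)
    if not grams:
        return 0.0
    return len(grams & reference_ngrams) / len(grams)


# Hypothetical, tiny reference sets; real ones would be far larger.
DICTIONARY_WORDS = ["secure", "cloud", "mail", "search", "update"]
BENIGN_DOMAINS = ["google", "splunk", "wikipedia", "github"]
TOP_DOMAINS = {"google", "splunk", "wikipedia", "github"}

WORD_NGRAMS = set().union(*(char_ngrams(w) for w in DICTIONARY_WORDS))
BENIGN_NGRAMS = set().union(*(char_ngrams(d) for d in BENIGN_DOMAINS))


def wide_features(domain: str) -> dict:
    """Compute the five wide-path features for a second-level domain label."""
    return {
        "length": len(domain),
        "entropy": shannon_entropy(domain),
        "dict_similarity": ngram_similarity(domain, WORD_NGRAMS),
        "benign_similarity": ngram_similarity(domain, BENIGN_NGRAMS),
        "is_top_domain": int(domain in TOP_DOMAINS),
    }


print(wide_features("splunk"))
print(wide_features("xkqzjwvbtq"))
```

In this toy setup a random-looking label scores higher on entropy and lower on both similarity scores than a well-known name, which is the separation the wide path relies on.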
The deep path of the wide-deep model consists of a sequential deep learning architecture, where a single input passes through several layers before an output can be computed. We first vectorize the domain name using a Tokenization layer that splits up the domain text by character and converts them into integers containing indexes of each character. The domain text is usually made up of a-z, A-Z, 0-9 and a few special characters.
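As a rough illustration of character-level tokenization, the sketch below maps each character to an integer index and pads the result to a fixed length. The vocabulary, the reserved padding and unknown indexes, the lowercasing step and the maximum length of 20 are assumptions made for this example; the source does not state the actual values.

```python
import string

# Assumed vocabulary: lowercase letters, digits and a few separators.
# Index 0 is reserved for padding and index 1 for unknown characters.
VOCAB = string.ascii_lowercase + string.digits + "-._"
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(VOCAB)}
PAD, UNK = 0, 1


def tokenize(domain: str, max_len: int = 20) -> list:
    """Convert a domain into a fixed-length list of character indexes."""
    # Lowercasing is an assumption here; domain names are case-insensitive.
    ids = [CHAR_TO_INDEX.get(ch, UNK) for ch in domain.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


print(tokenize("ab1", max_len=5))  # [2, 3, 29, 0, 0]
```

A fixed-length integer sequence like this is the form an embedding layer expects as input.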
The tokenized text output is then processed by an Embedding layer. Embeddings are dense numerical representations in vector space which quantify the semantic similarity between domains. Subsequently, the dense embeddings are processed by a Long Short-Term Memory (LSTM) layer, a special variant of Recurrent Neural Networks that is suitable for learning long-term dependencies in the input. For the deep path, we use an LSTM comprising a single layer whose number of neurons/hidden units matches the embedding output dimension. The activation function used is the Rectified Linear Unit (ReLU). Additionally, the dropout rate is set to 0.5 to avoid model overfitting.
Combining Wide and Deep Paths
The features from the wide path and the output of the LSTM layer are combined to form a dense input vector, which is then processed by an output dense layer with a Sigmoid activation function. The output of the final dense layer is a probability score indicating how likely a domain is to be DGA generated. The threshold for classifying a domain as DGA generated is set at 0.5. This score is extremely useful in tuning the final SPL detection to the Splunk customer’s risk/noise tolerance level.
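The combination step can be illustrated with a small pure-Python forward pass: concatenate the wide features with the deep-path output, apply one sigmoid unit, and compare the score against the threshold. The vectors, weights and bias below are made-up numbers for illustration only; in the real model they are learned during training, and the deep vector would come from the LSTM layer.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def wide_deep_score(wide, deep, weights, bias):
    """Concatenate both paths and apply a single sigmoid output unit."""
    combined = list(wide) + list(deep)
    if len(combined) != len(weights):
        raise ValueError("weights must match the combined vector length")
    logit = sum(w * x for w, x in zip(weights, combined)) + bias
    return sigmoid(logit)


def is_dga(score: float, threshold: float = 0.5) -> bool:
    """Classify as DGA generated when the score exceeds the threshold."""
    return score > threshold


# Hypothetical inputs: two scaled wide features and a three-unit deep output.
wide = [0.8, 0.1]
deep = [0.3, 0.9, 0.2]
weights = [1.5, -2.0, 0.5, 1.0, -0.5]
score = wide_deep_score(wide, deep, weights, bias=-0.4)
print(round(score, 3), is_dga(score))
```

Raising the threshold above 0.5 trades some recall for fewer alerts, which is the tuning knob the detection exposes.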
As there are many hyper-parameters used in designing a DL model (e.g., number of layers, number of units, activation functions and dropout rate), we performed hyper-parameter tuning. Hyper-parameter tuning evaluates model variants built from combinations of parameter values in a specified range and selects the variant with the best performance.
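One common way to carry out such tuning is an exhaustive grid search, sketched below. The source does not say which search strategy or value ranges were used, so the search space is hypothetical, and `mock_validation_score` is a stand-in for training a model variant and scoring it on validation data.

```python
from itertools import product

# Hypothetical search space; the actual ranges are not given in the source.
SEARCH_SPACE = {
    "lstm_units": [64, 128],
    "dropout": [0.3, 0.5],
    "activation": ["relu", "tanh"],
}


def grid_search(space, evaluate):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(space)
    best_params, best_score = None, float("-inf")
    for values in product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def mock_validation_score(params):
    """Stand-in for training a model variant and measuring validation performance."""
    score = 0.90
    score += 0.03 if params["lstm_units"] == 128 else 0.0
    score += 0.02 if params["dropout"] == 0.5 else 0.0
    score += 0.01 if params["activation"] == "relu" else 0.0
    return score


best_params, best_score = grid_search(SEARCH_SPACE, mock_validation_score)
print(best_params, round(best_score, 2))
```

Grid search grows multiplicatively with each added parameter, so random or Bayesian search is often preferred for larger spaces.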
Training on GPUs
Since training DL models on large volumes of data can take hours if not days, we leveraged GPUs, which are specialized for performing advanced mathematical transformations, for our model computation. There are two main approaches to distributing training across GPUs: Data Parallelism and Model Parallelism. We used the Data Parallelism approach, where the data is divided into splits and each split is processed by a GPU. The model variables are mirrored across all GPUs and are kept in sync by the GPU supporting library. For the implementation, we used TensorFlow 2.2 and its MirroredStrategy for working with GPUs.
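The idea behind synchronous data parallelism can be shown with a pure-Python simulation. This is not TensorFlow code and does not use any GPU: each "replica" holds a mirrored copy of a single weight, computes a gradient on its own shard of the batch, the gradients are averaged (the all-reduce step), and every replica applies the identical update. The one-parameter model, the data and the learning rate are assumptions chosen to keep the example small.

```python
def gradient(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def split(data, num_replicas):
    """Divide the batch into equally sized shards, one per replica."""
    size = len(data) // num_replicas
    return [data[i * size:(i + 1) * size] for i in range(num_replicas)]


def data_parallel_step(weights, data, lr=0.01):
    """One synchronous step: local gradients, averaged, then an identical update."""
    shards = split(data, len(weights))
    local_grads = [gradient(w, shard) for w, shard in zip(weights, shards)]
    mean_grad = sum(local_grads) / len(local_grads)  # simulated all-reduce
    return [w - lr * mean_grad for w in weights]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
replicas = [0.0, 0.0]  # the same weight mirrored on two simulated devices
for _ in range(200):
    replicas = data_parallel_step(replicas, data)
print(replicas)
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so the mirrored weights stay identical and training behaves like a single-device run on the whole batch.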
We experimented with various DL architectures, ML approaches and benchmarked all models on a test data set. The wide-deep model was chosen because it had overall optimal performance, specifically with a low false negative rate. This section discusses key model evaluation parameters.
- The Receiver Operating Characteristic (ROC) curve shows the performance of the classifier at all possible thresholds by comparing the True Positive Rate and False Positive Rate. The Area Under the ROC Curve (AUC) is 0.99974 (1 for perfect classification), which indicates the model is able to distinguish between the classes.
- The Confusion Matrix describes the classifier performance by comparing actual and predicted values. The bottom left quadrant is critical for security classification problems; for this problem, it shows how many DGA generated domains are classified as non-DGA domains. This metric, the False Negative Rate, is extremely low at 0.04%. The False Positive Rate, which indicates how many non-DGA domains are classified as DGA generated domains, is also low at 1.03%, ensuring that the detection does not cause alert fatigue.
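For readers who want to see how these rates follow from a confusion matrix, here is a short sketch. The counts are hypothetical and were picked only so that the False Negative Rate and False Positive Rate match the reported 0.04% and 1.03%; the source reports rates, not counts. Accuracy depends on the class balance of the test set, so this example is not expected to reproduce the reported accuracy.

```python
def classification_rates(tp, fn, fp, tn):
    """Derive the standard rates from confusion-matrix counts (DGA = positive class)."""
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts: 10,000 DGA domains and 10,000 non-DGA domains.
rates = classification_rates(tp=9996, fn=4, fp=103, tn=9897)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

Note that the True Positive Rate and the False Negative Rate always sum to 1, which is why a 0.04% False Negative Rate corresponds to a 99.96% True Positive Rate.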
Putting It All Together
Since the model is pre-trained and is made available here, the ESCU detection uses just the “apply” command to classify domains using the pre-trained model. For the detection to work successfully, the pre-trained DGA model must be deployed in a container using the Splunk App for Data Science and Deep Learning (DSDL). The instructions to deploy the model using DSDL are here.
The ESCU DGA detection is based on the Network Resolution data model. The DNS.query field is a fully qualified domain name, which is the input to the classification model. The apply command invokes the model from the DSDL container using a list of unique query values. The results of the search are the queries/domains that are most likely DGA domains, i.e., those with a dga_score > 0.5. This threshold, currently set to 0.5, can be fine-tuned in the detection.
| tstats `security_content_summariesonly` values(DNS.answer) as IPs min(_time) as firstTime max(_time) as lastTime from datamodel=Network_Resolution by DNS.src, DNS.query
| `drop_dm_object_name("DNS")`
| rename query AS domain
| fields IPs, src, domain, firstTime, lastTime
| apply pretrained_dga_model_dsdl
| rename pred_dga_proba AS dga_score
| where dga_score>0.5
| `security_content_ctime(firstTime)`
| `security_content_ctime(lastTime)`
| table src, domain, IPs, firstTime, lastTime, dga_score
| `detect_dga_domains_using_pretrained_model_in_dsdl_filter`
Domains that the pre-trained DL model flags as suspicious are then ingested into the Risk data model Risk.All_Risk for Risk-Based Alerting (RBA), where each risk event is associated with a source, hostname and risk score.
The Risk incident rules ATT&CK Tactic Threshold Exceeded For Object Over Previous 7 Days and Risk Threshold Exceeded For Object Over 24 Hour Period act upon risk events to generate notables that need immediate action. Below is an example showing the Splunk Enterprise Security (ES) Incident Review page, which lists recent notables for the analyst. A notable is generated because the aggregated risk score over the past 24 hours for a risk object accessing DGA domains exceeds the predefined risk score threshold. The details show the sources (correlation searches) contributing to this notable.
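The 24-hour aggregation logic described above can be approximated outside Splunk with a few lines of Python. This is a conceptual sketch only, not how Enterprise Security implements its risk incident rules: the event fields, the threshold of 100 and the per-event score of 60 are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def objects_over_threshold(risk_events, now, threshold=100, window=timedelta(hours=24)):
    """Sum risk scores per risk object inside the window; keep those above the threshold."""
    totals = defaultdict(float)
    for event in risk_events:
        if now - window <= event["time"] <= now:
            totals[event["risk_object"]] += event["risk_score"]
    return {obj: total for obj, total in totals.items() if total > threshold}


now = datetime(2022, 9, 1, 12, 0)
events = [
    {"risk_object": "host-a", "risk_score": 60, "time": now - timedelta(hours=2)},
    {"risk_object": "host-a", "risk_score": 60, "time": now - timedelta(hours=20)},
    {"risk_object": "host-b", "risk_score": 60, "time": now - timedelta(hours=3)},
    # This event falls outside the 24-hour window and is ignored.
    {"risk_object": "host-b", "risk_score": 60, "time": now - timedelta(hours=30)},
]
notables = objects_over_threshold(events, now)
print(notables)  # {'host-a': 120.0}
```

A single DGA hit does not raise a notable on its own here; only repeated or corroborating risk events for the same object push it over the threshold.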
The ESCU DGA detection uses a simple “apply” command with a single feature, DNS.query, to determine whether a domain is DGA generated or not. Detecting DGA domains becomes challenging when they are generated in volume and when generation algorithms get complex. The DGA detection model we developed has been trained on a large representative dataset and tuned for optimal performance. It was chosen after benchmarking across various deep learning architectures and classic ML approaches. The classification accuracy for this model is 99.37%, with an extremely low False Negative Rate of 0.04% and a True Positive Rate of 99.96%.
Any feedback or requests? Feel free to put in an issue on GitHub and we’ll follow up. Alternatively, join us on the Slack channel #security-research. Follow these instructions if you need an invitation to our Splunk user groups on Slack.
Special thanks to Philipp Drieger, Splunk Threat Research, and the Splunk Product Marketing Team for your support in releasing this detection in ESCU v3.50.0.
This blog was co-authored by Abhinav Mishra (Principal Applied Scientist), Kumar Sharad (Senior Threat Researcher) and Namratha Sreekanta (Senior Software Engineer).