Large amounts of data no longer reside within siloed applications. A global workforce, combined with the growing need for data, is driving an increasingly distributed and complex attack surface that needs to be protected. Sophisticated cyberattacks can easily hide inside this data-centric world, making traditional perimeter-only security models obsolete. The complexity of this interconnected ecosystem now requires one to assume that the adversary is already within the network and consequently must be detected there, not just at the perimeter.
Log Parsing Is the First Step in Cybersecurity
What Are Machine Logs
Machine logs are generated by appliances, applications, machinery, and networking equipment (switches, routers, firewalls, etc.). Every event, along with its associated information, is written sequentially to a log file. Although some logs are written in a structured format (e.g., JSON or XML), many applications write logs in an unstructured, hard-to-digest way.
Before Mining Logs, We Must Parse Them
We can use artificial intelligence techniques on logs to detect cybersecurity threats within the network, but we must first take the raw, unstructured logs and transform them into an easily digestible, structured format. This transformation is called “log parsing”: all the different entities, or “fields”, are extracted from the unstructured text. The resulting structured data can then be fed into downstream cybersecurity pipelines.
For example, the image below shows the desired input for a log parser (an unstructured log), and its output (a structured map from field name to its value):
The log parser is extracting the following fields: timestamps, dvc (device number), IP addresses, port numbers, etc.
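As a rough illustration of the input and output shapes involved, the sketch below parses a single log line into a map from field name to value. The log line, the field names, and the hand-written regex are all invented for this example; the regex only stands in for whatever parser produces the structured output.

```python
import re

# Hypothetical firewall-style log line, invented for illustration only.
RAW_LOG = (
    "Jan 12 10:15:32 fw-01 Built outbound TCP connection "
    "from 10.0.0.5/51234 to 203.0.113.7/443"
)

# Hand-written pattern with one named group per field we want to extract.
PATTERN = re.compile(
    r"(?P<timestamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<dvc>\S+) .*?"
    r"from (?P<src_ip>\d{1,3}(?:\.\d{1,3}){3})/(?P<src_port>\d+) "
    r"to (?P<dest_ip>\d{1,3}(?:\.\d{1,3}){3})/(?P<dest_port>\d+)"
)


def parse_log(line):
    """Return a dict of field name -> value, or an empty dict if nothing matches."""
    match = PATTERN.search(line)
    return match.groupdict() if match else {}


if __name__ == "__main__":
    for name, value in parse_log(RAW_LOG).items():
        print(f"{name}: {value}")
```

Whatever technique does the parsing, the contract stays the same: one unstructured string in, one dictionary of named fields out.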
Given the volume (petabytes per day) and value of the data within machine logs, log parsing must be scalable, accurate, and cost efficient. Historically, this has been solved using complex sets of rules, but new approaches, combined with increases in computational power, are enabling fast log parsing using neural networks, providing significant parsing advantages.
In this blog, we’ll walk you through how we use machine learning (ML) to solve log parsing, the experiments we ran in collaboration with NVIDIA to determine how to deploy our models at scale using the NVIDIA Triton Inference Server, and our next steps.
Why Use ML For Log Parsing
Traditionally, log parsing has been done with regular expression matching, called regexes. While this was a great initial solution, regexes come with several big challenges:
- Writing a set of regexes in the first place is hard. Each one must be neither too tightly nor too loosely constrained, and striking that balance across hundreds of regexes is extremely difficult.
- It is hard to manage regexes for every application and its version.
- Regexes easily break when there is even a slight variation in the input. This can happen because of log format changes, bugs, software updates, etc.
- It can be more computationally expensive to run hundreds of regexes over every log than to run a machine learning model a single time.
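The brittleness described in the list above can be seen in a small sketch. The rule and both log lines are hypothetical: the second line imagines a software update that changes only the timestamp format, which is enough to make the rule stop matching entirely.

```python
import re

# Hypothetical rule written against one specific log format.
RULE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)

# Invented log lines: the same event before and after a format change.
ORIGINAL = "2021-01-12 10:15:32 INFO Saved output of task"
UPDATED = "2021-01-12T10:15:32.123 INFO Saved output of task"


def extract(line):
    """Return the extracted fields, or None when the rule does not match."""
    match = RULE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(extract(ORIGINAL))  # all three fields are extracted
    print(extract(UPDATED))   # None: the rule silently stops matching
```

Nothing raises an error in the second case; the fields simply go missing, which is why this kind of breakage can go unnoticed in production.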
Compared with regexes, machine learning models are easier to train and manage, and they are more robust to changes in the underlying logs.
This can be attributed to the representation power of ML models, as well as the similarities in the logs generated by different applications. As an example, consider the similarities between Hadoop and Spark logs below:
Although the values may vary across different log types, many fields are still shared. We can take advantage of this similarity to build robust ML models with a deeper understanding of machine logs.
Solving Log Parsing With ML
We first wondered: how much benefit is there to using ML, and what is the best way to model this problem?
Why Template-Generation Algorithms Don’t Solve Our Problem
We started by experimenting with unsupervised, template-based log parsing algorithms like Drain and Spell, which automatically generate a set of “templates”, or regular expressions, that can parse logs. Each template contains “wildcards”, denoted by <*>, which mark the locations of variable tokens within the log.
For example, below are two templates generated for Spark logs:
<*> Reading broadcast variable <*> took <*> ms
<*> Saved output of task 'attempt_<*>' to hdfs://<*>
Since these approaches are unsupervised, additional manual work is required to map from fields to one or more wildcards. For example, the first wildcard in each template above should be labeled as “timestamp,” but Drain and Spell cannot produce a label for these wildcards, since the field names do not appear in the logs.
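To make the labeling gap concrete, here is a minimal sketch that applies the first template above to a log line. It is not Drain or Spell: it only converts a finished template into a regex, and the helper name and the log line are invented for illustration.

```python
import re


def template_to_regex(template):
    """Turn a template into a regex where every <*> becomes a capture group."""
    literal_parts = [re.escape(part) for part in template.split("<*>")]
    return re.compile("^" + "(.*?)".join(literal_parts) + "$")


TEMPLATE = "<*> Reading broadcast variable <*> took <*> ms"
# Hypothetical log line that fits the template.
LOG = "17/06/09 20:10:46 Reading broadcast variable 9 took 12 ms"


def extract_wildcards(template, line):
    """Return the wildcard values in order, or None if the line does not fit."""
    match = template_to_regex(template).match(line)
    return match.groups() if match else None


if __name__ == "__main__":
    print(extract_wildcards(TEMPLATE, LOG))
```

The result is a positional tuple of values. Nothing in it says that the first value is a timestamp or that the last is a duration, so a person still has to attach field names to each wildcard of each template.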
If the number of templates were small, a human could manually label each wildcard. However, because the number of templates and wildcards can grow linearly with respect to the number of logs, this approach quickly becomes infeasible.
The graphs below plot the number of unique templates (in blue) and the number of wildcards (in orange) generated by Drain vs. the number of events seen. We can see the number of wildcards increasing linearly:
Beyond the fact that template-based algorithms produce too many templates and don’t produce labeled fields, a deeper issue is that these approaches are not truly learning algorithms in that (1) they don’t necessarily get better with more data, and (2) there is no practical way to tune them to specific datasets. In fact, as we see in the graphs above, the number of templates proliferates with more data, leading to issues with noise, rather than stabilizing as a complete representation is learned. In addition, some extracted fields and templates might be rare and require more examples, while some are common and easier to extract, but these algorithms aren’t able to learn this or to treat these cases differently.
While these algorithms do not solve our problem, they can be used for log clustering, which has many other valuable applications.
Shifting to Named Entity Recognition
Since template-based algorithms didn’t solve our problem, we shifted our focus to the natural language processing task of named entity recognition (NER). Supervised NER models do exactly what we need: they find entities (“fields”, in our case) within the text. Below, we can see an example output from an NER model for Cisco ASA firewall logs:
As you can see, the NER model can identify timestamps, dvc (device number), IP addresses, port numbers, etc.
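An NER model typically emits one label per token, and a small post-processing step turns those labels into fields. The sketch below assumes a BIO tagging scheme (B- starts a field, I- continues it, O is outside any field), which is a common convention but not something this post specifies; the tokens and tags are invented stand-ins for model output.

```python
def tags_to_fields(tokens, tags):
    """Merge BIO-tagged tokens into a dict of field name -> field text."""
    spans = []         # each entry: (label, [tokens belonging to that span])
    open_label = None  # label of the span currently being extended, if any
    for token, tag in zip(tokens, tags):
        if tag == "O":
            open_label = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and label == open_label:
            spans[-1][1].append(token)      # continue the current span
        else:
            spans.append((label, [token]))  # start a new span
            open_label = label
    return {label: " ".join(parts) for label, parts in spans}


# Invented tokens and tags, standing in for the output of a trained model.
TOKENS = ["Jan", "12", "10:15:32", "fw-01", "Built", "connection",
          "from", "10.0.0.5", "to", "203.0.113.7"]
TAGS = ["B-timestamp", "I-timestamp", "I-timestamp", "B-dvc", "O", "O",
        "O", "B-src_ip", "O", "B-dest_ip"]

if __name__ == "__main__":
    print(tags_to_fields(TOKENS, TAGS))
```

Unlike the template wildcards, every extracted value arrives with a field name, because the labels are part of what the supervised model learns.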
Evaluating CRF, LSTM, and Transformer Models For NER
To evaluate the 3 ML model architectures, we ran experiments on dozens of “easy” datasets where all the models perform incredibly well, but below we focus on the accuracy for three “hard” datasets.
Because all models get near-perfect accuracy when the training and testing data are similar, we altered these datasets to specifically test for robustness. For the Spark and Hadoop datasets, we sort the logs by date, using the earliest logs for training and the later ones for testing. The Cisco ASA dataset is even more difficult because we train on simulated logs that were generated using gogen and evaluate on real-world Cisco ASA logs.
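A date-ordered split of this kind, where the model trains on older logs and is tested on newer ones, might look like the following sketch. The 80/20 fraction, the function name, and the sample records are assumptions made for illustration; the post does not state the actual proportions.

```python
from datetime import datetime


def chronological_split(logs, train_fraction=0.8):
    """Sort (timestamp, text) pairs by time; older logs train, newer logs test."""
    ordered = sorted(logs, key=lambda entry: entry[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Invented records, deliberately out of order.
SAMPLE_LOGS = [
    (datetime(2021, 1, 3), "log c"),
    (datetime(2021, 1, 1), "log a"),
    (datetime(2021, 1, 5), "log e"),
    (datetime(2021, 1, 2), "log b"),
    (datetime(2021, 1, 4), "log d"),
]

if __name__ == "__main__":
    train, test = chronological_split(SAMPLE_LOGS)
    print([text for _, text in train])
    print([text for _, text in test])
```

Compared with a random split, this keeps any format drift that happened over time on the test side, which is what makes the evaluation harder.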
In the figure above, we see that the CRF (yellow) and BERT (orange) models consistently achieve the highest accuracy, while the LSTM (pink) and MiniBERT (purple) achieve lower but comparable accuracy on the Spark and Hadoop datasets. All models therefore do well on smaller domain shifts, but for larger domain shifts such as Cisco ASA (where the training data is simulated), the smaller transformer and the LSTM achieve considerably lower accuracy. This suggests that these models can overfit to the training data and aren’t as robust to domain shifts that may be present in production. The problem could potentially be addressed with data augmentation, different character embeddings or training procedures, or distillation from a pretrained model such as a transformer.
Next, we check the throughput of each model to assess model scalability and cost. Below is the throughput of each model with a batch size of 64 (with the exception of the CRF, which cannot do batch inference because of its Viterbi decoding, a dynamic programming algorithm):
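For readers unfamiliar with Viterbi decoding, here is a minimal pure-Python sketch of the dynamic program for a single sequence. The scores are made up, and this is a textbook formulation rather than the CRF implementation used in the experiments. Each position depends on the best scores from the previous position, so the loop over positions is inherently sequential.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence as a list of tag indices.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []            # best previous tag for each position and tag
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
            step_back.append(best_i)
        scores = step_scores
        backpointers.append(step_back)
    # Walk the backpointers from the best final tag to recover the path.
    best = max(range(n_tags), key=lambda j: scores[j])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    path.reverse()
    return path


# Made-up scores for two tags over three tokens.
EMISSIONS = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
STICKY = [[0.0, -5.0], [-5.0, 0.0]]  # switching tags is heavily penalized
NEUTRAL = [[0.0, 0.0], [0.0, 0.0]]   # transitions carry no information

if __name__ == "__main__":
    print(viterbi(EMISSIONS, STICKY))
    print(viterbi(EMISSIONS, NEUTRAL))
```

With neutral transitions the result is just the per-token best tag, while the sticky transitions pull the whole sequence toward one tag. That joint decision over the sequence is what the decoding step buys, at the price of the sequential loop.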
As we can see, the LSTM reaches a much higher throughput than other models.
Given that the LSTM achieves high accuracy and has much higher throughput than the other models, we decided to kick off our ML-powered field extraction with LSTMs. The following visualization shows the inputs and outputs for an LSTM trained on annotated Cisco ASA logs.
How Do We Efficiently Deploy Our ML Models?
Now that we knew we wanted to start with an LSTM architecture, our next question became: how do we deploy our LSTM models in a scalable and cost-efficient manner? Given that Splunk’s customers often deal with petabytes of data per day, this is a huge challenge. After assessing several options, we discovered NVIDIA Triton, an open source software project that simplifies the deployment of AI models at scale in production and is designed for exactly this purpose.
After going over Triton’s documentation, we had three main questions:
- Is it fast enough? In our experimentation, we found that running LSTM models on CPU gave us a throughput of ~640 predictions per second on an AWS c5.4xlarge instance, or about 40 predictions per core per second. Can NVIDIA Triton speed this up?
- Is it cost efficient? We estimate our cost to be $0.295 per 1M predictions using the throughput estimate from the previous question. Can Triton make this cheaper?
- Can it help us speed up our development cycle? At Splunk, we heavily utilize Apache Flink streaming technology to process logs at massive scale. However, our Apache Flink jobs are written in Scala, while our Applied Research team typically runs experiments in Python. This difference requires our applied researchers to reimplement some of their Python code in Scala, which can introduce bugs and slow down development. Could Triton remove the need for translating Python code to Scala?
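The CPU baseline figures in the questions above can be reproduced with a little arithmetic. The sketch below uses the 16 vCPUs of a c5.4xlarge instance and assumes an on-demand price of $0.68 per hour; that price is not stated in this post and is only an assumption that happens to reproduce the quoted cost.

```python
VCPUS_C5_4XLARGE = 16            # vCPU count of an AWS c5.4xlarge instance
ASSUMED_HOURLY_PRICE_USD = 0.68  # assumption: on-demand price, not from the post


def predictions_per_core(predictions_per_second, vcpus):
    """Throughput divided evenly across cores."""
    return predictions_per_second / vcpus


def cost_per_million(hourly_price_usd, predictions_per_second):
    """Dollar cost of one million predictions at a sustained throughput."""
    predictions_per_hour = predictions_per_second * 3600
    return hourly_price_usd / predictions_per_hour * 1_000_000


if __name__ == "__main__":
    print(predictions_per_core(640, VCPUS_C5_4XLARGE))
    print(round(cost_per_million(ASSUMED_HOURLY_PRICE_USD, 640), 3))
```

The same two functions can be reused for any other instance type by swapping in its price and measured throughput.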
With these questions in mind, we embarked on an experimental journey with the NVIDIA team to investigate how Triton and GPUs could help us. In the beginning, we weren’t sure if GPUs would even be worthwhile; although they are known to be faster, they are also significantly more expensive. In addition, since the LSTM model we planned to use is a relatively small neural network, we might not see large performance gains by switching from CPU to GPU. By working with the NVIDIA team, however, we were able to come up with a series of experiments to answer these questions. The two key experiments we ran were:
- Should we use Triton on CPUs or GPUs? Our experiments show that inference with Triton is 23x faster and 400% cheaper on GPU vs on CPU. Below, we can see how inference cost decreases with larger batch sizes:
- Which GPU instance is most cost-effective? Our experiments show that LSTMs on g4dn AWS instances (T4 GPUs) are 50% slower than p3 instances (V100 GPUs). However, since g4dn instances are about 6 times cheaper than p3, they’re 75% more cost-effective.
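One way to read the g4dn versus p3 comparison above is as a ratio of cost per prediction. The sketch below assumes that “50% slower” means each prediction takes 1.5 times as long and that “6 times cheaper” means one sixth of the hourly price; both readings are assumptions, and the function name is invented for illustration.

```python
def relative_cost_per_prediction(hourly_price_ratio, time_per_prediction_ratio):
    """Cost per prediction of instance A relative to instance B.

    hourly_price_ratio        : price of A divided by price of B
    time_per_prediction_ratio : time per prediction on A divided by that on B
    """
    return hourly_price_ratio * time_per_prediction_ratio


# Assumed reading: g4dn costs 1/6 as much per hour and takes 1.5x as long.
G4DN_VS_P3 = relative_cost_per_prediction(1 / 6, 1.5)
SAVINGS = 1 - G4DN_VS_P3

if __name__ == "__main__":
    print(round(G4DN_VS_P3, 2))
    print(round(SAVINGS, 2))
```

Under these assumptions a prediction on g4dn costs about a quarter of what it costs on p3, which lines up with the 75% figure. If “50% slower” were instead read as half the throughput, the same function with a time ratio of 2.0 would give a saving of roughly 67%.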
Knowing that LSTM inference is cheaper and faster on g4dn GPUs, we designed and proposed the following architecture:
In the above architecture, the stream of raw logs enters Splunk, where an Apache Flink job picks them up one by one and sends them as a request to a Triton cluster running LSTM models on g4dn AWS instances. The Triton server processes and returns the structured log to Apache Flink, which then passes it on to downstream consumers. Using this architecture allows us to independently scale the Flink and Triton clusters as needed.
Below, we can see a picture showing what Triton is responsible for:
In order for us to send Triton a raw log and get back a structured log, Triton must perform pre-processing, inference, and post-processing. This is done using Triton’s “ensemble” models, which can run arbitrary Python code using Triton’s Python backend. This not only simplifies our architecture by putting all log processing in one location, but also allows our applied researchers to easily deploy development code without translating it to Scala. This latter point gives the twofold benefit of speeding up researchers’ development cycle and reducing the opportunities for bugs to be introduced.
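The three stages can be sketched in plain Python as follows. This deliberately does not use any Triton API: the functions are hypothetical stand-ins, and the “model” is a toy rule in place of the LSTM so that the example runs on its own. It only shows how a raw log flows through the stages and comes back structured.

```python
def preprocess(raw_log):
    """Stage 1: turn raw text into model input (here, whitespace tokens)."""
    return raw_log.split()


def infer(tokens):
    """Stage 2: stand-in for the LSTM; a toy rule labels dotted quads as IPs."""
    return ["ip" if token.count(".") == 3 else "O" for token in tokens]


def postprocess(tokens, labels):
    """Stage 3: collect labeled tokens into a structured record."""
    fields = {}
    for token, label in zip(tokens, labels):
        if label != "O":
            fields.setdefault(label, []).append(token)
    return fields


def parse(raw_log):
    """Run all three stages, as a single request to the server would."""
    tokens = preprocess(raw_log)
    return postprocess(tokens, infer(tokens))


if __name__ == "__main__":
    print(parse("connection from 10.0.0.5 to 203.0.113.7"))
```

Keeping the three stages behind a single entry point is what lets the caller send a raw string and receive fields, without knowing anything about tokenization or label decoding.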
How Splunk Customers Could Access These Models
We envision an “Extract Fields (ML-Powered)” function on Splunk’s Data Stream Processor. Below you can see a picture of a real-life demo:
In the above, a theoretically infinite stream of logs is ingested through a Splunk Firehose, which is then fed to an “eval” function that extracts the raw text and passes it to the “Extract Fields (ML-Powered)” function to be parsed. From there, the parsed log fields can be routed to any downstream pipeline.
With our ongoing research, testing, and development, Splunk enables organizations to gain more value from their growing volumes of machine-generated data, including via accelerated machine learning using NVIDIA platforms such as Triton and Morpheus.
Future Steps in Architecture Development
We are currently building a proof of concept for the proposed architecture so that we can accurately estimate the full-system inference cost, and compare it against the current CPU-based architecture. Stay tuned!
Future Steps in Our ML Research
When we compared the 4 different NER models, we found BERT wins in 3 out of 4 categories, but its inference speed presents a challenge. We can address this problem in two ways. First, we can train a smaller, faster transformer model. For example, MiniBERT has a much faster inference time than BERT, so we could focus on improving its accuracy either by pre-training on log data instead of natural language text, or by altering the training paradigm to better suit log structure.
Alternatively, we could focus on distilling BERT into a smaller RNN model such as an LSTM through established knowledge transfer methods, removing the unnecessary natural language understanding and keeping only the knowledge relevant to log parsing. That said, we might still want to use transformer models in order to take advantage of the wave of improvements and optimizations that are coming out faster than ever.
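As one example of what such a knowledge transfer method can look like, the sketch below computes a soft-target distillation loss for a single token: the student is penalized for diverging from the teacher's temperature-softened label distribution. This is a common formulation rather than a method this post commits to, and the logits and the temperature of 2.0 are made-up values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher to the softened student."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


# Made-up per-label logits for one token (e.g. timestamp, ip, other).
TEACHER = [4.0, 1.0, 0.5]
STUDENT = [2.0, 2.0, 0.5]

if __name__ == "__main__":
    print(distillation_loss(TEACHER, STUDENT))
    print(distillation_loss(TEACHER, TEACHER))
```

The loss is zero only when the student reproduces the teacher's distribution, so minimizing it over many log tokens pushes the smaller model toward the larger model's behavior on logs specifically.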
This blog was co-authored by Abraham Starosta, Lisa Dunlap, Tanner Gilligan, Kristal Curtis, Zhaohui Wang, Vibhu Jawa, and Rachel Allen.
Abraham Starosta is an Applied Scientist at Splunk, where he works on streaming machine learning and Natural Language Processing problems. Prior to Splunk, Abraham was an NLP engineer at high-growth technology startups like Primer and Livongo, and interned at Splunk in 2014. He completed his B.S. and M.S. in Computer Science at Stanford, where his research focused on weak supervision and multitask learning for NLP.
Lisa Dunlap is a Machine Learning Engineer on Splunk’s Applied Research Team, semi-successfully researching natural language modeling in a data-limited environment. Prior to Splunk, she studied explainable AI and distributed systems for ML at UC Berkeley RISE Lab. She obtained a B.A. in mathematics and computer science from Berkeley and will be returning for her PhD.
Tanner Gilligan is a Machine Learning Engineer on Splunk's Applied Research Team, working on developing scalable and robust machine learning solutions for our customers. Prior to Splunk, he worked as a backend engineer at Oracle, started an AI startup with Abraham Starosta, and also spent several years doing AI consulting for multiple companies. Tanner received both his B.S. and M.S. in Computer Science from Stanford in 2016, with both degrees heavily focusing on AI.
Kristal Curtis is a Senior Software Engineer at Splunk. She is part of the Applied Research Team and works on machine learning and engineering projects related to automating the ingest to insights workflow in Splunk. Before joining Splunk, Kristal earned her PhD in Computer Science at UC Berkeley, where she was advised by David Patterson and Armando Fox and belonged to the RAD and AMP Labs.
Zhaohui Wang is a Principal Applied Scientist on Splunk’s applied research team. He works on streaming machine learning, NLP and security related problems. Prior to Splunk, he was a quant in machine learning at Wells Fargo. He graduated with a PhD in applied mathematics from North Carolina State University.
Vibhu Jawa is a software engineer and data scientist on the RAPIDS team at NVIDIA, where his efforts are focused on building GPU-accelerated data science products. Prior to NVIDIA, Vibhu completed his M.S. at Johns Hopkins, where his research focused on Natural Language Processing and building interpretable machine learning models for healthcare.
Rachel Allen is a senior cybersecurity data scientist on the Morpheus team at NVIDIA, where her focus is the research and application of GPU-accelerated machine learning methods to help solve information security challenges. Prior to NVIDIA, Rachel was a lead data scientist at Booz Allen Hamilton, where she designed a variety of capabilities for advanced threat hunting and network defense. She holds a bachelor’s degree in cognitive science and a PhD in neuroscience from the University of Virginia.