Most classical, batch-oriented machine learning systems follow the paradigm of “fit and apply”. In an earlier blog post, I discussed a few patterns on how to better organize data pipelines and machine learning workflows in Splunk. In this blog, we’ll review how you can organize your machine learning model in a new way: online learning.
Batch Learning Vs. Online Learning
The difference between batch learning and online learning systems is that in the first approach, you attempt to learn from a whole dataset at once and in the latter, you take incremental steps and constantly update your model “online”. In practical applications, there are pros and cons for each, making it hard to decide which approach is more suitable.
The main advantage of an online learning system is its typically lower compute and memory footprint, because you don’t have to process a large dataset all at once as in traditional batch learning. Where batch processing can be costly and training the model can take a long time, an online learner lets you feed in smaller batches of data incrementally and get faster responses. The system learns from each batch and retains the important characteristics in its model representation, while continuing to apply them to make inferences about the data presented. Additionally, the model can adapt to new situations as soon as new data points arrive, and therefore keeps learning.
With all those advantages, please keep in mind that there are also challenges to consider. The model should be able to handle concept drift, which can occur when the characteristics of the data change significantly over time. Additionally, if you only have the online model but no longer the historical data, it is difficult to meaningfully retrain the model if something goes wrong in your data or in the online algorithm of choice. In production-grade systems, you ideally have strategies in place to deal with such situations, especially if you rely on an online learning system for business-critical applications. Nevertheless, this approach is still a viable tool in your belt to consider for your use case.
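To make the contrast concrete, here is a minimal sketch in plain Python. A running mean and variance stand in for a "model"; the class name RunningStats and the sample values are made up for illustration and are not part of Splunk or DSDL. The batch version needs the full list in memory, while the online version sees one value at a time and keeps only three numbers.

```python
from statistics import fmean, pvariance


class RunningStats:
    """Running mean and variance that never stores past values (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def learn_one(self, x):
        # Update the state with a single new observation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.n if self.n else 0.0


# Made-up sample values, for illustration only.
data = [12.0, 15.0, 11.0, 14.0, 13.0, 40.0, 12.0, 16.0]

# Batch: needs the whole dataset at once.
batch_mean = fmean(data)
batch_var = pvariance(data)

# Online: one value at a time, constant memory.
stats = RunningStats()
for value in data:
    stats.learn_one(value)

print(f"batch  mean={batch_mean:.4f} var={batch_var:.4f}")
print(f"online mean={stats.mean:.4f} var={stats.variance:.4f}")
```

Both approaches arrive at the same statistics here, but only the online version could keep going indefinitely on a stream without the historical values.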
Example of an Online Learning Anomaly Detector
Since version 3.8, the Splunk App for Data Science and Deep Learning (DSDL), formerly known as the Deep Learning Toolkit (DLTK), allows you to tap into online learning algorithms powered by the River Python library. It comes with a dedicated container image and an example of an online learning anomaly detector based on the HalfSpaceTrees algorithm, an online variant of isolation forests. Half-space trees work well when anomalies are spread out over time, and less well when many anomalies are packed closely together.
In the screenshot above, you can see a simple time series of the access count to a Recruiting Service, represented by the blue bars. In the line chart overlay, you can see the green line indicating an anomaly score, which is calculated by the online learning model. On the left side of the chart, you’ll notice that the score appears after a certain defined warm-up phase which is quite typical for online learners. If you follow the green line even more closely, you can also see how, after a while, the learner adjusts from an average value of 0.40 to a lower value stabilizing around 0.25 on the right end of the chart. Finally, the orange line indicates the flagged anomalies based on a threshold that can be easily adjusted based on the desired sensitivity of the detector. That’s how the 11 anomalies are automatically spotted and could now very easily be used for alerting purposes or more sophisticated correlation searches.
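The warm-up and threshold logic described above can be sketched in a few lines of plain Python. The function name flag_anomalies, the score values, and the chosen thresholds are all hypothetical; they only mimic the shape of the chart and are not taken from the actual dashboard.

```python
def flag_anomalies(scores, threshold, warm_up):
    """Return the positions whose score exceeds the threshold.

    The first `warm_up` points are ignored, because an online learner
    has not yet seen enough data to produce meaningful scores there.
    """
    return [
        i for i, score in enumerate(scores)
        if i >= warm_up and score > threshold
    ]


# Made-up anomaly scores: three warm-up points, then a baseline that
# settles from about 0.40 down to about 0.25, with two spikes.
scores = [0.0, 0.0, 0.0, 0.41, 0.39, 0.72, 0.40, 0.26, 0.25, 0.68, 0.24]

strict = flag_anomalies(scores, threshold=0.6, warm_up=3)
sensitive = flag_anomalies(scores, threshold=0.4, warm_up=3)

print(strict)     # [5, 9]
print(sensitive)  # [3, 5, 9]
```

Lowering the threshold makes the detector more sensitive, at the price of flagging points that sit only slightly above the early baseline.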
Online Learning Workflow with Splunk and DSDL
To conclude this online learning example, let’s look at what a practical workflow in Splunk would look like. Typically, you would take the following steps to get your online learning system up and running in DSDL:
- Identify the appropriate algorithm in River and implement it as a DSDL Jupyter Notebook, e.g. like the existing river_halfspacetree.ipynb example.
- Create an initial model of your online learner with a search that contains your base search and … | fit MLTKContainer algo=river_halfspacetree window_size=100 n_trees=10 height=3 Recruiting into app:online_anomaly_detector … and has access to some data that works well with your algorithm of choice.
- Now that your model named “online_anomaly_detector” exists, you can launch a dedicated container for this model so that it is served uniquely for your use case.
- Define a search that contains … | apply online_anomaly_detector … and run it on the desired schedule, e.g. every 5 minutes on the last 5 minutes of new data. The existing online learner does inference on the new data and subsequently learns from its characteristics and updates itself.
- Depending on how you want to make your results actionable, you can decide e.g. to alert on the anomalies directly or write them into a summary index for further consumption on a dashboard or subsequent correlation searches.
- Optionally, add any additional logging or scoring to improve your machine learning operations and keep track of your model health and performance.
I hope this blog post provides you with a novel approach to some of your machine learning challenges. Please note that not all algorithms are equally suited for online learning purposes, so you should carefully evaluate your use cases and compare possible online learning approaches with traditional batch learning approaches to make an informed decision on which is the better fit.
If you are looking to learn more about the Splunk App for Data Science and Deep Learning, you can watch this .conf session to explore how BMW Group is using DSDL for a predictive testing strategy in automotive manufacturing. In case you are interested in how to use DSDL to scale out forecasting with Prophet, stay tuned for another blog post coming soon.
Happy online learning,
Many thanks to Judith Silverberg-Rajna, Katia Arteaga and Mina Wu for your support in editing and publishing this blog post.