TIPS & TRICKS

Forecasting at Scale: How to Process Millions of Time Series using Prophet and DASK

Splunk customers process massive amounts of data every day. Most of this data is typically indexed as events or metrics and contains valuable information about how organizations are performing their cybersecurity, IT or business-related operations. Some data streams already consist of implicit time series data, while others can easily be aggregated and transformed with SPL commands like tstats. Those results are often used to derive various key performance indicators (KPIs) that can be important for business-critical decision-making or simply to “keep the lights on” in operations.

Forecasting The Easy Way

The natural next step is to apply machine learning techniques to such time series data to predict the future. This is where forecasting methods can do their magic, and you’ll find a few standard methods ready to use in Splunk’s Machine Learning Toolkit (MLTK), such as ARIMA and StateSpaceForecast. This works well for many cases and is quick and easy to do with MLTK. Just have a look at some of the forecasting examples in MLTK’s showcase area.

Scale Out Your Forecasts

When you have to use different forecasting algorithms or libraries, or need to scale your forecasts over, say, millions of individual time series (one per entity), you are likely to hit some limits. The good news is that there is a solution pattern that has already been implemented successfully for customers who use the Splunk App for Data Science and Deep Learning (DSDL), formerly known as the Deep Learning Toolkit (DLTK), which is freely available on Splunkbase. It helps you overcome the limitations mentioned above and lets you:

  1. Leverage any of your data scientists’ favorite forecasting algorithms, such as the popular Prophet forecasting library, which is ready to use in DSDL (a minimal sketch follows this list).
  2. Scale out your computation workload on an elastic container environment of your choice, e.g. Kubernetes, and use parallelization frameworks like DASK to distribute your forecasting computations across multiple nodes and as many CPU cores as you need to get the job done in the desired time.
  3. Integrate your results back into the Splunk platform, use the forecasts for alerting purposes, and build interactive dashboards for your analysis and investigations.
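
To make the first building block concrete, here is a minimal sketch of a per-entity forecast function using Prophet. The function name, column names and forecast horizon are illustrative assumptions rather than the blog’s exact code; Prophet itself simply expects a DataFrame with a "ds" timestamp column and a "y" value column.

# Minimal sketch: forecast one entity's KPI time series with Prophet.
# Assumes `history` is a pandas DataFrame with the columns "ds" (timestamp) and "y" (KPI value).
import pandas as pd
from prophet import Prophet

def forecast_entity(history: pd.DataFrame, periods: int = 30) -> pd.DataFrame:
    model = Prophet()                       # default settings; tune seasonality etc. as needed
    model.fit(history)
    future = model.make_future_dataframe(periods=periods, freq="D")
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]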

How The Parallelization Works

Typically, when you need to solve a larger or more complex problem, you try to break it down into smaller parts that fit within your limitations. If you need to compute KPI forecasts for millions of entities on a daily basis, you can easily run into a computational challenge. The good news is that this type of challenge can be solved with parallelization so that the job gets done in less than 24 hours. By splitting the time series by entity key, you can first distribute the data and the forecast calculations across multiple cores, e.g. using a framework like DASK, and finally send the results back to Splunk via the HTTP Event Collector (HEC). The following diagram shows the core building blocks and the main DASK Python command:

[Diagram: core building blocks and the main DASK Python command]
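
The sketch below illustrates how such a fan-out could look with DASK’s distributed scheduler. The scheduler address, the entity_series loading step and the forecast_entity function from the earlier sketch are illustrative assumptions, not the exact production code.

# Minimal sketch: distribute per-entity Prophet forecasts over a DASK cluster.
# `forecast_entity` is the illustrative function from the earlier Prophet sketch.
from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")      # placeholder scheduler address

# Hypothetical helper: returns a dict mapping entity key -> DataFrame with "ds" and "y" columns.
entity_series = load_entity_series()

# Submit one forecasting task per entity; DASK spreads them across all available workers and cores.
futures = client.map(forecast_entity, list(entity_series.values()))

# Collect the finished forecasts, keyed by entity, ready to be sent back to Splunk HEC.
results = dict(zip(entity_series.keys(), client.gather(futures)))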

Results 

[Chart: execution time for forecasts]

With this approach, you are also able to run different tests to investigate how the computations scale. Building and running a million predictive models on a single CPU core typically takes over 23 days, i.e. roughly 550 core-hours. By adding more CPU cores to the DASK cluster, users can achieve a 15-hour runtime on a 36-core cluster (550 / 36 ≈ 15), which is well within a daily 24-hour window. You can also see on the chart that the approach scales linearly, so if you need to make more forecasts in the future, you'll know roughly how much additional computational cost that involves.


Wrap up

This blog described how you can scale out a specific forecasting use case to millions of entities with the help of DASK and Prophet. Both libraries are available with examples in the Splunk App for Data Science and Deep Learning, so you should have all the building blocks on hand should you need to build something similar. The latest version, 3.9, also ships an easy-to-use HEC class. Needless to say, many other modeling tasks can benefit equally from such a distribution strategy.
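
If you are not using DSDL’s HEC helper, the general pattern of sending forecast results back to Splunk looks roughly like the sketch below. The URL, token and sourcetype are placeholders, and posting one event per request is only for brevity; in practice you would batch events.

# Minimal sketch: send forecast rows to Splunk HTTP Event Collector (HEC).
# URL, token and sourcetype are placeholders; DSDL 3.9 provides its own HEC helper class.
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"   # placeholder token

def send_forecast_to_hec(entity, forecast_df):
    for row in forecast_df.itertuples():
        event = {
            "event": {"entity": entity, "ds": str(row.ds), "yhat": float(row.yhat)},
            "sourcetype": "prophet:forecast",
        }
        requests.post(
            HEC_URL,
            headers={"Authorization": f"Splunk {HEC_TOKEN}"},
            json=event,
            timeout=10,
        )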

If you want to learn more about the Splunk App for Data Science and Deep Learning, you can watch this .conf session to explore how BMW Group uses DSDL for a predictive testing strategy in automotive manufacturing.

Happy parallelizing,

Philipp


Credits

A big and bold THANK YOU goes to my former ML Architect colleague Pierre Brunel for pioneering and proving this solution in a production setup. Many thanks to Judith Silverberg-Rajna, Katia Arteaga and Mina Wu for their support in editing and publishing this blog post.

Posted by

Philipp Drieger

Philipp Drieger works as a Principal Machine Learning Architect at Splunk. He accompanies Splunk customers and partners across various industries on their digital journeys, helping them achieve advanced analytics use cases in cybersecurity, IT operations, IoT and business analytics. Before joining Splunk, Philipp worked as a freelance software developer and consultant focusing on high-performance 3D graphics and visual computing technologies. In research, he has published papers on text mining and semantic network analysis.