- several bug fixes
- library updates
- drill-down links on the container management dashboard
- four new algorithm examples that benefit from the latest update of the libraries available in the Golden Image CPU 3.6
Let’s get started with the new operational overview dashboard, which was built with Splunk’s brand-new Dashboard Studio. I highly recommend checking it out; you can learn more about it in this recent tech talk, which you can watch on demand.
Automated Machine Learning
Building machine learning models may require you to iterate over your model to find the best set of parameters for a given dataset. Doing this manually takes quite some time, so you might want to think about ways to automate it. We discussed a grid-search-based approach in the blog post for the DLTK version 3.4 release. Recently, my brilliant colleague Lukas worked with auto-sklearn, an automated machine learning library, and contributed a notebook and dashboard example showing how to apply this concept to a given dataset. Auto-sklearn automatically runs various machine learning algorithms over your dataset, builds models, scores them, tunes them further and finally provides you with the best model it found within a few user-defined constraints. However, this comes at the cost of compute, which is something to be aware of if you decide to apply it to your datasets.
The following dashboard shows the results of a classification model that was automatically built on top of a given example dataset. Under the scoring results you can find the model summary which shows more details about the automatic model building process. This approach can be useful in two typical scenarios:
- You have already built a model, either with Splunk’s Machine Learning Toolkit or some other techniques, and you want to know if it can be further improved.
- You have an unknown dataset and simply want a model built automatically, without spending time manually determining the right algorithms and their parameters.
Last but not least, please keep in mind that this approach helps you automate the modelling part, but you still need to really understand the business problem you want to solve and decide which datasets to use for modelling.
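To make the idea of an automated model search more concrete, here is a minimal sketch in plain Python. It is not auto-sklearn's API and not the DLTK notebook; it only illustrates the loop of trying several candidate models, scoring each on held-out data and keeping the best one within a user-defined time budget. The synthetic dataset, the candidate list and all function names are assumptions made for illustration.

```python
import random
import time
from collections import Counter


def make_blobs(n_per_class, rng):
    """Two well-separated 2-D Gaussian clusters labelled 0 and 1 (synthetic data)."""
    data = []
    for label, center in enumerate([(0.0, 0.0), (4.0, 4.0)]):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def knn(k):
    """Return a predict(train, x) function for k-nearest-neighbours voting."""
    def predict(train, x):
        nearest = sorted(train, key=lambda row: sq_dist(row[0], x))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict


def nearest_centroid(train, x):
    """Predict the label whose class mean is closest to x."""
    centroids = {}
    for label in {lab for _, lab in train}:
        pts = [p for p, lab in train if lab == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], x))


def auto_select(train, valid, candidates, time_budget_s):
    """Score candidates on held-out data until the time budget is used up."""
    start = time.monotonic()
    leaderboard = []
    for name, predict in candidates:
        if time.monotonic() - start > time_budget_s:
            break  # user-defined constraint: stop searching
        hits = sum(predict(train, x) == y for x, y in valid)
        leaderboard.append((hits / len(valid), name))
    return sorted(leaderboard, reverse=True)


rng = random.Random(42)
data = make_blobs(150, rng)
train, valid = data[:200], data[200:]
candidates = [("nearest_centroid", nearest_centroid)] + [
    (f"knn_k={k}", knn(k)) for k in (1, 3, 5, 9)
]
leaderboard = auto_select(train, valid, candidates, time_budget_s=10.0)
best_accuracy, best_name = leaderboard[0]
print(f"best candidate: {best_name} (accuracy {best_accuracy:.2f})")
```

The compute cost mentioned above shows up directly in this loop: every extra candidate means another full scoring pass, which is why a budget is part of the function signature.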
Robust Random Cut Forest
In the last blog post, which covered DLTK version 3.5, we discussed various new approaches for anomaly detection, especially in time series data. My colleague Greg recently wrote about cyclical statistical forecasts and anomalies. But of course there are even more useful techniques you may want to use. One of them is the so-called Robust Random Cut Forest, which is certainly not a tool for randomly cutting trees in your favourite forest nearby, but a tree-based modelling technique that tells you which data points in your dataset are the most anomalous. As you can see in the chart below, the red line marks the outliers found in a simple time series dataset. The typical cyclical behavior is automatically learnt as normal, and the parts that deviate more strongly are flagged as outliers.
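Independent of the dashboard, the scoring idea behind a random cut forest can be sketched in a few lines of plain Python. This is a simplified illustration, not the DLTK example: every tree is built on the full dataset rather than on a subsample, there is no streaming insert or delete, and the synthetic series, the window size and all function names are assumptions. Each tree cuts the data at random (dimension chosen in proportion to its value range), and a point's score is the largest ratio of sibling size to own subtree size along its path, averaged over the trees.

```python
import math
import random


def shingle(series, size):
    """Turn a 1-D series into overlapping windows so each point carries local context."""
    return [tuple(series[i:i + size]) for i in range(len(series) - size + 1)]


def codisp_one_tree(points, rng):
    """Build one random cut tree top-down and return a displacement score per point."""
    scores = [0.0] * len(points)
    dims = len(points[0])
    stack = [list(range(len(points)))]
    while stack:
        idx = stack.pop()
        if len(idx) < 2:
            continue
        lows = [min(points[i][d] for i in idx) for d in range(dims)]
        highs = [max(points[i][d] for i in idx) for d in range(dims)]
        spans = [hi - lo for lo, hi in zip(lows, highs)]
        if sum(spans) == 0:
            continue  # only duplicates left, nothing to cut
        # Random cut: dimension chosen proportionally to its range, position uniform.
        d = rng.choices(range(dims), weights=spans)[0]
        cut = rng.uniform(lows[d], highs[d])
        left = [i for i in idx if points[i][d] <= cut]
        right = [i for i in idx if points[i][d] > cut]
        if not right:
            stack.append(idx)  # cut landed on the maximum, draw again
            continue
        # A small group split off from a large sibling gets a high ratio.
        for side, sibling in ((left, right), (right, left)):
            ratio = len(sibling) / len(side)
            for i in side:
                scores[i] = max(scores[i], ratio)
            stack.append(side)
    return scores


def forest_scores(points, n_trees, rng):
    """Average the per-tree scores over several independently built trees."""
    totals = [0.0] * len(points)
    for _ in range(n_trees):
        for i, score in enumerate(codisp_one_tree(points, rng)):
            totals[i] += score
    return [total / n_trees for total in totals]


rng = random.Random(0)
# Synthetic cyclical series with a little noise and one injected spike.
series = [math.sin(2 * math.pi * t / 24) + rng.gauss(0, 0.05) for t in range(240)]
series[150] += 5.0
points = shingle(series, size=4)
scores = forest_scores(points, n_trees=40, rng=rng)
top = max(range(len(scores)), key=scores.__getitem__)
print(f"most anomalous window starts at t={top}")
```

Windows that follow the regular cycle sit in dense regions and are only separated after many cuts, so their ratios stay small, while the windows containing the spike are split off early from almost the entire dataset.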
Time Series Decomposition
Another useful preprocessing technique for time series data is decomposition, which lets you analyze whether the data contains seasonality and a trend. This technique is often referred to by the acronym “STL” (Seasonal-Trend decomposition using LOESS) and is helpful for forecasting scenarios, but also for anomaly detection. The example below shows how the raw time series is decomposed into its seasonal and trend components plus the remaining residual, which can subsequently be used to detect anomalies, as you can see in the panel at the bottom of the dashboard.
If you want to apply this technique directly on your stream of data, Splunk’s Data Stream Processor has an online STL algorithm available as well.
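As a rough illustration of the decompose-then-check-the-residual idea, the following sketch uses a classical additive decomposition (centered moving average for the trend, per-position averages for the seasonal part) in plain Python. It is deliberately not STL itself, which fits the components with LOESS smoothing, and it is not the code behind the dashboard. The synthetic series, the period of 7 and the three-sigma threshold are assumptions chosen for the example.

```python
import statistics


def centered_moving_average(x, period):
    """Trend estimate; edges where the window does not fit stay None."""
    half = period // 2
    trend = [None] * len(x)
    for t in range(half, len(x) - half):
        window = x[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: weight the two end points by one half each.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend


def decompose(x, period):
    """Split x into trend, seasonal and residual parts (needs a few full cycles)."""
    trend = centered_moving_average(x, period)
    # Average the detrended values per position in the cycle.
    buckets = [[] for _ in range(period)]
    for t, (value, tr) in enumerate(zip(x, trend)):
        if tr is not None:
            buckets[t % period].append(value - tr)
    means = [statistics.fmean(b) for b in buckets]
    offset = statistics.fmean(means)  # center the seasonal part around zero
    seasonal = [means[t % period] - offset for t in range(len(x))]
    residual = [
        None if tr is None else value - tr - s
        for value, tr, s in zip(x, trend, seasonal)
    ]
    return trend, seasonal, residual


def flag_anomalies(residual, n_sigmas=3.0):
    """Return the time steps whose residual is unusually large."""
    known = [r for r in residual if r is not None]
    sigma = statistics.pstdev(known)
    return [
        t for t, r in enumerate(residual)
        if r is not None and abs(r) > n_sigmas * sigma
    ]


# Synthetic data: linear trend plus a weekly pattern and one injected anomaly.
pattern = [0, 1, 2, 3, 2, 1, 0]
series = [0.1 * t + pattern[t % 7] for t in range(70)]
series[40] += 10.0
trend, seasonal, residual = decompose(series, period=7)
anomalies = flag_anomalies(residual)
print("anomalous time steps:", anomalies)
```

Because trend and seasonality are removed first, the threshold only has to deal with what is left over, which is what makes the residual a convenient input for anomaly detection.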
Sentiment Analysis with spaCy
Analyzing sentiment in natural language text data can be a very useful tool to better understand customer interactions with business services, product reviews or even social media feeds. Despite the challenge of language ambiguities and the complexity of working with natural language data, libraries like spaCy can still provide you with an easy starting point for building those kinds of analytics. The following example shows how a basic sentiment analysis can easily be applied to any text data available in Splunk. The results, such as sentiment polarity or subjectivity, can then be used for investigation, further aggregation or trend analysis.
If you want to apply sentiment analysis directly on a data stream with the help of Splunk’s Data Stream Processor, then you can check out the sentiment analysis functions there.
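To give a feel for what polarity and subjectivity scores are, here is a toy lexicon-based scorer in plain Python. It is a stand-in for illustration only: it does not use spaCy or any of its extensions, the word list and its values are hypothetical, and real sentiment libraries rely on far larger lexicons or trained models. Polarity runs from -1 (negative) to 1 (positive), subjectivity from 0 (factual) to 1 (opinionated).

```python
import re

# Hypothetical toy lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "great": (0.8, 0.75),
    "love": (0.5, 0.6),
    "good": (0.7, 0.6),
    "fast": (0.2, 0.6),
    "terrible": (-1.0, 1.0),
    "slow": (-0.3, 0.4),
    "bad": (-0.7, 0.67),
    "broken": (-0.4, 0.4),
}
NEGATIONS = {"not", "never", "no"}


def sentiment(text):
    """Return (polarity, subjectivity) averaged over the lexicon words found in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        polarity, subjectivity = LEXICON[token]
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -0.5 * polarity  # simple negation rule: flip and dampen
        hits.append((polarity, subjectivity))
    if not hits:
        return 0.0, 0.0  # no opinion words found: treat as neutral and objective
    return (
        sum(p for p, _ in hits) / len(hits),
        sum(s for _, s in hits) / len(hits),
    )


reviews = [
    "Great product, I love it",
    "The service was terrible and slow",
    "Not good at all",
    "Invoice attached",
]
for review in reviews:
    polarity, subjectivity = sentiment(review)
    print(f"{polarity:+.2f} {subjectivity:.2f} {review}")
```

Scores of this kind are plain numeric fields per event, which is why they lend themselves to the aggregations and trend analysis mentioned above.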
Finally, I want to give kudos to three contributors. A big shout out goes to Andreas Zientek from Zeppelin for his super useful contribution to the container management dashboard. Secondly, I want to thank my colleague Lukas Utz for all his work to make auto-sklearn easily accessible in DLTK. And another big thanks to Greg Ainslie-Malik who contributed all the other examples as a result of recent customer engagements.
We all hope that you can make good use of those new techniques and get even more value from your data in Splunk! If you have any questions, please engage with the community or reach out to us - we’re happy to help with your MLTK or DLTK use cases!