What's New in MLTK Version 4.0

The Splunk Machine Learning Toolkit enables you to detect, predict and prevent what matters most to your organization. Modeling assistants, interactive examples and tutorials help you create and operationalize your own predictive analytics.

Video Transcript

Hi, and welcome to the MLTK 4.0 what's new video. It's October 2018. I'm going to go through the high-level concepts of all the things that we baked into the Machine Learning Toolkit, as well as some of our new add-ons for the Machine Learning Toolkit before we go into depth on each one of those.

So first, we're going to talk about the Splunk Machine Learning-based apps. These are available from Splunkbase.com. That's the Machine Learning Toolkit 4.0, new release here at .conf 2018. And part of the Machine Learning Toolkit is now a Splunk community for MLTK algorithms on GitHub. So now, you can go to GitHub and download a large variety of additional algorithms, and make them available in your individual copy of the Machine Learning Toolkit, as well.

We also have the Python for Scientific Computing app version 1.3. That was released with MLTK 3.4. And every so often, we're going to be updating the Python for Scientific Computing app, which is about bringing new capabilities to the Machine Learning Toolkit from a library perspective. We have two additional add-ons.

We have an add-on, the Splunk MLTK Connector for Apache Spark, which is in a limited availability release. To get a hold of this particular app, you need to talk to your Splunk sales team, but we will go over what it is when we hit that part of the presentation. As well, we have the Splunk Machine Learning Toolkit Container for TensorFlow, which is only available through certified Splunk PS.

And this is about the Machine Learning Toolkit connecting to your GPU infrastructure to run a TensorFlow deep learning neural network algorithm. So those are the pieces, and let's go through them. There's a lot of new content in Machine Learning Toolkit 4.0. Let's just briefly run through the list, and then we'll dive into each piece later.

The Conf 2018 Showcase Refresh is all about taking the showcase-- that's the first page that loads, when you first click on the Machine Learning Toolkit-- and extending the already 30-plus demos and tutorials to add new demos and tutorials based on our customer engagements over the last couple of years. These are end-to-end tutorials and examples for applying machine learning in Splunk to solve business problems.

We've also updated and enhanced our Experiment Management Framework so that any machine learning models that are retraining through time-- i.e. learning through time-- now record their history, record their accuracy, as they increase or decrease accuracy through time, into the EMF History panel. If you're interested in more about how the Machine Learning Toolkit learns, please watch the appendix video, or a separate video will be cut of that, as well.

There's a new machine learning command in the Toolkit. It's called the score command, and it's all about validating models and running important statistical tests for any use case. This has been a really big ask by customers. In fact, I'm going to make a callout for one particular type of validation-- kfold validation, which is a popular and powerful way to quickly test if your model is overfitting on the data.

We've also added additional algorithms-- LocalOutlierFactor, which is an unsupervised anomaly detection system. And we have enhanced the multilayer perceptron classifier. That's that neural network that we've already put into the Toolkit. We've enhanced it with a partial fit option for incremental learning. In addition, the MLSPL.conf file-- the critical configuration file that controls the amount of compute and resources used by machine learning in Splunk-- has been added as a UI to the Machine Learning Toolkit under the Settings tab.

The Machine Learning Toolkit downloads in safe mode with guardrails on to keep customers from putting the Machine Learning Toolkit into a production experience, and consuming a vast number of resources building a large machine learning model. This configuration file is often changed by customers. And we'll go into some of those details when we get to it.

I've mentioned that we've updated the Python for Scientific Computing app. It is required for all new versions of the Machine Learning Toolkit. And then finally, we have the new Splunk community for Machine Learning Toolkit algorithms on GitHub. This is a place for customers to easily share and leverage custom algorithms via manual installation with the MLSPL-API.

So the Toolkit is an extendable system, through the MLSPL-API, that allows you to bring in net new algorithms into the toolkit-- either your algorithms or one from the community. And GitHub enhances and facilitates that sharing. I mentioned we have launched two different add-ons for the Machine Learning Toolkit. I want to briefly go over them.

The Splunk Machine Learning Toolkit Connector for Apache Spark limited availability release is all about leveraging your current Spark cluster infrastructure for building large-scale models inside of the Splunk experience. So you'd bring your own Spark cluster, 2.2 version and higher. We have a connection system that connects the Machine Learning Toolkit experience to the Spark systems seamlessly, so you don't have to be a Scala programmer to leverage the power of Spark.

This is available via the Splunk data portal, and please reach out to your Splunk sales team. The reason why you might be interested-- if you're looking for truly large model building with Spark MLlib, a super popular distributed machine learning library-- if you want access to that via the Splunk system, via the command line, then this is a great product for you to go and use.

We have a UI wizard for establishing the connection and testing the connection between the two products. And not all Spark MLlib algorithms are pre-wrapped, and we don't have an extensibility story yet for customizing those particular algorithms. This is not available in Splunk Cloud. The Splunk Machine Learning Toolkit Container for TensorFlow is available by a certified PS for installation.

It's all about the super popular open-source deep learning framework from Google, but now running and experiencing those deep learning algorithms from inside the Splunk SPL experience. This is an explicit ask from customers. And we're absolutely customer-driven, so thank you for making that ask, and we're happy to help you.

The purpose here is, if you have multiple CPU or GPU infrastructure, and you have a deep learning use case for neural networks, and you want to accelerate that deep learning system from the Splunk experience, this is the container for you. This is not about generic Splunk acceleration. This is only available, as I said, through certified PS-- we call that the PS ML SKU-- and it's only available in very specific operating systems-- OS X and Linux, for example.

The neural network experience is entirely by the SPL command line. We have not built a UI story for assembling neural networks. This is about leveraging your GPU infrastructure and your need for a particular neural network inside of Splunk. A single search head is leveraging that multi GPU or CPU system for machine learning.

You cannot use our streaming_apply to scale out under the indexers. We rely entirely on the additional infrastructure that you are making available to the container. It is unlikely that you will get real-time application on 100,000 plus events per second. When we talk about large real-time data set application of machine learning, you really need to stick with the Machine Learning Toolkit. This is not available on Splunk Cloud.

So let's talk about what we're really doing with Machine Learning Toolkit, and all these different add-ons and containers. This is all about a unified SPL experience, a unified search processing language experience from the Splunk perspective, for you to be able to access custom or advanced machine learning, including on external compute infrastructure, all through the seamless experience of using fit and apply.

So today, the Machine Learning Toolkit that you've already downloaded connects to Python for Scientific Computing, and exposes all the different algorithms and possible combinations, including custom ones, through the MLSPL-API extensibility, through the fit and the apply command. So you fit some algorithm onto a data set and persist a model onto disk.

That model is persisted. And then in real-time search or a later search, you can then apply that model without loading all of the previous training data. That's the inherent SPL experience. So part of that experience I mentioned was about being able to do extensibility, bringing in your own algorithms, or making your own from the different LEGO blocks that we have inside of the Python for Scientific Computing space.
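
The fit-then-apply pattern described here can be sketched outside of Splunk in plain Python. This is only an illustration and not how the Toolkit is implemented: the `fit` and `apply` functions, the JSON model file, and the toy mean-and-deviation "model" are all hypothetical stand-ins chosen to show that applying a persisted model does not need the training data.

```python
import json
import os
import tempfile
from statistics import mean, pstdev


def fit(values, model_path):
    """Learn a tiny 'model' (mean and standard deviation) and persist it to disk."""
    model = {"mean": mean(values), "std": pstdev(values)}
    with open(model_path, "w") as f:
        json.dump(model, f)
    return model


def apply(values, model_path, threshold=2.0):
    """Load the persisted model and flag outliers; the training data is not reloaded."""
    with open(model_path) as f:
        model = json.load(f)
    std = model["std"] or 1.0  # guard against a constant training set
    return [abs(v - model["mean"]) / std > threshold for v in values]


# Hypothetical model location; a temporary directory keeps the sketch self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "example_model.json")
fit([10, 12, 11, 9, 10, 11, 10, 12], model_path)

# A later "search" only needs the new events plus the saved model file.
flags = apply([11, 30], model_path)
```

In this sketch the second value is flagged because it sits far outside the range seen during training, while the first is not.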

We've launched a community for Machine Learning Toolkit algorithms on GitHub, and it's all about you being able to share with the greater community any additional open-source algorithms that you've built or used so that others can leverage those algorithms, or you can get feedback on the use of those different systems. This is something that you should be able to go and download from GitHub and install via the readme instructions, and be up and running very quickly.

This does require admin access to the Splunk system, and is not available in the cloud. So remember, work with your system admin. When we talk about the extensibility parts, when we talk about keeping the Splunk experience of being able to use commands from the search pipeline to build machine learning models and then apply machine learning models, it's the same thing with the Splunk MLTK Connector for Apache Spark.

This is about using a slightly different command-- sfit-- and being able to call the very popular MLlib algorithms in your own Spark cluster, persist the model on disk, and then later be able to call sapply to apply that model to net new data, and take action. This, again, is not available in Splunk Cloud. Now, the Connector for Spark, it goes through a wizard and tutorial that teaches you how to connect to the Spark system. It also lets you test the different configurations.

We've been running with a number of different customers, and this is a critical point for establishing trust between the two systems, and being-- making sure that you have all of your security concerns addressed. The Splunk Machine Learning Toolkit Container for TensorFlow via PS is all about continuing that same experience. You're going to use fit and you're going to use apply to create a neural network in the TensorFlow Docker.

You want to persist the model, and be able to apply that model then to other events, and take action. I did make a callout about scale. And this is important: the Machine Learning Toolkit can scale down to your indexers, because it's a native part of Splunk, while this is a Docker container leveraging external infrastructure that you've set aside specifically for deep learning. This is not available in Splunk Cloud today.

Now, the TensorFlow container comes with a number of examples. In the back, you can see the Machine Learning Toolkit container for classification using a neural network. We also have a couple of helpful setup pages and Docker detection pages to show you where your Dockers are, and which one you're connected to. The architecture is also part of the app. When you have it installed, you should be able to see the architecture, and really get a good reference to how the two products are working together. Again, this is not available in cloud.

So now that we've covered the additional pieces in the ecosystem, let's drill down into the new capabilities in the Machine Learning Toolkit 4.0. So the first thing I'd like to talk about is the Conf Showcase Refresh. So this is all about the showcase. That is the first landing page, whenever you launch the Machine Learning Toolkit, where we show you the 30-plus end-to-end examples of using machine learning to solve particular business problems in Splunk.

These are tutorials and demos that are also available under the Videos Tutorial link at the top of the showcase page. We've added additional ones, things like predict future VPN usage using sinusoidal time, or predict future VPN usage using categorical time, or even being able to predict external anomalies under the predict categorical fields. These are example workflows and example data sets that we've created to really tell the story, and give you a quick way of understanding how you might use machine learning in Splunk to solve a problem.

In addition, we have the Experiment Management Framework update, where we are now recording the statistics for all the model re-trainings. This is all about how machine learning models are put into production and are set to some sort of training schedule so that they update themselves, as new data appears, persisting that additional learning-- this is all about machine learning-- persisting that learning, as time rolls on.

So when you have an experiment and you're trying to have it update through time, you want to be able to monitor and measure the accuracy of that model, as time moves forward. And this is all about automating that process for you through the Experiment Management Framework. Please do check out this short video at the end, as well as the external video we'll make, as well, about how the Machine Learning Toolkit learns.

New for Conf 2018, we've also added the Machine Learning Toolkit UI for the MLSPL.conf file. I know that's a bunch of alphabet soup, but all Splunk apps have a conf file-- a configuration file-- that is about setting the configurations for that particular app. Machine learning is a CPU runtime and memory intensive process. So the Toolkit downloads in safe mode with guardrails on.

It's all about having the resources that machine learning is going to consume be constrained out of the box. Your admin then has access to be able to change that configuration file. But we found that users want to know what the current settings are, so they can recommend to the admin that they need some of the different distinct values changed. That's what this UI is for. It's to communicate with the user what are the current settings.

And as an admin, you can come to this interface and change the configuration file, as well, instead of going to the command line. This is limited in cloud. You need to open a cloud ticket to have any of these values changed. Please note that we also have tool tips for every single one of these pieces. So you can hover above, like so, and see what each one of the different configuration settings are.

So also, in the Machine Learning Toolkit 4.0, we've added a score command. The score command brings to the SPL command line a bunch of ways of validating a machine learning model, but also running statistical tests. So here, for example, is running the confusion matrix from a score command, instead of from a macro. And this means that we've actually brought that command more fully into the SPL experience so you can customize and change how you are scoring your models.
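
For readers who have not worked with one, a confusion matrix simply counts how often each actual label was predicted as each possible label. The following is a minimal Python sketch of that counting, not the score command itself; the `confusion_matrix` function name and the sample labels are made up for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; rows are actual, columns are predicted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Made-up labels for six events.
actual = ["ok", "ok", "fail", "fail", "ok", "fail"]
predicted = ["ok", "fail", "fail", "fail", "ok", "ok"]

labels, matrix = confusion_matrix(actual, predicted)
# labels is ["fail", "ok"]; matrix is [[2, 1], [1, 2]]
```

The diagonal holds the correct predictions, and everything off the diagonal is a misclassification.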

Over there on the right, I want to call out that we are still communicating with you about what events are removed as part of the fit and the apply process. So score also will reject certain pieces of data, if they're badly formed, because you can't score on badly formatted data. And that sort of thing is covered in our documentation.

You can get to the documentation by loading the Showcase and going to the Docs link, which will take you to the most recent docs for whichever version of the Machine Learning Toolkit you're running. And in the user guide, we have detailed examples of all of the score command uses under Custom Search Commands, Using the Score Command. And each section of classification, or clustering, or statistical tests that you want to access via the score command has a set of examples and detailed notes.

Kfold scoring method is a really popular and powerful way to be able to test if your model is overfitting or underfitting your data. What it does is you tell the kfold system how many splits, how many kfolds you want to split through the data. And internally, it will move the test and training sets randomly without overlap to show you the validation scores and testing for each one of the different folds.
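
The splitting step can be sketched in a few lines of Python. This is an assumption-laden illustration of k-fold splitting in general, not the Toolkit's implementation: `kfold_indices`, `kfold_splits`, and the fixed `seed` are hypothetical names and choices. Rows are shuffled once, dealt into k non-overlapping test folds, and each fold takes a turn as the test set while the rest is used for training.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Shuffle row indices and deal them into k non-overlapping test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def kfold_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists; every row is tested exactly once."""
    folds = kfold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(kfold_splits(n_samples=10, k=3))
for train, test in splits:
    print(f"train on {len(train)} rows, test on {len(test)} rows")
```

A model would be fit on each `train` list and scored on the matching `test` list. Scores that are much better on the training rows than on the held-out rows are the usual sign of overfitting.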

So this is currently available for regression and classification models in Splunk as a native new command. Thank you. We're now going to go into an appendix video talking about how the Machine Learning Toolkit learns in Splunk. So when you load a search into Splunk-- maybe into the Machine Learning Toolkit, like the predict numeric fields-- you're loading some set of data.

Maybe it's the last 24 hours, over there on the right. Maybe it's the last 30 days. It's some training set of data. And we're going to go through that training data-- that large data set-- we're going to learn using the fit command, and we're going to save that learning onto disk. And the idea is that maybe you might have a rolling window. So say, instead of 24 hours, this is a 30-day load, and you were setting it as a scheduled search.

So that means every night at midnight, we load the past 30 days. We process all of that data. We save it onto disk onto that model file. So now, we have a rolling window-- that is, Monday night at midnight, we load the last 30 days. We persist it as a model. Tuesday night, we move forward one day. So our rolling window, our relative window of 30 days, loses the oldest day and gains the net new day. And we learn the new behaviors, and persist that data onto that model file again.

So if we think about this as a rolling way of trying to predict one day at a time, you have this 30-day window-- 30-day relative window on a scheduled search. You persist the data with the fit command. And then during the day-- during that blue period right there-- we use the pipe apply. We apply that data moving forward. And this is what's called batch learning.
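
The rolling window can be made concrete with a small Python sketch. It only computes which days each nightly retrain would load; the `training_window` helper and the example dates are assumptions for illustration and are not part of the Toolkit.

```python
from datetime import date, timedelta


def training_window(run_date, days=30):
    """Return the days loaded by a retrain that runs at midnight on run_date."""
    return [run_date - timedelta(days=offset) for offset in range(days, 0, -1)]


# Two consecutive nightly runs (example dates only).
first_run = training_window(date(2018, 10, 1))
next_run = training_window(date(2018, 10, 2))

dropped = set(first_run) - set(next_run)  # the oldest day falls out of the window
gained = set(next_run) - set(first_run)   # the newest full day enters the window
```

Each run still reloads all 30 days, even though only one day of data differs between consecutive runs.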

So as we retrain the data, as every day goes on, we're losing the oldest day and we're gaining one net new day. Now, the problem with this is that you're constantly loading 30 days worth of data at midnight, or whenever your scheduled search is-- constantly learning that data. And you're not retaining any knowledge from before that 30-day window.

It's gone. So Monday 30 days ago-- let's see-- it was a Friday-- when we go on to Tuesday, you've lost that old Friday. You've lost the oldest day, and it's gone forever. But there is a way to change that in the Toolkit. There is a way to change that-- online learning-- or batch online learning, in this case. And it looks exactly the same. It's just some search that we load into the Toolkit. We persist that data as some sort of scheduled search, so it's learning.

But now, we're going to use partial_fit. So this was the original idea. Instead of fit, we're going to use partial_fit. It's actually a flag inside of the fit command. It only works with very specific algorithms. And the idea is that, before, we had this 30 days worth of data. So we load the 30 days on Monday, we persist the data on disk, and we predict Tuesday with it.

But then for partial_fit, we no longer need to load the 30 days. We've saved that information into an extended model file. So now, all we have to do is load the entirety of Monday-- that light blue bar-- we load just the net new data with the partial_fit. It then learns the net new behaviors, retaining the history of the past inside of its mathematical brain, and we persist that new model on disk to predict Tuesday, for example.
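
The idea of incremental learning can be sketched with a deliberately simple stand-in. The `RunningMean` class below is hypothetical and far simpler than the algorithms that support partial_fit in the Toolkit; it only shows that a model can keep a compact summary of the past and fold in one new batch at a time, arriving at the same answer as a full reload of all the data.

```python
from statistics import mean


class RunningMean:
    """Toy incremental model: stores a count and a mean instead of the raw history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def partial_fit(self, batch):
        """Fold one new batch into the stored state without reloading old data."""
        for value in batch:
            self.count += 1
            self.mean += (value - self.mean) / self.count
        return self


# Three made-up daily batches; each night only the newest one is loaded.
days = [[10, 12, 11], [9, 13], [14, 10, 12, 11]]

model = RunningMean()
for daily_batch in days:
    model.partial_fit(daily_batch)

# For comparison: what a full refit over all the history would compute.
full_refit = mean(v for day in days for v in day)
```

Here the incremental result matches the full refit exactly. Real algorithms generally only approximate a full retrain when trained incrementally, which is part of why only some of them support this mode.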

And we keep doing this. So you're doing the same kind of retraining, but now the incremental load every night is not the 30 days. Instead, you're using that partial_fit to incrementally move yourself forward. The apply step is the same. It's that incremental partial_fit. Now, only certain algorithms allow you to use partial_fit. That's why I called it out under the MLP classifier-- the multilayer perceptron classifier that we added to the Toolkit in 3.4.

There are very few algorithms that support it. Check the docs. This is about the algorithms, not Splunk. Splunk is using the open-source libraries. Whether those open-source libraries-- whether the math they're using is actually advanced enough to support partial_fit is a big question, and you can check that in the docs. Remember that that slowly builds up a larger model file.

Now, for the most part, that's not really going to be a big issue. You're talking about a very small incremental amount of information. Remember that, as you move through time, you're going to want to start looking at things like what is the weight of the additional data, as opposed to historic data? So let's say I keep partial_fitting over-- 60 days into the future. So every day, I'm loading just the one day over and over again, and adding the net new behavior.

After 60 days or so, do I want to downweight data that came from 60, even 90 days, in the past, or do I want it to have the same weight as what happened yesterday? So these are the questions that you need to answer, when you're building a partial_fit model. Thanks for listening, and I hope you have a great day.