Machine Learning Toolkit and Showcase 1.1


Video Transcript

Here we have a new copy of Splunk installed with the Machine Learning Toolkit and Showcase and its supporting app, Python for Scientific Computing. Let's go to the showcase and explore some of the pre-populated use cases through the modeling assistants. Please note that the modeling assistants are not meant to guide the user through the exhaustive list of machine learning algorithms available, only through the subset of uses our customers are most interested in. Let's explore the app.

The showcase is organized around the modeling assistants. Each modeling assistant gives a guided machine learning experience: for instance, predicting numeric fields, predicting categorical fields, and detecting numeric outliers. Each modeling assistant link is followed by a short paragraph describing its use case and which algorithm is being used.

There are a series of examples drawn from IT, security, and business analytics. If you want to filter these by example type, simply select the subgroup, and the list of pre-populated stories will be shortened. I'm going to go ahead and leave it on all examples.

Let's go through one of the examples, say, predicting server power consumption. Now, the modeling assistant here is predicting numeric fields, and my algorithm will be linear regression. This is not the only algorithm capable of predicting a numeric field, but linear regression will be a good starting point for exploring the modeling assistant and the guided machine learning experience. Let's go ahead and click on it.

In this pre-populated assistant, we're going to predict the power consumption, specifically, the AC power field from the data set in the search bar. My search bar is currently set to all time and I'm loading a CSV, but this could be any search in Splunk, and you'll be able to select any time range as per a normal search.

After I run the search, the modeling assistant populates the dropdown, allowing me to select which field I want to predict, and then allows me to select which fields I want to use for predicting, including being able to remove a field or add it back in. I can then split my entire time window into training and test by percentage, in this case, 50-50. If I wanted to, I could slide that back and forth and change the percentages for the training and test set.
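To make the percentage split concrete, here is a minimal Python sketch of dividing time-ordered events into an earlier training portion and a later test portion. The function name and the sample events are hypothetical, and this is only an illustration of the idea, not how the assistant is implemented.

```python
def split_by_percent(rows, train_percent):
    """Split time-ordered rows into (train, test).

    The earliest `train_percent` percent of rows become the training set
    and the remaining, later rows become the test set.
    """
    if not 0 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0 and 100")
    cut = len(rows) * train_percent // 100  # integer math avoids float rounding
    return rows[:cut], rows[cut:]


# Hypothetical events, already sorted from oldest to newest.
events = list(range(10))

train_50, test_50 = split_by_percent(events, 50)  # the 50-50 split
train_70, test_30 = split_by_percent(events, 70)  # sliding it to 70-30
```

Moving the slider corresponds to changing `train_percent`; the events themselves are not reordered.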

And finally, I save the model into a named lookup for later use. This is very important, because applying an algorithm to a set of data and creating a model is not enough on its own: you still need to persist that model in Splunk for later use. When I click Fit Model, Splunk runs linear regression in the background to predict the AC power, using these fields as features, on the first 50% of my time range. That's the training data set. It saves those coefficients to a lookup and then applies them to the test set, the later 50%.
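The fit, save, and apply sequence can be sketched in a few lines of Python. This is a simplified illustration with a single feature and ordinary least squares; the data values, the feature, and the model name `example_server_power` are all made up for the example, and the dictionary only stands in for the named lookup.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def apply_line(model, xs):
    """Apply saved coefficients to new feature values."""
    return [model["intercept"] + model["slope"] * x for x in xs]


# Hypothetical time-ordered data: one feature and the field to predict.
cpu = [10, 20, 30, 40, 50, 60, 70, 80]
power = [120, 140, 160, 180, 200, 220, 240, 260]

half = len(cpu) // 2
model = fit_line(cpu[:half], power[:half])       # train on the first 50%

saved_models = {"example_server_power": model}   # stand-in for the named lookup

# Later: load the saved coefficients and apply them to the later 50%.
predicted = apply_line(saved_models["example_server_power"], cpu[half:])
```

The point of the sketch is the separation of steps: the coefficients are computed once from the training rows, stored under a name, and then reused on rows the fit never saw.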

The modeling assistant has a series of preselected visualizations for you to validate your model. This preselected set of visualizations changes depending on which assistant we're in. For predicting a numeric field, we're interested in the actual versus predicted values. We want to see whether the errors are clustered around a particular center here in the histogram, or whether we have a series of points that more or less lie on the predicted path. We probably also want to know the R squared or the root mean squared error, and we probably want to be able to access the actual parameters.
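For reference, the two summary statistics mentioned here can be computed directly from the actual and predicted values. The sketch below uses the standard definitions; the numbers are invented for illustration and are not taken from the showcase data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical test-set values.
actual = [200.0, 220.0, 240.0, 260.0]
predicted = [202.0, 218.0, 241.0, 259.0]

# Residuals are what the error histogram is built from.
residuals = [a - p for a, p in zip(actual, predicted)]

error = rmse(actual, predicted)
score = r_squared(actual, predicted)
```

An R squared close to 1 and a small root mean squared error, relative to the scale of the field, both suggest the points lie close to the predicted path.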

All of these are precanned visualizations that simply read in the output of applying that model to a set of data. And all of this is done for you by the assistant, just by clicking Fit Model. Now, at the very bottom, the assistant has actually written SPL for you, because ultimately, all of these are searches. And in Splunk, all searches are written in SPL.

So we can actually see the very first part of my search, inputlookup server_power.csv, and then the SPL it has written based on the interaction above, where I'm selecting a specific field, AC power, to predict from a set of fields, and I'm saving the model into a named lookup. You can click the link above any of the SPL, and Splunk will copy the SPL into a new search.
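As a rough picture of how the choices in the assistant map onto a search string, here is a small Python sketch that assembles one. It assumes the general shape of the toolkit's fit command (algorithm, field to predict, "from" the feature fields, "into" a model name); the helper itself, the field names, and the model name are hypothetical, and the SPL the assistant actually generates may differ in detail.

```python
def build_fit_search(lookup_file, algorithm, target, features, model_name):
    """Assemble an SPL-style search string from assistant-like selections.

    Illustrative only: the field and model names passed in below are
    placeholders, not values read from the showcase.
    """
    quoted_features = " ".join('"{}"'.format(name) for name in features)
    return '| inputlookup {} | fit {} "{}" from {} into "{}"'.format(
        lookup_file, algorithm, target, quoted_features, model_name
    )


search = build_fit_search(
    lookup_file="server_power.csv",       # assumed file name
    algorithm="LinearRegression",
    target="ac_power",                    # assumed name of the AC power field
    features=["feature_a", "feature_b"],  # placeholder feature fields
    model_name="example_model",           # placeholder model name
)
```

Removing a feature or renaming the model in the assistant corresponds to changing one argument here, which is the same kind of change you can make by hand once the SPL is open in a new search.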

In that new search, you can customize your model to your heart's content, including deleting a feature or changing the name of the file that you're saving your model to. Because the time range picker has now been reset, the search is not using the training and test split that came with the guided machine learning experience in the assistant. It is using all of the data to build this model and save it as sample A. So this model will actually be different from the one we just created in the assistant, and it will also be named differently.

Now, any one of these can be brought up as another search, including the graphics, where it will actually pull up the visualization that the assistant used to validate your model. Like any panel in Splunk, you can save this to a dashboard or report. Let's go through another assistant. I'm going to go back to the showcase, and this time I'm going to predict categorical fields instead of numeric fields, using the logistic regression algorithm. And I'm going to predict telecom customer churn.

So in this data set, there's a field called Churn?, which is either true or false. It's a category. I'm going to use a set of fields, a set of features, to predict whether a customer churns, true or false, and I'm going to change my percentages so that I have 70% training and 30% test. I'm going to save it as example churn.

So in the background, logistic regression is being called and all the data is being loaded for the training set. It's creating the coefficients, saving them to that example churn file, and then it's applying it to the test set, the later 30%. This time, because it's a different assistant, we simply have the validation of the model and the classification or confusion matrix to tell us how accurate or how inaccurate the model was.

In this case, the model was relatively accurate when it predicted false and the actual value was false, and relatively accurate when it predicted true and the actual value was true. But it wasn't perfect. It was just 77% accurate, which is good enough for our example. When we're fitting that model and we want to actually look at the SPL, we can, again, click any one of those links and pull in the entirety of that SPL to continue modifying by hand.
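A confusion matrix is just a count of each (actual, predicted) pair, and accuracy is the share of rows where the two agree. Here is a minimal Python sketch with invented labels; it is not the showcase data and does not reproduce the 77% figure.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of rows where the prediction matches the actual value."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical churn labels for ten test-set customers.
actual = [False, False, False, False, False, False, True, True, True, True]
predicted = [False, False, False, False, False, True, True, True, True, False]

matrix = confusion_matrix(actual, predicted)
overall = accuracy(actual, predicted)
```

The diagonal entries, `(False, False)` and `(True, True)`, are the correct predictions; the off-diagonal entries show which kind of mistake the model makes more often.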

Perhaps I want to remove night margins, but I still want to save the model into example churn. So I hit Enter, and away it goes, and it has already saved over the previous model of that name. Now that we've gone through two of the pre-populated stories, let's go back to the documentation and look at the commands that the ML Toolkit is bringing to the SPL language.

We can get to the docs easily by clicking the docs link at the top of the page. This takes us to Splunk docs, and we can go to the user's guide, getting started, and custom search commands. There are a couple of custom search commands: the Machine Learning Toolkit and Showcase extends the SPL language with fit, apply, summary, et cetera. Each of these commands is explained here.

Now, when we fit an algorithm to a set of data, there is usually a series of additional parameters that we can use. For example, when we fit logistic regression to the customer churn data, we specified which field we're going to predict, then specified all the fields that contribute to the prediction, and then saved the result under a particular model name. But there was also an additional option, fit_intercept, true or false, which we didn't talk about in the assistant.
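To show what an intercept option controls, here is a small Python sketch of how a logistic regression model turns features into a probability. In scikit-learn, setting fit_intercept to false means no constant term is added to the decision function; the sketch models that by fixing the intercept at zero. The weights, feature values, and function name are invented for illustration.

```python
import math


def churn_probability(features, weights, intercept=0.0):
    """Logistic model: sigmoid of the weighted feature sum plus an intercept.

    With the intercept fixed at 0.0, the decision function is forced
    through the origin, which is what fitting without an intercept implies.
    """
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and one hypothetical customer.
weights = [0.5, -0.25]
customer = [2.0, 4.0]

without_intercept = churn_probability(customer, weights)
with_intercept = churn_probability(customer, weights, intercept=-1.0)
```

With an intercept, the model can shift every prediction up or down by a constant amount before the features are considered, which usually matters when one class is much more common than the other.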

And if you're interested in using those additional parameters, you need to go in and actually look at the logistic regression estimator, found by following the link here to scikit-learn, which is where the underlying algorithms that we are exposing come from. In summary, the Splunk Machine Learning Toolkit and Showcase allows the user to use customized machine learning in their SPL searches.

The showcase is organized around the modeling assistants, which are here to give you a guided machine learning experience in Splunk. Happy Splunking.