Getting Started with Machine Learning at Splunk

I’m sure many of you have heard of our Machine Learning Toolkit (MLTK) app and may even have played around with it. Some of you might actually have production workloads that rely on MLTK without being aware of it, such as predictive analytics in Splunk IT Service Intelligence (ITSI) or MLTK searches in Splunk Enterprise Security.

A recurring theme during my time at Splunk — and something we often hear from colleagues who don’t work directly with MLTK — is that people are unsure where to start with machine learning (ML).

Here I’d like to take you through some of the concepts and resources that you might need to get familiar with to use MLTK in your Splunk instance. I’ll also highlight some of the new content we’re working on to help you get more insight from your data using ML.

What Makes ML Different?

Typically at Splunk, when you’re trying to analyze a dataset or find a needle in a haystack, a single SPL search is enough to get the information you need. With ML-based analytics though, you have to train an ML model first, which will subsequently be used to derive the insights you need.

This may seem like an overly complex process compared to what you usually do in Splunk, but it's really no different from using lookups! If you can create a lookup from your data that you later use to enrich search results (say, a list of IPs from known malicious sites that you then use to trigger alerts as new data comes in), then you can use ML. In more detail, the outputlookup command performs a broadly similar function to MLTK's fit command, while the lookup command parallels MLTK's apply command. If you're interested, you can learn more about exactly how fit and apply work here.
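To make the parallel concrete, here is a minimal SPL sketch. The lookup file, field, and model names (suspicious_ips.csv, bytes, example_model) are illustrative assumptions, not taken from the docs. The first pair writes and then reads a lookup; the second pair trains and then applies a model:

```spl
historical_data_search | outputlookup suspicious_ips.csv
new_data_search        | lookup suspicious_ips.csv src_ip

historical_data_search | fit DensityFunction bytes into example_model
new_data_search        | apply example_model
```

In both cases the first search builds a reusable artifact from historical data, and the second search uses that artifact to enrich or score new data as it arrives.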

What Have We Done to Help Already?

There is a whole host of content in MLTK to help you get started. Many of the showcases that ship with the app take you through guided examples of the model training and model application process. The Experiments and Smart Assistants are there to help you develop your own ML-based analytics, all via a guided user interface that means you don’t need to know how the fit and apply commands operate.

For those who are more comfortable with SPL, however, there is a wealth of content available on our Splunk Blogs site. Long-term stalwarts like the cyclical statistical forecasts and anomalies series provide detailed SPL examples that you can copy into your own environment, alongside more recent gems like a Splunk approach to baselines statistics and likelihoods on big data.

My personal favorite, though, is the wealth of detail in our .conf archives, where use cases and examples of how ML has been used to gain valuable insights come directly from our customers.

What Are We Doing Now?

MLTK tutorials! We’ve spent a load of time recently working through the most common use cases we see for ML at Splunk and have started documenting them as follow-along tutorials to make it easy for you to pick up how particular ML techniques and analytics can provide specific insights.

The first of these is an example of how you can detect anomalies in your data ingest pipelines. This is based on a superb piece of work described by Abe in his blog here, but we thought we’d give you all of the details for how you can implement it yourself in our MLTK docs too!

You can follow along with this tutorial to:

  1. Train a model that estimates how much data is being generated by a particular sourcetype at a particular time of day.
  2. Compare this estimate to the actual data volume to calculate a Z-score, which measures how many standard deviations the actual value is from the estimate.
  3. Put all of this together in a dashboard or an alert that can help you identify periods of time when a given sourcetype is creating either too much or too little data compared to what is expected.
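The steps above can be sketched in SPL roughly as follows. This is not the tutorial's exact search: the model name, the one-hour span, the choice of LinearRegression, and the Z-score threshold of 3 are all illustrative assumptions. The first search trains the model on historical ingest volumes; the second applies it to recent data and flags large deviations:

```spl
| tstats count AS data_volume WHERE index=_internal BY _time span=1h, sourcetype
| eval HourOfDay = strftime(_time, "%H")
| fit LinearRegression data_volume from HourOfDay sourcetype into volume_model

| tstats count AS data_volume WHERE index=_internal BY _time span=1h, sourcetype
| eval HourOfDay = strftime(_time, "%H")
| apply volume_model
| eval residual = data_volume - 'predicted(data_volume)'
| eventstats avg(residual) AS mean_res, stdev(residual) AS sd_res BY sourcetype
| eval zscore = (residual - mean_res) / sd_res
| where abs(zscore) > 3
```

The final where clause is what you would wire into a dashboard panel or alert: any sourcetype producing far more or far less data than its model predicts for that hour surfaces in the results.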

I’d encourage you to check out the article and try it out for yourself!

In addition to this tutorial, we have also provided some content for our advanced users. I have often been asked by customers if it's possible to train a model outside of Splunk and bring it to MLTK. Well, we've now provided guidance on how to do this by extending our ML-SPL API. Our amazing Security Research Team recently put together some content for detecting potentially malicious command line strings by first training a model outside of Splunk, then importing the trained model into MLTK, and we thought that we should share the goodness with you all!

There are three phases to bringing a model to Splunk:

  1. Train a model in your environment of choice.
  2. Encode that model so that it can be read by MLTK, noting that you may need to add a custom algorithm to MLTK as well.
  3. Drop the model into the lookups folder of the app you want to use it in.
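Once the encoded model file is in the app's lookups folder, it behaves like any other MLTK model and can be invoked with apply. A hypothetical example, assuming the imported model was saved under the name cmdline_model and that your events carry a process field holding the command line string:

```spl
index=endpoint sourcetype=process_events
| apply cmdline_model
| where predicted_label="malicious"
```

The predicted_label field name and "malicious" value here are placeholders; the actual output fields depend on how the external model and any custom algorithm were defined.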

Now go ahead and start bringing your pre-trained models to Splunk.

So What Next?

Well, first of all, I'd recommend that you go and download MLTK and get started by trying out the ingest anomaly detection technique that we've wrapped up for you in our docs. Check out this awesome Tech Talk if you want to find out about some alternative ways of detecting anomalies in your data. Keep an eye out for more of these tutorials, too; we will be releasing them over the coming weeks and months.

I’d also encourage you to grab the latest release of the Security ES Content Update pack, where you can find our pre-trained MLTK model with the new analytic for detecting potentially malicious command line strings. If you’re feeling really adventurous, you could also try training a model and bringing it to Splunk using our docs.

I’m sure you will have also seen that .conf22 is happening a little earlier than usual this year, from June 13-16. As with most years, we hope to celebrate some of our amazing customer wins with MLTK, so watch out for ML-focused talks once the sessions are confirmed!

Happy Splunking!
