MACHINE LEARNING

Leveraging External Data Science Stacks with Splunk Today

Are you a data scientist who wants to use the data in Splunk without writing SPL? Maybe in a research or experimentation platform? Well, this blog should help.

I’m going to assume that using the Splunk Machine Learning Toolkit (MLTK) with the Python for Scientific Computing Add-on isn't the option you want to pursue. Maybe you’re not motivated to learn SPL, or maybe you want to keep working in your established research and experimentation environment. The Splunk MLTK is built with workflows (we call them Assistants) that guide SPL writers through getting data in, experimenting with and validating their models, and operationalizing their machine learning models quickly in Splunk. From there they can apply the learned insights to real-time data and take advantage of the platform's many operational offerings: scheduled and triggered re-training, alerts, dashboards, reports, and premium product workflows like Splunk IT Service Intelligence or Splunk Enterprise Security, all the things that make Splunk the best machine data analytics platform.

There is one bit about Splunk we have to start out with—how do you get access to the data? It’s a secure platform, so you do need a few things.

Making Friends with Your Splunk Admin

You are going to need three things from your friendly neighborhood Splunk Admin: 

  1. An account with the right credentials to access the data,
  2. Access to the REST endpoint for your Splunk deployment (a quick connectivity check is sketched below),
  3. Some light SPL work from the Admin (or friendly power user) to fill your data frame. Don’t run—SPL won’t bite you.  
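
Before you write any real code, it's worth checking that you can reach the REST (management) port at all. Here's a minimal sketch in Python, assuming the default management port 8089 and the read-only /services/server/info endpoint; swap in whatever host, port, and credentials your admin gives you:

import requests

# Assumed connection details; replace with whatever your Splunk Admin gives you
HOST = "localhost"
PORT = 8089
USERNAME = "admin"
PASSWORD = "AndrewsTerriblePassword"

# /services/server/info is a read-only endpoint that makes a handy connectivity test.
# verify=False skips TLS certificate validation, which is common on test instances
# with self-signed certificates; use real certificates in production.
response = requests.get(
    f"https://{HOST}:{PORT}/services/server/info",
    auth=(USERNAME, PASSWORD),
    verify=False,
)
print(response.status_code)  # 200 means both the port and the credentials work

If this returns 200, the rest of this blog should be smooth sailing.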

Once data gets into Splunk, you can really only export or manipulate it through SPL. Don’t worry, this isn’t a requirement to get you to learn to use Splunk or SPL (I mean, you should, but I’m biased). Just explain the schema you need to your admin or power user and they can generally write the two or three lines of SPL you will need. If you're looking for, say, the CPU, memory, and login rate for your IP gateway every minute for 30 days, they might write the following for you (shown here as an example using the call data set from an earlier blog I wrote called "Cyclical Statistical Forecasts and Anomalies - Part 1").

| inputlookup CallCenter.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats avg(count) as avg stdev(count) as stdev by HourOfDay,BucketMinuteOfHour,DayOfWeek,source

I’m showing you the SPL because we'll be using that string in some of our examples below. Remember that you need to download the sample data and create a lookup in Splunk to follow the exact steps below—you can, of course, simply change the SPL to whatever your friendly admin writes for whatever data you are after.

Red or Blue Pill, REST and Dataframe Dating Games

I like dataframes. At least one person reading this blog will say, “What about NumPy arrays or matrix objects for linear algebra?” and of course you can easily swap NumPy or any other data object/container into the examples we are about to walk through. But I like dataframes. The Splunk MLTK is really all about seamlessly bridging back and forth from Splunk’s SPL stream to a dataframe against which you can run anything you want via the MLSPL API. We can also use Splunk's REST endpoint to fill a dataframe in your research stack. Let’s see what that looks like.

The Blue Pill (Python)
After you start up your Jupyter Notebook (Download):

!pip install splunk-sdk

import splunklib.results as results
import splunklib.client as client
import io, os, sys, datetime, math, time

# Data Manipulation
import random
import numpy as np
import pandas as pd

# Your Splunk Instance
HOST = "localhost"
PORT = 8089
USERNAME = "admin"
PASSWORD = "AndrewsTerriblePassword"

# Create a Service instance and attempt connection to Splunk
try:
    service = client.connect(host=HOST, port=PORT, username=USERNAME, password=PASSWORD)
    print("Connection Successful")
except Exception as e:
    print(str(e))

# Function to perform a Splunk search
def execute_query(searchquery_normal,
                  kwargs_normalsearch={"exec_mode": "normal"},
                  kwargs_options={"output_mode": "csv", "count": 100000}):
    # Execute the search
    job = service.jobs.create(searchquery_normal, **kwargs_normalsearch)

    # A normal search returns the job's SID right away, so we need to poll for completion
    while True:
        while not job.is_ready():
            pass
        stats = {"isDone": job["isDone"], "doneProgress": float(job["doneProgress"]) * 100,
                 "scanCount": int(job["scanCount"]), "eventCount": int(job["eventCount"]),
                 "resultCount": int(job["resultCount"])}
        status = ("\r%(doneProgress)03.1f%%   %(scanCount)d scanned "
                  "%(eventCount)d matched   %(resultCount)d results") % stats

        sys.stdout.write(status)
        sys.stdout.flush()
        if stats["isDone"] == "1":
            sys.stdout.write("\nDone!")
            break
        time.sleep(0.5)

    # Get the results as CSV and clean up the job
    csv_results = job.results(**kwargs_options).read()
    job.cancel()
    return csv_results

splunk_query = """
| inputlookup CallCenter.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats avg(count) as avg stdev(count) as stdev by HourOfDay,BucketMinuteOfHour,DayOfWeek,source
"""

csv_results = execute_query(splunk_query)
# job.results() returns bytes, so wrap them in a buffer for pandas
csv_results_pandas = pd.read_csv(io.BytesIO(csv_results), encoding='utf8', sep=',', low_memory=False)

Throw a party! You have a dataframe filled with data from Splunk!
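
Before the party gets going, a quick, optional sanity check never hurts. The column names in this little snippet come from the example search above; yours will match whatever SPL your admin wrote:

# Quick sanity check on the freshly filled dataframe
print(csv_results_pandas.shape)    # number of rows and columns returned by the search
print(csv_results_pandas.columns)  # should include HourOfDay, BucketMinuteOfHour, DayOfWeek, source, avg, stdev
csv_results_pandas.head()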

The Red Pill (R)
Maybe you are an R person; more power to you. I’m a huge fan of IDEs, so I encourage you to use RStudio or another IDE product.

Example R code (Download): 

search_now <- function(search_terms, ...) {
 require(httr)
 splunk_server <- "https://localhost:8089"
 username    <- "admin"
 password    <- "AndrewsTerriblePassword"
 search_job_export_endpoint <- "servicesNS/admin/search/search/jobs/export"
 response <- GET(splunk_server,
                 path=search_job_export_endpoint,
                 encode="form",
                 config(ssl_verifyhost=FALSE, ssl_verifypeer=0),
                 authenticate(username, password),
                 query=list(search=paste0(search_terms, collapse="", sep=""),
                            output_mode="csv"),
                 verbose(), ...)
 result <- read.table(text=content(response, as="text"), sep=",", header=TRUE,
                      stringsAsFactors=FALSE)
 return(result)  
}
call_center_search <- '| inputlookup CallCenter.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats avg(count) as avg stdev(count) as stdev by HourOfDay,BucketMinuteOfHour,DayOfWeek,source'
call_center_data <- search_now(call_center_search)
summary(call_center_data)

Throw a party! You have a dataframe filled with data from Splunk!

Most Common Issues with These Pills

There are two common failure points when getting these workflows set up with Splunk.

First, did you make friends with your Splunk Admin? You need the correct port configuration (perhaps a firewall configuration to allow you to connect), an account with data access AND the correct SPL. Your admin can test the SPL directly in Splunk—if it works there with your account, then it should work in your custom data science environment.

Second, and this is for the Splunk Admin or power user working with you: depending on the search, you may need to prepend the SPL command "search" to the search string sent to the REST API.

So in R, for example, you might want:

                 query=list(search=paste0("search ", search_terms, collapse="", sep=""),

Remember, the search string you send through the API should also run directly in the Splunk search window, and you can confirm what was actually received by running the | history command in Splunk to triage. Other common SPL options to add to the search string you send include earliest/latest, index, sourcetype, etc.
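
The same idea applies on the Python side. Here is a hedged sketch that reuses the execute_query() function from the Python example above; the index, sourcetype, and time range are placeholders for whatever your admin gives you:

# Hypothetical example: an event search (not a generating command like inputlookup)
# needs the leading "search" keyword when sent over the REST API.
event_query = ("search index=main sourcetype=my_sourcetype earliest=-30d@d latest=now "
               "| stats count by host")
event_results = execute_query(event_query)
event_results_pandas = pd.read_csv(io.BytesIO(event_results), encoding='utf8', sep=',', low_memory=False)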

What About Operationalization?

I mentioned the MLSPL API before. Great examples of using this framework to wrap custom dataframe manipulations or algorithms can be found here. Note how simple the examples are: all the data from the Splunk pipeline is passed into the dataframe with some automatic preprocessing cleanup. You, as a data scientist, just have to write your mathematical transformations against the dataframe object df, serializing your learned models with the codecs provided. You can even make data manipulations directly in your dataframe (.iloc[] or what have you) and return those transformations directly to Splunk, perhaps to leverage Splunk's alerting. Have fun!
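
To give a feel for the shape of this, here is a minimal, hedged sketch of what a custom algorithm can look like, assuming the BaseAlgo-style interface described in the MLTK custom algorithm documentation (the class name, option keys, and output field below are illustrative, and details may vary across MLTK versions):

from base import BaseAlgo


class MyCustomAlgo(BaseAlgo):
    """Illustrative skeleton only: Splunk hands fit() and apply() a pandas dataframe (df)."""

    def __init__(self, options):
        # options carries the parameters passed from SPL, e.g. "... | fit MyCustomAlgo avg stdev"
        # 'feature_variables' is assumed here to hold the field names from the fit command
        self.feature_fields = options.get("feature_variables", [])

    def fit(self, df, options):
        # Write your mathematical transformations against the dataframe object df,
        # for example learning per-field means to reuse later in apply()
        self.means_ = df[self.feature_fields].mean()
        return df  # returning a dataframe sends results straight back to the Splunk pipeline

    def apply(self, df, options):
        # Apply the learned model to new data coming down the Splunk pipeline
        df["deviation"] = (df[self.feature_fields] - self.means_).abs().sum(axis=1)
        return df

Serializing the learned model between fit and apply would go through the codecs the MLTK provides, which I've left out of this sketch.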

The Splunk MLTK Container for TensorFlow™ (available via Splunk’s Professional Services) will also get you a base container running Python 3 and other libraries that you can customize, with GPU/multi-CPU access for accelerated ML workflows. It keeps the easy operationalization option from the SPL command line, so your Splunk Admin and power users can quickly put your machine learning into a production environment.

Many thanks to the Splunkers who attended the analytical bootcamps over the years, especially Tyler Muth for his “What about Rrrrrrr!” and Daniel Martinez Formoso (we miss you!) for his notebook work.
 


This blog was written by Andrew Stein and Manish Sainani

Posted by Andrew Stein

Prior to this role, Andrew served as a Splunk SE Architect for Business Analytics, IoT, and Machine Learning. He has spent the last 18 years building and selling machine learning outcomes in startups in finance and IoT. When not writing PRDs or working with customers, Andrew grows exotic fruit and tends to the demands of his feline owners.
