
Simple Correlation in Splunk

As I promised at .conf, I’m going to start posting a series on writing effective correlation searches, in the hopes that I will get better at doing so.

First, framework. Alberto Cairo’s The Functional Art has a good summation of DIKW (Data, Information, Knowledge, Wisdom) Hierarchies. In short, we’re going to structure our search in a way that lets us gather Data, structure Information, and return Knowledge. This is what I called the correlation three-step in my .conf talk on Technology Add-ons: Gather a pool of Data, structure or extract Information for testing, test to acquire Knowledge. Hopefully that will lead to Wisdom, but any gaps are left as an exercise for the reader.

In order to keep it simple, let’s work with a basic correlation across two sorts of data. My wife and mother-in-law are rabid San Francisco Giants fans, so let’s see… what’s the correlation between Matt Cain pitching a strikeout and someone in the house hitting a related website?

It just so happens I’ve got both of those data sources, thanks to Splunk for MLB Statistics and Splunk for Squid, so let’s walk through the correlation three-step.

Gather a pool:

With a little searching, I find that Major League Baseball stats are in one index, and there are sourcetypes that let me filter down to the most interesting stuff. A couple of simple searches reveal Matt Cain’s player number, and the log format makes it easy to construct a good search for Cain pitching a strikeout:

index="mlb_stat" sourcetype="game_events_*" pitcher=430912 event=Strikeout

I’ve already worked enough with Squid data to know that I’ve got all the events and only the events I want in sourcetype=squid, so all I need to do there is specify sites. Since I’ve written a Technology Add-on for my Squid data, I have a dest_host field that gives me just what I want.

sourcetype=squid (dest_host=*sfgiants.com OR dest_host=*majorleaguebaseball.com)

Now I just want to OR them together into a pool:

(index="mlb_stat" sourcetype="game_events_*" pitcher=430912 event=Strikeout) OR (sourcetype=squid (dest_host=*sfgiants.com OR dest_host=*majorleaguebaseball.com))
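If it helps to see the pooling logic outside of Splunk, here is a minimal Python sketch of the same idea: two predicates ORed together over one list of events. The sample events, the sourcetype name game_events_2012, and the helper names are all made up for illustration; only the field names and wildcard patterns come from the searches above. The matching here is case-sensitive, which is a simplification.

```python
from fnmatch import fnmatchcase

# Hypothetical sample events: field names follow the searches above,
# but every value (including "game_events_2012") is invented.
events = [
    {"index": "mlb_stat", "sourcetype": "game_events_2012",
     "pitcher": "430912", "event": "Strikeout"},
    {"index": "mlb_stat", "sourcetype": "game_events_2012",
     "pitcher": "430912", "event": "Walk"},
    {"index": "main", "sourcetype": "squid", "dest_host": "www.sfgiants.com"},
    {"index": "main", "sourcetype": "squid", "dest_host": "example.com"},
]


def is_strikeout(e):
    # Mirrors the first parenthesized clause of the search.
    return (e.get("index") == "mlb_stat"
            and fnmatchcase(e.get("sourcetype", ""), "game_events_*")
            and e.get("pitcher") == "430912"
            and e.get("event") == "Strikeout")


def is_site_visit(e):
    # Mirrors the second clause: squid events for either wildcard host.
    host = e.get("dest_host", "")
    return (e.get("sourcetype") == "squid"
            and (fnmatchcase(host, "*sfgiants.com")
                 or fnmatchcase(host, "*majorleaguebaseball.com")))


# The OR between the two clauses is a union: keep an event if either matches.
pool = [e for e in events if is_strikeout(e) or is_site_visit(e)]
```

The point of the pool is that nothing is being compared yet; it is just every event that could take part in the correlation.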

Structure it for testing:

Now I’m into some harder stuff, at least for me: I want to sort the events into time buckets. The search isn’t hard to read, but it did take me a little documentation searching and experimentation to get it right. To make sure I was getting data, I started with a bucket size of one day, then fell back to an hour when I was happy with the results. Note that I’m looking for correlation rather than causation, so the order of events isn’t necessarily a given. Still, I’ve sorted by _time so that I can see whether the site visit occurred first or the strikeout did.

(index="mlb_stat" sourcetype="game_events_*" pitcher=430912 event=Strikeout) OR (sourcetype=squid (dest_host=*sfgiants.com OR dest_host=*majorleaguebaseball.com)) | sort _time | bucket _time span=1h

This is still an event list, though; I have one more step before it’s easily tested:

(index="mlb_stat" sourcetype="game_events_*" pitcher=430912 event=Strikeout) OR (sourcetype=squid (dest_host=*sfgiants.com OR dest_host=*majorleaguebaseball.com)) | sort _time | bucket _time span=1h | eventstats dc(sourcetype) AS sourcetypecount by _time

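Now I have sourcetypecount, a field that tells me when my two types of events land in the same bucket. In theory it could break if MLB changes its format (for example, if strikeouts started arriving under more than one sourcetype matching game_events_*, the count could exceed 1 with no site visit at all), so be aware of that if you’re ever doing this FOR REAL WITHOUT A NET.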
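For anyone who wants to see what the bucket and eventstats steps are doing, here is a rough Python model of them. It is a sketch under assumptions, not how Splunk implements anything: the timestamps, the sourcetype name game_events_2012, and the function names are invented. The behavior it tries to show is that eventstats keeps every event and adds the per-bucket distinct count to each one, rather than collapsing the events into a summary.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical pooled events; timestamps and sourcetype names are invented.
pool = [
    {"_time": datetime(2012, 6, 1, 19, 5), "sourcetype": "game_events_2012"},
    {"_time": datetime(2012, 6, 1, 19, 40), "sourcetype": "squid"},
    {"_time": datetime(2012, 6, 1, 21, 15), "sourcetype": "squid"},
]


def bucket_hour(ts):
    # Like `bucket _time span=1h`: snap the timestamp down to the top of its hour.
    return ts.replace(minute=0, second=0, microsecond=0)


def add_sourcetype_count(events):
    # Sort by time, then overwrite _time with its hour bucket.
    ordered = sorted(events, key=lambda e: e["_time"])
    bucketed = [dict(e, _time=bucket_hour(e["_time"])) for e in ordered]

    # Collect the distinct sourcetypes seen in each bucket.
    seen = defaultdict(set)
    for e in bucketed:
        seen[e["_time"]].add(e["sourcetype"])

    # Like eventstats: every event survives and gains the per-bucket count.
    return [dict(e, sourcetypecount=len(seen[e["_time"]])) for e in bucketed]


annotated = add_sourcetype_count(pool)
```

In this made-up sample, the two events in the 19:00 bucket get a count of 2, and the lone 21:00 site visit gets a count of 1.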

Test it:

All that’s left is a test: if sourcetypecount is over 1, both kinds of event landed in the same hour, and we’ve got correlation. Form a results table and you’re ready to use this as a trigger for a script:

(index="mlb_stat" sourcetype="game_events_*" pitcher=430912 event=Strikeout) OR (sourcetype=squid (dest_host=*sfgiants.com OR dest_host=*majorleaguebaseball.com)) | sort _time | bucket _time span=1h | eventstats dc(sourcetype) AS sourcetypecount by _time | where sourcetypecount>1 | table sourcetypecount,_time,des,dest_host
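The where and table steps at the end can be modeled in a few lines of Python as well. This is only a sketch: the sample rows are invented, the helper name is hypothetical, and I am assuming des is a description field on the MLB events. It shows the filter keeping only events from buckets with more than one sourcetype, then projecting the named columns, with a field left empty when an event does not carry it.

```python
# Hypothetical events that already carry sourcetypecount; all values invented.
annotated = [
    {"_time": "2012-06-01 19:00", "sourcetype": "game_events_2012",
     "sourcetypecount": 2, "des": "strikes out swinging"},
    {"_time": "2012-06-01 19:00", "sourcetype": "squid",
     "sourcetypecount": 2, "dest_host": "www.sfgiants.com"},
    {"_time": "2012-06-01 21:00", "sourcetype": "squid",
     "sourcetypecount": 1, "dest_host": "www.sfgiants.com"},
]

# Same column order as the table command in the search above.
COLUMNS = ["sourcetypecount", "_time", "des", "dest_host"]


def correlated_rows(events, columns=COLUMNS):
    # Like `where sourcetypecount>1`: keep events from buckets holding both kinds.
    hits = [e for e in events if e["sourcetypecount"] > 1]
    # Like `table ...`: keep only the named columns; missing fields become None.
    return [{c: e.get(c) for c in columns} for e in hits]


rows = correlated_rows(annotated)
```

Each surviving row is either a strikeout with an empty dest_host or a site visit with an empty des, which is why both columns are in the table.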

Next time I’ll go into a more complex correlation; until then, that MLB data source is pretty fun to play around with!

[Figure: Results table]

----------------------------------------------------
Thanks!
Jack Coates

Posted by Splunk