This guide describes using Splunk to generate data which can be used by the code_swarm visualization package. While code_swarm was designed for visualizing code repositories, it can also be used to visualize other data sets which contain a many-to-moderate relationship between different event objects.
This guide also introduces you to building custom endpoint handlers with Splunk's AppBuilder, and illustrates how to manipulate and format the data coming out of Splunk via REST. You will be required to cut and paste code in this guide, but knowing how to code is not required.
The following samples were generated from combined access logfiles from Zoto, a photo sharing website. Several interesting visualizations can be created illustrating event object relationships, including those between user accounts, photos, and IP addresses accessing the site.
A short video is available over on Vimeo showing IPs accessing files on a web server.
IP addresses accessing photos in specific user accounts. Bright dots are the more popular accounts, which are being accessed by multiple IP addresses.
Accounts and their photos. Dots are unique photos belonging to a particular account, and collect around the account as they are accessed.
The code_swarm package was designed for animating the 'replay' of code repository checkins for a given software project. Written specifically for code repository logs, it is useful for visualizing large numbers of commits occurring over a given time range, by a large number of authors, to a moderate set of files. As long as your data set closely resembles that of a code repository (i.e. large numbers of events representing a moderate number of resources accessed by a large number of entities over time), you should be able to visualize it with code_swarm and Splunk.
In the following example, we'll take a similar data set - accesses to a website, by a large number of IPs, from a moderate number of page referrers.
The code_swarm package should run fine on Linux, Windows, and OSX. To install it, you'll need to checkout code_swarm from Google Code via SVN:
$ svn checkout http://codeswarm.googlecode.com/svn/trunk/ codeswarm-read-only
Next, you'll need to test that code_swarm runs OK on your box:
$ cd codeswarm-read-only/
$ ./run.sh
You should get an output window showing a sample animation. If you have difficulties, try visiting the code swarm site.
You're going to need to have Splunk downloaded and installed for this next part, so go off and do that if you haven't already. Also, now is as good a time as any to set your environment variables for working with Splunk. We're assuming here that you have Splunk installed in /opt/splunk, so adjust the path if necessary.
$ source /opt/splunk/bin/setSplunkEnv
If you don't want to mess around with using AppBuilder, you can simply download the Splunk Swarm app, untar it, and put it in your $SPLUNK_HOME/etc/apps directory. Once you've done that you'll need to restart Splunk:
$ $SPLUNK_HOME/bin/splunk restart
Skip to the Swarm Mashing section below to continue.
Next, we'll need to use AppBuilder to tell Splunk how to export the XML format that code_swarm uses to generate its visualizations. AppBuilder is a simple script-based utility for creating Splunk-based applications. Start by downloading the AppBuilder tarball. Untar it, put it in your Splunk 'bin' directory, and make it executable. Your mileage may vary depending on your particular operating system and the tools installed.
$ curl -O http://splunk-appbuilder.googlecode.com/files/app_builder.0.6.tar.gz
$ tar xvfz app_builder.0.6.tar.gz
$ mv appbuilder $SPLUNK_HOME/bin/
$ chmod 755 $SPLUNK_HOME/bin/appbuilder/app_builder.py
Now run appbuilder:
$ $SPLUNK_HOME/bin/appbuilder/app_builder.py
You should get the usage help, with no errors displayed, and be dumped back to the command prompt.
Using AppBuilder, we're going to create a few endpoints which will allow splunkd to serve the XML content we'll need for code_swarm. Begin by creating a 'swarm' application with AppBuilder:
$ $SPLUNK_HOME/bin/appbuilder/app_builder.py create swarm
When AppBuilder asks you if you want to restart Splunk, say no. We're going to add an endpoint handler first:
$ $SPLUNK_HOME/bin/appbuilder/app_builder.py add swarm endpoint xml
This time, say yes to restarting Splunk. While it's restarting, copy the following code and use it to replace the handle_GET method of the handler class in the xml_handler.py file located in $SPLUNK_HOME/etc/apps/swarm/rest/.
    def handle_GET(self):
        # enter your search here - use the test.py script to test your search/results
        search_string = 'search sourcetype="access_combined" startminutesago=120 | where referer_domain > "" | where clientip > "" | where file > ""'

        # start our job
        job = search.dispatch(search_string, sessionKey=self.sessionKey)
        while not job.isDone:
            logger.debug('hanging out till search comes back...')
            time.sleep(1)

        # start XML object - we'd use some XML lib, but this is way easier
        output = ['<?xml version="1.0"?>\r\n<file_events>\r\n']

        # build the XML up
        for result in job.results:
            filename = result['clientip']  # aliasing clientip to filename
            author = result['referer_domain']  # aliasing referer_domain to author
            # haul in epoch date
            epoch = int(splunk.util.dt2epoch(result.time))
            output.append('\t<event filename="%s" date="%s" author="%s"/>\r\n' % (filename, epoch, author))

        # clean up
        job.cancel()

        # finish XML object
        output.append('</file_events>\r\n')
        self.response.setStatus(200)
        self.response.setHeader('content-type', 'text/xml')
        self.response.write(output)
Also, make sure the following is located at the top of the xml_handler.py file:
from splunk import auth, search
import splunk.rest
import splunk.util
import time
import logging as logger
Briefly, what we're doing here is telling splunkd to run a search using the search_string variable, wait until the search has completed, and then iterate through the results, outputting basic XML that uses clientip and referer_domain for the code_swarm filename and author, respectively.
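The handler interpolates field values straight into XML attributes, so a value containing an ampersand, angle bracket, or double quote could produce a file that code_swarm cannot parse. The following is a minimal sketch of the same results-to-XML step with attribute escaping, written so it can be run outside Splunk. The build_events_xml function and the sample rows are hypothetical names used for illustration; they are not part of Splunk or AppBuilder.

```python
from xml.sax.saxutils import quoteattr


def build_events_xml(rows):
    """Build a code_swarm-style event document from (filename, epoch, author) tuples.

    Hypothetical helper for illustration; not part of Splunk or AppBuilder.
    """
    output = ['<?xml version="1.0"?>\r\n<file_events>\r\n']
    for filename, epoch, author in rows:
        # quoteattr() escapes &, < and > and supplies the surrounding quotes itself
        output.append('\t<event filename=%s date=%s author=%s/>\r\n' % (
            quoteattr(filename), quoteattr(str(int(epoch))), quoteattr(author)))
    output.append('</file_events>\r\n')
    return ''.join(output)


if __name__ == '__main__':
    # Made-up sample rows standing in for Splunk search results
    sample = [('10.0.0.1', 1230768000, 'example.com'),
              ('10.0.0.2', 1230768060, 'a&b.example.com')]
    print(build_events_xml(sample))
```

Inside the handler, the same idea would amount to passing filename and author through quoteattr before appending each event line.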
For this example to work, your Splunk install needs to be indexing some access_combined logfiles. These are typically generated by Apache and other similar web servers and load balancers. Notice that the XML handler code above contains a Splunk search. Without the leading search command and the trailing filter on file, it looks like this:
sourcetype="access_combined" startminutesago=120 | where referer_domain > "" | where clientip > ""
Copy and paste this search into your Splunk UI and see what you get back. Ideally, you'll get back a few thousand events from the last couple of hours. If you have too many events, adjust the number of minutes ago to something reasonable for your given data set.
Make sure you have clientip and referer_domain extracted. If you need to, use the 'fields' pulldown on the left to select them. These checkboxes only modify the Splunk UI, but they show you what fields you have available for extraction.
If you are getting more than 10,000 results with your search, you might want to drop the number of minutes you are querying to something smaller. While code_swarm works fine with tens of thousands of results, you'll need to do a little tweaking if you want Splunk to return that many. However, if you are bound and determined to get more than 10,000 events visualized, create (or edit) the limits.conf file in your $SPLUNK_HOME/etc/system/local/ directory to contain the following:
[restapi]
# maximum result rows to be returned by /events or /results getters from REST API
maxresultrows = 50000

[search]
# the last accessible event in a call that takes a base and bounds
max_count = 50000
Keep in mind that setting these values higher causes Splunk to store more data on your system's drives during searches!
The endpoint for the XML file is based on your Splunk install's hostname and port. In this example we'll use www.foobar.com, with splunkd running on port 8089. Double check your own hostname and port, as they can vary depending on your install.
Here's what the URL would look like for www.foobar.com, port 8089:
https://www.foobar.com:8089/services/swarm/xml/
You can test this by entering the URL (modified for your particular install) in your browser. Keep in mind it may take a minute or two to load the data!
To make things simple, we'll use the wget command to pull the data over to our local machine where we'll run the code_swarm visualization. Splunk can do authentication via tokens, so we'll first connect, auth, grab a token, set foo equal to it, and then pass it to wget to get our XML file:
$ foo=`wget -O - -q --no-check-certificate --post-data="username=admin&password=changeme" https://www.foobar.com:8089/services/auth/login/ | egrep -o '[a-z0-9]{32}'`
$ wget -O /PATH/TO/CODE_SWARM/data/events.xml --no-check-certificate --header="Authorization: Splunk $foo" https://www.foobar.com:8089/services/swarm/xml/
Again, the last command may take a few minutes to complete based on your search, the number of results, and the speed of your server. Don't forget we're using an example domain here, so be sure to substitute your own hostname for foobar.com.
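If wget isn't available on your machine, the same two steps can be sketched in Python 3 using only the standard library. This is a rough equivalent of the commands above rather than anything shipped with Splunk: the extract_token and fetch_events names are made up for this example, the token is located with the same 32-character pattern the egrep uses, and certificate checking is switched off to mirror --no-check-certificate.

```python
import re
import ssl
import urllib.parse
import urllib.request

BASE_URL = 'https://www.foobar.com:8089'  # example host from this guide; substitute your own


def extract_token(login_body):
    """Return the first 32-character token in the login response (same pattern as the egrep)."""
    match = re.search(r'[a-z0-9]{32}', login_body)
    if match is None:
        raise ValueError('no session token found in login response')
    return match.group(0)


def fetch_events(username, password, dest_path):
    """Log in, grab a token, and save the swarm XML endpoint output to dest_path."""
    # Mirrors --no-check-certificate: skip TLS verification (only sensible for self-signed certs)
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    # Passing data makes this a POST, like wget --post-data
    creds = urllib.parse.urlencode({'username': username, 'password': password}).encode('ascii')
    with urllib.request.urlopen(BASE_URL + '/services/auth/login/', data=creds, context=ctx) as resp:
        token = extract_token(resp.read().decode('utf-8'))

    req = urllib.request.Request(BASE_URL + '/services/swarm/xml/',
                                 headers={'Authorization': 'Splunk ' + token})
    with urllib.request.urlopen(req, context=ctx) as resp, open(dest_path, 'wb') as out:
        out.write(resp.read())


# Example call (performs real network requests, so it is left commented out):
# fetch_events('admin', 'changeme', '/PATH/TO/CODE_SWARM/data/events.xml')
```

Keeping the token extraction in its own function makes it easy to check against a saved login response before pointing the script at a live server.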
The code_swarm package has a sample config file called sample.config. Either copy this to a new file, or edit the existing one to point at your new XML file pulled in from Splunk. Here are some suggested values to use for visualizing web access logs:
Width=640
Height=480
MillisecondsPerFrame=2
FileLife=300
PersonLife=150
HighlightPct=5
FileSpeed=7.0
PersonSpeed=2.0
Lastly, edit the input file parameter:
InputFile=data/events.xml
NOTE: At a minimum, you need to change the MillisecondsPerFrame parameter. If you don't, you will probably get a blank white screen which exits after 5-10 seconds.
Just as you did above to test, you can run code_swarm from its directory by simply typing:
$ ./run.sh
For larger data sets it may take a few seconds to parse and load the initial animation.
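If the animation comes up blank or looks sparse, it can help to check what actually landed in events.xml before tweaking the config. The sketch below, in Python 3, counts events, distinct filenames, and distinct authors, and reports the first and last timestamps. It assumes the file follows the layout the handler writes (a file_events root with event children carrying filename, date, and author attributes); summarize_events is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def summarize_events(xml_text):
    """Return (event_count, distinct_filenames, distinct_authors, first_date, last_date)."""
    root = ET.fromstring(xml_text)
    events = root.findall('event')
    filenames = {e.get('filename') for e in events}
    authors = {e.get('author') for e in events}
    dates = [int(e.get('date')) for e in events]
    first = min(dates) if dates else None
    last = max(dates) if dates else None
    return len(events), len(filenames), len(authors), first, last


# Example usage against the file pulled from Splunk (path is the one used in this guide):
# with open('data/events.xml') as f:
#     print(summarize_events(f.read()))
```

A very small number of distinct authors or filenames, or a first and last date that are nearly identical, usually means the search or the field mapping needs adjusting rather than the code_swarm settings.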
If you want to visualize other event fields extracted by Splunk, simply edit the XML handler code to use the proper fields. Splunk uses simple key lookups for fields it knows about. For example, change:
search_string = 'search sourcetype="access_combined" startminutesago=60 | where referer_domain > "" | where clientip > ""'
filename = result['clientip']
author = result['referer_domain']
To this:
search_string = 'search sourcetype="access_combined" startminutesago=60 | where clientip > "" | where file > ""'
filename = result['file']
author = result['clientip']
This graphs IP addresses as the text nodes and the different URIs in your system as dots, which end up surrounding the IP addresses as people visit and view pages on your site.
Try experimenting with different data sets and reversing the order of the filename/author mappings. Look for clustering, movement, and size of dots to indicate activity in your system.
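One way to make this kind of experimenting less error-prone is to keep the field names in one place, so that swapping the mapping is a one-line change. The sketch below uses plain Python dictionaries to stand in for Splunk results; to_event and the sample row are hypothetical and only illustrate the idea.

```python
def to_event(result, filename_field, author_field):
    """Map one search result (a plain dict here) to a code_swarm filename/author pair.

    Returns None when either field is missing or empty, mirroring the
    `where <field> > ""` filters in the search string.
    """
    filename = result.get(filename_field)
    author = result.get(author_field)
    if not filename or not author:
        return None
    return {'filename': filename, 'author': author}


# Made-up row standing in for one Splunk result
row = {'clientip': '10.0.0.1', 'file': 'index.html', 'referer_domain': 'example.com'}

# IP addresses as text nodes, requested files as dots
print(to_event(row, filename_field='file', author_field='clientip'))
# Reversed: files as text nodes, IP addresses as dots
print(to_event(row, filename_field='clientip', author_field='file'))
```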