Splunk is a Verb – Splunking Perforce Data

This is a continuation of my last posting on using Splunk to gather data from a source code repository, in our case Perforce (P4). Perforce not only gives you a simple CLI, but it also can return results as a serialized Python object by using the “-G” option. See (

Leveraging this, one can write a simple Python script to provide info on P4 activity as key/value pairs which Splunk indexes naturally. In BI parlance, this is your Extract (p4 command), Transform (unmarshalling to key/value pairs), Load (indexing of the data).

How one goes from P4 to getting data into Splunk is via a method called “Scripted Inputs.” (See: Documentation) Scripted inputs is a simple concept: whatever the script outputs is piped directly into Splunk. No need to write adapters, connectors, or what have you in order to get data into a database (doing transforms along the way using complicated tools)–if you write some simple Python, you can get information into Splunk quickly.

There are a few things to do here:

  • Write the script itself
  • Add the script to the inputs.conf file
  • Create an index especially for this data so we don’t put the data into the main index

Note: For purposes of this example, I didn’t want to display Splunk’s engineering data (even though there’s likely little one can glean from it). The fortunate thing is that Perforce, though not open source, provides free licenses to open source projects (free as in beer). I choose the Eigenbase project, simply because I’m familiar with the cool stuff they do. We’ve setup a Perforce proxy server for the eigenbase repository in EC2 for the purpose of this blog, and we’ll likely tear it down in a week or so. A trivial exercise is to have it run against any Perforce repository.

Okay, back to work. Let’s start with the script first. The following are some constants to define.

  # ----------------------------------------------------------
  # Constants

  # format to output dates in so splunk will index them
  splunkDateFormat = "%m/%d/%Y:%H:%M:%S"
  # set to zero if you want to get everything, but let's be nice and not start from 0
  lowestChangeNumber = 7000
  highestChangeNumber = 8000
  # p4 CLI - assume "p4" is in your path
  p4CLI = "p4 -u guest -P \"\""
  p4port = "" # note: this is a transient proxy server
  out = sys.stdout

That should be pretty straight forward. We’re basically setting values for P4, and Splunk itself (Actually for now, just focus on the P4 values). Now we have our main method:

def main(self):
      os.environ['P4PORT'] = self.p4port
      chno = self.lowestChangeNumber
      # loop across range of changes
      while chno < self.highestChangeNumber:
        time.sleep(1) # to be nice
        chno =  chno + 1
        f = os.popen(self.p4CLI + " -G describe " + str(chno))
        d = marshal.load(f)
        if (d['code'] == 'error'):
          if d['data'].find('no such changelist') == -1:

    except Exception, oops:

What this code does is the following:

  1. Sets the p4port
  2. Loops through the range of numbers (if you want to have the whole repository, you’d start from zero and the latest p4 change number)
  3. Checks for error upon retrieving the p4 change info
  4. if all is good pass it on to the output

The output is relatively simple in that we want to get things into “key=value” pairs. Thus we have two methods. The first method gets the description per p4 change, and the second method then gets the file info (There is a 1 to many relationship between change numbers and files):

  def outputChangeDescription(self, desc):
    fileNum = 0
      if desc.has_key('depotFile' + str(fileNum)):
        self.outputFileChangeDescription(desc, fileNum)
        fileNum += 1

  def outputFileChangeDescription(self, desc, fileNum):
    timestamp = datetime.fromtimestamp(int(desc['time'])).strftime(self.splunkDateFormat)
    self.out.write(' change="')
    self.out.write('" user="')
    self.out.write('" file="')
    filename = desc['depotFile' + str(fileNum)]
    self.out.write('" rev="')
    self.out.write(desc['rev' + str(fileNum)])
    self.out.write('" action="')
    self.out.write(desc['action' + str(fileNum)])
    dot = filename.rfind(".")
    if dot != -1:
      self.out.write(' ext="')
      self.out.write(filename[dot + 1:])

The output comes out like this:

06/23/2006:15:10:31 change="7001" user="zfong" file="//open/lu/dev/farrago/unitsql/optimizer/lcs.ref" rev="32" action
="edit" ext="ref"
06/23/2006:15:10:31 change="7001" user="zfong" file="//open/lu/dev/farrago/unitsql/optimizer/lcs.sql" rev="12" action
="edit" ext="sql"
06/23/2006:15:10:31 change="7001" user="zfong" file="//open/lu/dev/fennel/tuple/TupleData.cpp" rev="9" action="edit"
06/23/2006:15:11:20 change="7002" user="jack" file="//open/dt/dev/fennel/disruptivetech/calc/ExtWinAggFuncs.cpp" rev=
"4" action="edit" ext="cpp"
06/23/2006:15:11:20 change="7002" user="jack" file="//open/dt/dev/fennel/disruptivetech/test/CalcExtWinAggFuncTest.cp
p" rev="5" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/farrago/NativeMethods_lu.cpp"
rev="35" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxAggExecStream.cpp"
rev="7" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxAggExecStream.h" re
v="4" action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxHashBase.h" rev="4"
 action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxHashGenerator.h" re
v="4" action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxJoinExecStream.cpp"
 rev="14" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxJoinExecStream.h" r
ev="12" action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxPartition.cpp" rev=
"6" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxPartition.h" rev="5
" action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/test/LhxAggExecStreamTest.cpp"
 rev="3" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/test/LhxJoinExecStreamTest.cpp
" rev="8" action="edit" ext="cpp"
06/24/2006:10:59:11 change="7004" user="bchow" file="//open/mondrian/testsrc/main/mondrian/test/
" rev="5" action="edit" ext="java"

You’ll notice that each “event” is at a file change level of granularity. I’ll show later that it’s easy to tease out aggregates at the change set level, rather than the file level.

In developing the script, there’s lots of debugging and you can’t simply run the script on its own, since it may have dependencies on the calling process (in this case Splunk). So one key trick to developing any scripted input is the following Splunk commandline:

./splunk cmd python ../etc/apps/search/bin/

If it outputs successfully to standard out, then you are ready to Splunk that data. Before learning this command, trying to debug a scripted input was quite hellish.

In order to register the script for Splunk to run, you need to make two changes: register the script, and create an index for the data to go to.

To add a stanza in inputs.conf, you can simply copy etc/apps/search/default/inputs.conf to etc/apps/search/local/inputs.conf and add the following:

interval = -1
sourcetype = p4-sample
source =
index = p4-sample
disabled = false

Note that the above “interval = -1” indicates that the script will be executed a single time upon startup. (which also implies if you restart, it’ll be re-executed). I mostly did this for simplicity.

And from Manager, create a new index called “p4-sample”. If you look in your indexes.conf file (in etc/system/local), you’ll see a stanza:

coldPath = $SPLUNK_DB/p4-sample/colddb
homePath = $SPLUNK_DB/p4-sample/db
thawedPath = $SPLUNK_DB/p4-sample/thaweddb

One thing to note is why I bothered to define another index: this allows you to blow away the data and not affect anything else you may have been working on. If you mess up, just type:

./splunk clean eventdata p4-sample

With that, you can do a simple search of “index=p4-sample” over all time, and get the following chart:

Files Changed Over Time

Files Changed Over Time

The spikes are very pronounced, probably due to mass updates of files at certain times. You can get the change info granularity by using the “dedup” command, and enter this in the search bar: “index=p4-sample | dedup change”, which yields a more informative graph:

index=p4-sample | dedup change
Changes Over Time

Changes Over Time

You can also do a top 10 report very simply. Try the following command:

 index=p4-sample | dedup change | top 10 user
Top Ten Submitters

Top Ten Submitters

This sample is of course a toy sample, since as I said, it’ll reindex the range of changes at each startup. However, from some very simple search strings, you immediately get visibility into what’s going on.

I’ll show in the next blogs how to retrieve at startup time, and then incrementally update as further changes are made to the repository. I’ll also show how to do what’s known as “field extractions” to give you additional fields that can be searched upon.

Credits: ElinRydberg, PatrickCalahan for scripting help, JohnSichi for the Eigenbase data, JamesTerry, GregAlbrecht for the EC2 P4 proxy.

Boris Chen

Posted by