Tips & Tricks

August 10, 2010

6 Minute Read

Splunk is a Verb – Splunking Perforce Data

By Splunk

This is a continuation of my last posting on using Splunk to gather data from a source code repository, in our case Perforce (P4). Perforce not only gives you a simple CLI, but it also can return results as a serialized Python object by using the “-G” option. See (http://kb.perforce.com/article/585/using-p4-g)

Leveraging this, one can write a simple Python script to provide info on P4 activity as key/value pairs which Splunk indexes naturally. In BI parlance, this is your Extract (p4 command), Transform (unmarshalling to key/value pairs), Load (indexing of the data).

How one goes from P4 to getting data into Splunk is via a method called “Scripted Inputs.” (See: Documentation) Scripted inputs is a simple concept: whatever the script outputs is piped directly into Splunk. No need to write adapters, connectors, or what have you in order to get data into a database (doing transforms along the way using complicated tools)–if you write some simple Python, you can get information into Splunk quickly.

There are a few things to do here:

Write the script itself
Add the script to the inputs.conf file
Create an index especially for this data so we don’t put the data into the main index

Note: For purposes of this example, I didn’t want to display Splunk’s engineering data (even though there’s likely little one can glean from it). The fortunate thing is that Perforce, though not open source, provides free licenses to open source projects (free as in beer). I choose the Eigenbase project, simply because I’m familiar with the cool stuff they do. We’ve setup a Perforce proxy server for the eigenbase repository in EC2 for the purpose of this blog, and we’ll likely tear it down in a week or so. A trivial exercise is to have it run against any Perforce repository.

Okay, back to work. Let’s start with the script first. The following are some constants to define.

  # ----------------------------------------------------------
  # Constants

  # format to output dates in so splunk will index them
  splunkDateFormat = "%m/%d/%Y:%H:%M:%S"
  # set to zero if you want to get everything, but let's be nice and not start from 0
  lowestChangeNumber = 7000
  highestChangeNumber = 8000
  # p4 CLI - assume "p4" is in your path
  p4CLI = "p4 -u guest -P \"\""
  p4port = "eigenbase-p4proxy.splunk.com:1666" # note: this is a transient proxy server
  out = sys.stdout

That should be pretty straight forward. We’re basically setting values for P4, and Splunk itself (Actually for now, just focus on the P4 values). Now we have our main method:

def main(self):
    try:
      os.environ['P4PORT'] = self.p4port
      chno = self.lowestChangeNumber
      # loop across range of changes
      while chno < self.highestChangeNumber:
        time.sleep(1) # to be nice
        chno =  chno + 1
        f = os.popen(self.p4CLI + " -G describe " + str(chno))
        d = marshal.load(f)
        if (d['code'] == 'error'):
          if d['data'].find('no such changelist') == -1:
            printError(d['data'])
          continue
        else:
          self.outputChangeDescription(d)

    except Exception, oops:
      traceback.print_exc()

What this code does is the following:

Sets the p4port
Loops through the range of numbers (if you want to have the whole repository, you’d start from zero and the latest p4 change number)
Checks for error upon retrieving the p4 change info
if all is good pass it on to the output

The output is relatively simple in that we want to get things into “key=value” pairs. Thus we have two methods. The first method gets the description per p4 change, and the second method then gets the file info (There is a 1 to many relationship between change numbers and files):

  def outputChangeDescription(self, desc):
    fileNum = 0
    while(True):
      if desc.has_key('depotFile' + str(fileNum)):
        self.outputFileChangeDescription(desc, fileNum)
        fileNum += 1
      else:
        break

  def outputFileChangeDescription(self, desc, fileNum):
    timestamp = datetime.fromtimestamp(int(desc['time'])).strftime(self.splunkDateFormat)
    self.out.write(timestamp)
    self.out.write(' change="')
    self.out.write(desc['change'])
    self.out.write('" user="')
    self.out.write(desc['user'])
    self.out.write('" file="')
    filename = desc['depotFile' + str(fileNum)]
    self.out.write(filename)
    self.out.write('" rev="')
    self.out.write(desc['rev' + str(fileNum)])
    self.out.write('" action="')
    self.out.write(desc['action' + str(fileNum)])
    self.out.write('"')
    dot = filename.rfind(".")
    if dot != -1:
      self.out.write(' ext="')
      self.out.write(filename[dot + 1:])
      self.out.write('"')
    self.out.write('\n')
    self.out.flush()

The output comes out like this:

06/23/2006:15:10:31 change="7001" user="zfong" file="//open/lu/dev/farrago/unitsql/optimizer/lcs.ref" rev="32" action
="edit" ext="ref"
06/23/2006:15:10:31 change="7001" user="zfong" file="//open/lu/dev/farrago/unitsql/optimizer/lcs.sql" rev="12" action
="edit" ext="sql"
06/23/2006:15:10:31 change="7001" user="zfong" file="//open/lu/dev/fennel/tuple/TupleData.cpp" rev="9" action="edit"
ext="cpp"
06/23/2006:15:11:20 change="7002" user="jack" file="//open/dt/dev/fennel/disruptivetech/calc/ExtWinAggFuncs.cpp" rev=
"4" action="edit" ext="cpp"
06/23/2006:15:11:20 change="7002" user="jack" file="//open/dt/dev/fennel/disruptivetech/test/CalcExtWinAggFuncTest.cp
p" rev="5" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/farrago/NativeMethods_lu.cpp"
rev="35" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxAggExecStream.cpp"
rev="7" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxAggExecStream.h" re
v="4" action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxHashBase.h" rev="4"
 action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxHashGenerator.h" re
v="4" action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxJoinExecStream.cpp"
 rev="14" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxJoinExecStream.h" r
ev="12" action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxPartition.cpp" rev=
"6" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/hashexe/LhxPartition.h" rev="5
" action="edit" ext="h"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/test/LhxAggExecStreamTest.cpp"
 rev="3" action="edit" ext="cpp"
06/23/2006:23:47:26 change="7003" user="rchen" file="//open/lu/dev_lcs/fennel/lucidera/test/LhxJoinExecStreamTest.cpp
" rev="8" action="edit" ext="cpp"
06/24/2006:10:59:11 change="7004" user="bchow" file="//open/mondrian/testsrc/main/mondrian/test/DrillThroughTest.java
" rev="5" action="edit" ext="java"

You’ll notice that each “event” is at a file change level of granularity. I’ll show later that it’s easy to tease out aggregates at the change set level, rather than the file level.

In developing the script, there’s lots of debugging and you can’t simply run the script on its own, since it may have dependencies on the calling process (in this case Splunk). So one key trick to developing any scripted input is the following Splunk commandline:

./splunk cmd python ../etc/apps/search/bin/P4Simple.py

If it outputs successfully to standard out, then you are ready to Splunk that data. Before learning this command, trying to debug a scripted input was quite hellish.

In order to register the script for Splunk to run, you need to make two changes: register the script, and create an index for the data to go to.

To add a stanza in inputs.conf, you can simply copy etc/apps/search/default/inputs.conf to etc/apps/search/local/inputs.conf and add the following:

[script://./bin/P4Simple.py]
interval = -1
sourcetype = p4-sample
source = eigenbase-p4proxy.splunk.com:1666
index = p4-sample
disabled = false

Note that the above “interval = -1” indicates that the script will be executed a single time upon startup. (which also implies if you restart, it’ll be re-executed). I mostly did this for simplicity.

And from Manager, create a new index called “p4-sample”. If you look in your indexes.conf file (in etc/system/local), you’ll see a stanza:

[p4-sample]
coldPath = $SPLUNK_DB/p4-sample/colddb
homePath = $SPLUNK_DB/p4-sample/db
thawedPath = $SPLUNK_DB/p4-sample/thaweddb

One thing to note is why I bothered to define another index: this allows you to blow away the data and not affect anything else you may have been working on. If you mess up, just type:

./splunk clean eventdata p4-sample

With that, you can do a simple search of “index=p4-sample” over all time, and get the following chart:

index=p4-sample

Files Changed Over Time

The spikes are very pronounced, probably due to mass updates of files at certain times. You can get the change info granularity by using the “dedup” command, and enter this in the search bar: “index=p4-sample | dedup change”, which yields a more informative graph:

index=p4-sample | dedup change

Changes Over Time

You can also do a top 10 report very simply. Try the following command:

 index=p4-sample | dedup change | top 10 user

Top Ten Submitters

This sample is of course a toy sample, since as I said, it’ll reindex the range of changes at each startup. However, from some very simple search strings, you immediately get visibility into what’s going on.

I’ll show in the next blogs how to retrieve at startup time, and then incrementally update as further changes are made to the repository. I’ll also show how to do what’s known as “field extractions” to give you additional fields that can be searched upon.

Credits: ElinRydberg, PatrickCalahan for scripting help, JohnSichi for the Eigenbase data, JamesTerry, GregAlbrecht for the EC2 P4 proxy.

----------------------------------------------------
Thanks!
Boris Chen

Splunk

The world’s leading organizations trust Splunk to help keep their digital systems secure and reliable. Our software solutions and services help to prevent major issues, absorb shocks and accelerate transformation. Learn what Splunk does and why customers choose Splunk.

Tips & Tricks 3 Min Read

Making SNMP Simpler

Tips & Tricks 5 Min Read

Scale Your Splunk Cloud Operations With The Splunk Content Manager App

To address the absence of a solution that simplifies the administration of Splunk instances and aligns with Splunk's ultimate goal of focusing on generating value from data, we created the App Content Manager for Splunk.

Tips & Tricks 1 Min Read

Shiny Splunk Cake

Set your Shiny Splunk Cake by using our new Deployment Map for customers who may not have the time or resources to create their own.

About Splunk

The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.

Founded in 2003, Splunk is a global company — with over 7,500 employees, Splunkers have received over 1,020 patents to date and availability in 21 regions around the world — and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.

Learn more about Splunk

Splunk is a Verb – Splunking Perforce Data

Related Articles

Making SNMP Simpler

Scale Your Splunk Cloud Operations With The Splunk Content Manager App

Shiny Splunk Cake

About Splunk

Subscribe to our blog

Connect with Splunk on X

Connect with Splunk on Instagram