Splunk is a Verb – Applying Splunk to Source Code Repository Activity

It’s become a thing here: whenever there is some data that needs analysis, the immediate reaction is “to splunk it”. Using Splunk as a verb comes naturally once you realize that Splunk is not just for log analysis, but a means to understand a variety of time-based datasets. Let me repeat: Splunk isn’t just for logs.

Data is all around us these days, and most of the time we take it for granted. In the business world, however, that is rarely the case: the first thing people do is analyze data (which accounts for how much software is dedicated to business applications and analysis). In software engineering, we tend to operate with a different mindset. The business world has Business Objects, Cognos, and the like; what are the equivalents for analyzing software engineering data? None come to mind. On the upside, even though there is no single standard system, software engineering relies on a number of tools, and each of them represents a data source, an untapped treasure of information. Getting at that information, however, isn’t a trivial task.

As an example, let’s analyze the activity of a source control repository, which (one would hope) every software engineering group will have.

I’d like to answer questions of this sort:

  • who’s changing files?
  • what kinds of files are being changed?
  • what branches are most active?
  • what types of activities are occurring on a branch?
  • what actions are engineers performing (adding, editing, deleting, merging, etc.)?
  • who is changing what kind of files?

And do this for varying windows of time.

An RDBMS is a common way to do data analysis. However, getting from data source to analysis can be a long road. I’d have to define a schema, write SQL to create the tables, write scripts to query the source repository and convert its output into some digestible format (say, CSV), then bulk load it into the database (heaven forbid some varchar fields are too small and values get truncated, or I get some of the parsing wrong and have crap in columns that I then need to repair). Further, if I want to change my schema, I may have to reimport everything or write more SQL to reorganize the data. Then I’d need to figure out how to get the data out into nice reports and choose a means by which to report; the number of available reporting tools is kind of overwhelming.
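
To make that road concrete, here is a minimal sketch of the RDBMS approach using SQLite. The schema, field names, and sample rows are illustrative assumptions, not an actual Perforce schema:

```python
import sqlite3

# Hand-rolled schema for changelist data. Every VARCHAR width is a guess
# that truncation will later punish.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE changes (
        changelist INTEGER PRIMARY KEY,
        author     VARCHAR(32),
        branch     VARCHAR(64),
        action     VARCHAR(16),
        changed_at TIMESTAMP
    )
""")

# Bulk-load rows that some other script has already scraped from the
# source control system and parsed into a digestible format.
rows = [
    (1001, "alice", "main", "edit",   "2009-06-01 10:15:00"),
    (1002, "bob",   "rel1", "add",    "2009-06-01 11:02:00"),
    (1003, "alice", "main", "delete", "2009-06-02 09:40:00"),
]
conn.executemany("INSERT INTO changes VALUES (?, ?, ?, ?, ?)", rows)

# Then write SQL per question -- e.g. who's changing files:
for author, count in conn.execute(
        "SELECT author, COUNT(*) FROM changes "
        "GROUP BY author ORDER BY 2 DESC"):
    print(author, count)
```

And that still leaves re-running the whole pipeline every time the schema changes, plus building the reporting layer on top.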

I could do that.

Or I could just splunk it, and avoid all that headache.

At Splunk we use Perforce, which has a simple command-line interface that can give you all the information about the changes happening in the system. But it’s not a log file! Correct. There’s no SQL! Right again! So how will it be splunked?
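
As a taste of the raw material, here is a sketch of the kind of plain-text, time-stamped output `p4 changes` produces, and why it reads like event data. The sample lines and parsing code are mine, not Splunk’s, and assume Perforce’s default one-line-per-changelist format:

```python
import re

# Made-up sample lines in the shape of default `p4 changes` output:
# one changelist per line, each with a date and a user.
sample = """\
Change 1003 on 2009/06/02 by alice@alice-ws 'Remove dead configuration code'
Change 1002 on 2009/06/01 by bob@bob-laptop 'Add branch merge helper script'
Change 1001 on 2009/06/01 by alice@alice-ws 'Fix off-by-one in log parser'
"""

# Each line carries a timestamp, an actor, and a description --
# exactly the structure of an event, even though it's not a log file.
CHANGE_RE = re.compile(
    r"Change (?P<change>\d+) on (?P<date>\S+) "
    r"by (?P<user>[^@]+)@(?P<client>\S+) '(?P<desc>.*)'"
)

events = [m.groupdict() for m in map(CHANGE_RE.match, sample.splitlines()) if m]
for e in events:
    print(e["date"], e["user"], e["desc"])
```

No schema, no tables: the text itself already contains the fields the questions above need.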

We’ll start with that in the next blog posting.

Boris Chen
