Splunkgit – Github just got Splunked! (Part 2/4)

This is the second part of a four-part series in which Petter and Emre cover their Splunk app, Splunkgit. The Splunk app is available for download on splunkbase here, and it is also on github here. You can find part 1 here if you missed it.

Who am I?

Hello there blog reader! As this is my first post, I will do what I am told and introduce myself. My name is Petter Eriksson and I study computer science at the Royal Institute of Technology in Stockholm, Sweden. I am an intern here at Splunk for 6 months, where I will be doing software development and also my master’s thesis. Being an intern here at Splunk has been great so far, and I am really excited to tell you about what my fellow intern, Emre, and I have done. So let’s get started!

What is in this code heavy part 2?

In the previous part of this blog series, we covered the basics of how you could index a git repository into Splunk. The fields we indexed then were:
Time, author name, author email and commit hash
We’ll refer to this as commit info from here on.

In this second part of the series, we will:

  1. Make our git script more advanced by extracting file, insertions and deletions.
  2. Tell you how to avoid duplicated data when using github’s v3 API.

Note: This post is pretty code heavy. Check out parts 3 and 4 if you just want to see our results.

Filling Splunk with more git knowledge

Before we get into code details, we’re going to show you what we’re working with and what we are going to end up with.
Right now, our git script outputs a bunch of lines like this, which is the commit info:

[2006-05-14 09:22:55 +0000] author_name="Junio C Hamano" author_mail="junio@hera.kernel.org" commit_hash="2810e9eed91c429686039ca61781492336ca9410"
[2006-05-13 14:00:16 -0700] author_name="Linus Torvalds" author_mail="torvalds@osdl.org" commit_hash="d14f776402d9f7040cc71ff6e3b992b2e019526a"

We’re going to unfold each commit, revealing all the files for the commits with their insertions and deletions.
Here’s what the result is going to look like:

[2006-05-14 09:22:55 +0000] author_name="Junio C Hamano" author_mail="junio@hera.kernel.org" commit_hash="2810e9eed91c429686039ca61781492336ca9410" insertions="15" deletions="3" path="man1/git-cvsexportcommit.1"
[2006-05-14 09:22:52 +0000] author_name="Junio C Hamano" author_mail="junio@hera.kernel.org" commit_hash="0fd4dbd5f31a6526f88cc4613296a5603a193929" insertions="20" deletions="3" path="git-cvsexportcommit.html"
[2006-05-14 09:22:52 +0000] author_name="Junio C Hamano" author_mail="junio@hera.kernel.org" commit_hash="0fd4dbd5f31a6526f88cc4613296a5603a193929" insertions="8" deletions="1" path="git-cvsexportcommit.txt"
[2006-05-13 14:00:16 -0700] author_name="Linus Torvalds" author_mail="torvalds@osdl.org" commit_hash="d14f776402d9f7040cc71ff6e3b992b2e019526a" insertions="81" deletions="13" path="config.c"
[2006-05-13 14:00:16 -0700] author_name="Linus Torvalds" author_mail="torvalds@osdl.org" commit_hash="d14f776402d9f7040cc71ff6e3b992b2e019526a" insertions="6" deletions="4" path="repo-config.c"
[2006-05-13 14:00:16 -0700] author_name="Linus Torvalds" author_mail="torvalds@osdl.org" commit_hash="d14f776402d9f7040cc71ff6e3b992b2e019526a" insertions="3" deletions="3" path="t/t1300-repo-config.sh"

With this addition to our index, we can build graphs in Splunk showing, for example:

  • Who’s edited a file the most.
  • Which author has edited the most lines.
  • When a certain package or module was created and how its maintenance has gone over time.
  • And more…
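
To make the first two bullets concrete, here is a minimal Python sketch of the kind of aggregation such a graph is built on. It is not part of Splunkgit, and inside Splunk you would write this as a search instead. The names parse_event and lines_changed_per_author are made up for this illustration, and the sample events are copied from the output above.

```python
import re
from collections import Counter

# Matches key="value" pairs as printed by the git script.
FIELD_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_event(line):
    """Return the key="value" fields of one event line as a dict."""
    return dict(FIELD_PATTERN.findall(line))


def _to_int(value):
    # git prints "-" instead of a number for binary files; count those as 0.
    return int(value) if value.isdigit() else 0


def lines_changed_per_author(event_lines):
    """Sum insertions + deletions per author_name over all events."""
    totals = Counter()
    for line in event_lines:
        fields = parse_event(line)
        changed = _to_int(fields.get("insertions", "")) + _to_int(fields.get("deletions", ""))
        totals[fields["author_name"]] += changed
    return totals


SAMPLE_EVENTS = [
    '[2006-05-14 09:22:55 +0000] author_name="Junio C Hamano" author_mail="junio@hera.kernel.org" commit_hash="2810e9eed91c429686039ca61781492336ca9410" insertions="15" deletions="3" path="man1/git-cvsexportcommit.1"',
    '[2006-05-13 14:00:16 -0700] author_name="Linus Torvalds" author_mail="torvalds@osdl.org" commit_hash="d14f776402d9f7040cc71ff6e3b992b2e019526a" insertions="81" deletions="13" path="config.c"',
    '[2006-05-13 14:00:16 -0700] author_name="Linus Torvalds" author_mail="torvalds@osdl.org" commit_hash="d14f776402d9f7040cc71ff6e3b992b2e019526a" insertions="6" deletions="4" path="repo-config.c"',
]

if __name__ == "__main__":
    for author, changed in lines_changed_per_author(SAMPLE_EVENTS).most_common():
        print(author, changed)
```

Because every event carries the commit info as well as the per-file numbers, the same events can be grouped by path instead of author_name to answer the first bullet.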

The first run of the script will take longer with this addition, but since we’re using the --skip flag to git-log, only that first run is slow. Later runs skip the commits that have already been processed.

How to create the more advanced git script

First off we have to iterate over all the commits, which we do by using a part of the command in part 1 of this blog series, but this time we only print the commit hash:

  for commit in `git log --pretty=format:'%H' --all --no-color --no-renames --no-merges --skip=$NUMBER_OF_COMMITS_TO_SKIP`
  do
      # inner loop code goes here
  done

The next part is pretty tricky. We’re going to use git-diff-tree, sed, awk, and tee. If you are unfamiliar with any of those commands, feel free to look them up, but our explanation will hopefully be sufficient for your understanding.

We want to have our commit info in front of each file change. So let’s start off with:

  git diff-tree $commit --pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H"' --numstat

This should look familiar to you from the previous part because we use the same --pretty=format string, except now we pass it to git-diff-tree instead of git-log.

We also added the --numstat flag to git-diff-tree, which prints one tab-separated row per changed file in the commit:

  <insertions> <deletions> <file>

These lines are the ones that we’ll put the commit info in front of. Each of those rows will become one event in Splunk.

We can now use awk, which “…scans each input file for lines that match any of a set of patterns specified…”. This allows us to iterate over all the lines which our git-diff-tree command outputs and edit each line with a pattern of our choosing.

Example output of our git-diff-tree command:

  [2011-11-01 16:18:38 -0700] author_name="Petter Eriksson" author_mail="periksson@splunk.com" commit_hash="eba500f1e6df6e314118cce9c7af47d960949381"
  4	5	bin/fetch_git_repo_data.sh
  2	0	README

As you can see from the example output, the first line has the commit info, and each file change is listed below it. So we start by capturing the first line, and then print it in front of each of the following lines. Our awk command uses the -F option to set the tab character as the field separator, and -v FIRST_LINE=1 to initialize a variable that the awk program can use.

Our resulting awk command is:

  awk -F '\t' -v FIRST_LINE=1 '{if (FIRST_LINE==1) {FIRST_LINE=0;COMMIT_INFO=$0} else {print COMMIT_INFO" insertions=\""$1"\" deletions=\""$2"\" path=\""$3"\""}}'
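
If awk one-liners are hard on your eyes, the same logic can be sketched in Python. This is only an illustration of what the awk program does, not code from our script. The function name prefix_commit_info is hypothetical, blank lines are simply skipped, and the sample input is the example output from above.

```python
def prefix_commit_info(diff_tree_output):
    """Return one event line per file row, each prefixed with the commit info.

    diff_tree_output is the text printed by
    `git diff-tree <commit> --pretty=format:... --numstat`.
    """
    events = []
    commit_info = None
    for line in diff_tree_output.splitlines():
        if not line.strip():
            continue  # ignore blank lines between the header and the rows
        if commit_info is None:
            commit_info = line  # the first non-empty line is the commit info
            continue
        # Each remaining row is <insertions> TAB <deletions> TAB <path>.
        insertions, deletions, path = line.split("\t", 2)
        events.append(
            '{0} insertions="{1}" deletions="{2}" path="{3}"'.format(
                commit_info, insertions, deletions, path
            )
        )
    return events


SAMPLE_OUTPUT = (
    '[2011-11-01 16:18:38 -0700] author_name="Petter Eriksson" '
    'author_mail="periksson@splunk.com" '
    'commit_hash="eba500f1e6df6e314118cce9c7af47d960949381"\n'
    "\n"
    "4\t5\tbin/fetch_git_repo_data.sh\n"
    "2\t0\tREADME\n"
)

if __name__ == "__main__":
    for event in prefix_commit_info(SAMPLE_OUTPUT):
        print(event)
```

Note that a commit without file rows produces an empty list here, just as the awk program prints nothing in that case.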

The --numstat flag generates some extra newlines, so we want to remove those. We do that by piping the output from the previous command to:

  sed '/^$/d'

The sed command above says: “delete every line that matches the pattern ^$”, which means “remove all empty lines”.

So far the inner loop of our iteration over commits looks like this:

  git diff-tree $commit --pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H"' --numstat | sed '/^$/d' | awk -F '\t' -v FIRST_LINE=1 '{if (FIRST_LINE==1) {FIRST_LINE=0;COMMIT_INFO=$0} else {print COMMIT_INFO" insertions=\""$1"\" deletions=\""$2"\" path=\""$3"\""}}'

The problem with what we have so far is that --numstat doesn’t output any lines if no files were changed in the commit, so our awk command prints nothing for that commit. We solve this by saving the output of the pipeline to a file while still passing it on to stdout, so Splunk still gets whatever is printed. This is achieved by piping the output to the tee command, like so:

  git diff-tree $commit --pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H"' --numstat | sed '/^$/d' | awk -F '\t' -v FIRST_LINE=1 '{if (FIRST_LINE==1) {FIRST_LINE=0;COMMIT_INFO=$0} else {print COMMIT_INFO" insertions=\""$1"\" deletions=\""$2"\" path=\""$3"\""}}' | tee $output_file

After the pipeline has run, we can check whether there was any output like this:

  if [ ! -s $output_file ]; then #if there was no output from --numstat
    #do something
  fi

We want to print the commit info in that if statement. Let’s do that:

  if [ ! -s $output_file ]; then #if there was no output from --numstat
    git show $commit --pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H"' --quiet
    echo ;
  fi

The trailing echo is there because git-show with the --quiet flag does not print a line break. We need the line break to let Splunk know that the event has ended.

Other than all of the above, you want to touch the output_file at the start of each iteration and remove it at the end. Here’s the whole script that we’ve covered:

  output_file=git-commit-formatted.out #temporary gather git-diff-tree output
  #for each commit in the git history
  for commit in `git log --pretty=format:'%H' --all --no-color --no-renames --no-merges --skip=$NUMBER_OF_COMMITS_TO_SKIP`
  do
    touch $output_file

    git diff-tree $commit --pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H"' --numstat | sed '/^$/d' | awk -F '\t' -v FIRST_LINE=1 '{if (FIRST_LINE==1) {FIRST_LINE=0;COMMIT_INFO=$0} else {print COMMIT_INFO" insertions=\""$1"\" deletions=\""$2"\" path=\""$3"\""}}' | tee $output_file

    if [ ! -s $output_file ]; then #if there was no numstat output, just print the commit_info
      git show $commit --pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H"' --quiet
      echo ;
    fi

    rm $output_file #clean up
  done
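
The temporary file exists only so the script can tell whether anything was printed for a commit. As an aside, the same decision can be made in memory. The Python sketch below is an alternative formulation of that step, not the script from our repository. The name events_for_commit and its arguments are hypothetical, and it assumes the commit info and the --numstat rows have already been captured.

```python
def events_for_commit(commit_info, numstat_rows):
    """Return the event lines for one commit.

    commit_info  -- the formatted '[time] author_name=... commit_hash=...' line
    numstat_rows -- list of (insertions, deletions, path) tuples, possibly empty
    """
    if not numstat_rows:
        # No files changed: fall back to the bare commit info, which is
        # what the `git show ... --quiet` branch of the shell script prints.
        return [commit_info]
    return [
        '{0} insertions="{1}" deletions="{2}" path="{3}"'.format(
            commit_info, insertions, deletions, path
        )
        for insertions, deletions, path in numstat_rows
    ]


if __name__ == "__main__":
    info = '[2011-11-01 16:18:38 -0700] author_name="Petter Eriksson"'
    print(events_for_commit(info, []))
    print(events_for_commit(info, [("2", "0", "README")]))
```

Either way, every commit ends up as at least one event, so commits without file changes still show up in Splunk.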

What mama didn’t tell you:

There are some things we left out of the script in this blog post: error handling, and extracting the file type of a file from its path. We didn’t feel they were within the scope of this post. What is covered here should be enough for testing out Splunk with git. You can always look at our github repository if you want the whole script!

Avoid duplication when splunkin’ github v3 API

In part 1 of this blog series, we covered how to get issues from github using their v3 API. Here we’ll tell you how to use Splunk searches to avoid getting duplicated events.

Last time we just fetched all issues from github every time we wanted to update the data in Splunk. To avoid fetching all of them over and over again, we do the following:

  1. Use a Splunk search to find the most recently updated issue.
  2. Get the time that the issue was updated.
  3. Fetch all the issues from github that have been updated since that time.

This is all possible with some neat Python scripting and Splunk’s Python library. More on how to use Splunk’s Python search API here. We use the following from Splunk’s Python client library in our script:

  import time #standard library, needed for time.sleep further down
  import splunk.auth
  import splunk.search as search

Using Splunk to search for the issue

It’s very important to wait for a search to complete when using Splunk’s Python API, because you won’t get any search results if you don’t. Here’s how the search for the last updated issue is done:

  def _search_for_last_updated_issue(self):
    issue_search = search.dispatch('search index=splunkgit sourcetype="github_data" github_issue_update_time=* | sort -str(github_issue_update_time) | head 1')
    while not issue_search.isDone:
      time.sleep(0.5) #sleep for a while
    return issue_search
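
One thing to be aware of is that the loop above waits forever if the search never finishes. Here is a minimal sketch of the same polling with a timeout. It assumes only that the job object exposes a boolean isDone attribute, as used above. The names wait_for_search and FakeJob and the default values are hypothetical, and FakeJob exists only so the sketch runs without Splunk. The sketch is written for Python 3.

```python
import time


def wait_for_search(job, poll_interval=0.5, timeout=60.0):
    """Poll job.isDone until it is true. Return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while not job.isDone:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


class FakeJob(object):
    """Stand-in for a search job that reports done after a number of polls."""

    def __init__(self, polls_until_done):
        self._polls_left = polls_until_done

    @property
    def isDone(self):
        self._polls_left -= 1
        return self._polls_left < 0


if __name__ == "__main__":
    print(wait_for_search(FakeJob(2), poll_interval=0.01, timeout=1.0))
```

A caller can then treat a False return value the same way as an empty search result and fall back to fetching everything.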

Now we’re going to get the update time from the search. First we check whether the search had no results. This can happen either on the very first run, before any issues have been fetched, or if the search is wrong somehow. Otherwise we get the head of the search:

  def _get_update_time_from_search(self, search):
    if len(search) == 0: #no results, nothing has been indexed yet
      return None
    else:
      return self._get_update_time_for_head_of_search(search)

The search events are saved in search.events. You get the first event at index 0. Then you can retrieve a value from the event based on a key, which in our case is ‘github_issue_update_time’.

  def _get_update_time_for_head_of_search(self, search):
    return search.events[0]['github_issue_update_time']

Then we put it all together with another method:

  def time_of_last_updated_issue(self):
    last_updated_issue_search = self._search_for_last_updated_issue()
    return self._get_update_time_from_search(last_updated_issue_search)

To our Python script, GithubAPI.py, we added a method called issues_since, which fetches all the issues updated since a given time. I hope you can see what we did here:

  def issues_since(self, since):
    request_issues_since = 'issues?since={0}'.format(since)
    return self._open_issues(request_issues_since) + self._closed_issues(request_issues_since)

We basically just added the since=<insert_time> parameter to the API call. If you’re not sure what this code means, please check out the previous part about how to do requests to the github v3 API.
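
The github v3 API expects the since value as an ISO 8601 timestamp of the form YYYY-MM-DDTHH:MM:SSZ. If you start from a Python datetime rather than a string, the request string could be built as in the sketch below. The function name issues_since_request and the state argument are assumptions made for this illustration. How our _open_issues and _closed_issues methods consume the string is not shown here. The sketch is written for Python 3.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def issues_since_request(since, state="open"):
    """Build the relative request string for issues updated at or after `since`.

    since -- a timezone-aware datetime
    state -- 'open' or 'closed'
    """
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    # Convert to UTC and format as YYYY-MM-DDTHH:MM:SSZ.
    timestamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # urlencode percent-encodes the colons in the timestamp.
    return "issues?" + urlencode({"state": state, "since": timestamp})


if __name__ == "__main__":
    pacific = timezone(timedelta(hours=-7))
    print(issues_since_request(datetime(2011, 11, 1, 16, 18, 38, tzinfo=pacific)))
```

Converting to UTC before formatting keeps the trailing Z truthful, whatever time zone the original timestamp was in.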

Now we can call these new pieces of code like this from our fetch_github_data.py script:

  splunk_api = SplunkAPI('admin','changeme') #username, password to Splunk
  since = splunk_api.time_of_last_updated_issue()
  if since is None:
    since = '1900-01-01T00:00:01Z' #nothing indexed yet, so fetch everything
  all_issues = github_api.issues_since(since)

Handle the case where the Splunk search doesn’t find any last updated issue time, fetch all the issues from github, and do what you want with the issue data!

But my issues are still replicating like teenagers!

If you’re still having duplicated issues in Splunk, I recommend that you take a look at your search in the method _search_for_last_updated_issue. Run the search that you have specified in the search.dispatch call, inside Splunk – when you have indexed data – and make sure that you get results from the search.

Last minute tip

You can use this Splunk-search-strategy-to-not-get-duplicated-values for github_watcher_count and github_forks_count as well.

  1. Do a Splunk search for your last github_watcher_count or github_forks_count.
  2. Fetch the current count from github.
  3. Only print as an event to Splunk if the results from (1) and (2) differ.
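
Step 3 boils down to a small comparison. Here is a minimal sketch of it. The function name count_event_if_changed is hypothetical, and the returned string only shows the key="value" part of the event. A real event would also carry a timestamp like the other events in this series.

```python
def count_event_if_changed(field_name, last_indexed, current):
    """Return a key="value" string for field_name, or None if the count is unchanged.

    last_indexed -- latest value found by the Splunk search, or None if
                    nothing has been indexed yet
    current      -- value just fetched from github
    """
    if last_indexed is not None and int(last_indexed) == int(current):
        return None  # same count as last time, so printing it would be a duplicate
    return '{0}="{1}"'.format(field_name, int(current))


if __name__ == "__main__":
    print(count_event_if_changed("github_watcher_count", "12", 12))
    print(count_event_if_changed("github_watcher_count", "12", 13))
    print(count_event_if_changed("github_forks_count", None, 3))
```

The int() conversions are there because values coming back from a Splunk search are typically strings, while the count fetched from github is usually a number.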

That was all for this time! If you’ve followed both part 1 and 2, you should have a lot of data in Splunk ready to be explored.
Go on to the next parts to see what we did with our indexed data: part 3 and part 4.

----------------------------------------------------
Thanks!
Petter Eriksson

Posted by Splunk