November 09, 2011

Splunkgit – Github just got Splunked! (Part 1/4)

By Splunk

This is the first part in a four-part series in which Emre and Petter cover their Splunk app, Splunkgit. The Splunk app is available for download on splunkbase here, and it is also on github here.

Knowledge is power ~ Sir Francis Bacon

I believe this to be true. Since data is the knowledge of the digital age, Splunk is all one needs to have power. For most of the readers of this blog, it’s no news that Splunk is a very powerful piece of software, but I just discovered this. My name is Emre Berge Ergenekon and this is my first blog post. I’m a computer science student at the Royal Institute of Technology in Stockholm, Sweden, doing my master’s thesis at Splunk. Together with Petter, our first assignment was to create a Splunk app. This turned out to be really fun. Thanks to Boris, who gave us a great idea, we were able to create something really cool.

We Splunked github!

Our goal was to visualize data that is retrievable through github. We wanted to present it in a way that helps developers get a quick overview of a repository and analyze its status. However, we didn’t want to tie the app to github, so it is also useful for non-github projects. We have therefore designed our Splunk app to keep the github code and the git repository code separate.

The scripts retrieve data from two sources: the github API and the git repository logs.

Github

Getting data from github was easy with their v3 API. With python, we were able to fetch information about the issues, watchers and forks of a repository. You can get the list of watchers of a github repo with a simple curl call (don’t forget to substitute user-name and repo-name):

curl https://api.github.com/repos/<user-name>/<repo-name>/watchers

In a similar way you can get info about the forks:

curl https://api.github.com/repos/<user-name>/<repo-name>/forks

Most publicly visible data can be retrieved through the API without authentication.
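
The same calls can be made from python. The following is only a minimal sketch using the standard library, not the code from the app itself; the helper names `repo_api_url` and `fetch_json` are made up for this illustration.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def repo_api_url(user_name, repo_name, resource):
    """Build the URL of a repository sub-resource such as 'watchers' or 'forks'."""
    return "{}/repos/{}/{}/{}".format(API_ROOT, user_name, repo_name, resource)


def fetch_json(url):
    """Send an unauthenticated GET request and decode the JSON response body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs network access; substitute a real user name and repo name):
# watchers = fetch_json(repo_api_url("<user-name>", "<repo-name>", "watchers"))
# forks = fetch_json(repo_api_url("<user-name>", "<repo-name>", "forks"))
```

Keeping the URL construction in its own function makes it easy to check the request target without touching the network.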

Some important aspects of the API

HTTPS
All API access is over HTTPS. The requests are always made against the api.github.com domain.
JSON
The data returned by the API is always in JSON format, and all data sent in requests has to be JSON as well.
Rate Limit
You can make at most 5000 requests per hour (it’s possible to get your application whitelisted for more requests). The limit and the remaining count are present in the response headers.
```
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4711
```
Pagination
If a request returns multiple items, the number of items returned is limited to 30 per request. This is only the default value, though: with the ?per_page parameter it can be raised to as much as 100.

In the requests we make, this value is always set to 100. This way we minimize the number of requests needed for collecting information. As an example, let the repo Splunkgit have 230 watchers. To create a list of the watchers you need to make 8 requests with the default pagination value, 30. But with per_page set to 100 you can retrieve the same information with only 3 requests. As you’d expect, this makes the scripts run faster and uses little of your hourly rate limit quota.
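
The request counts in this example are just a rounded-up division. A small sketch of the arithmetic (the function name `requests_needed` is hypothetical):

```python
import math

DEFAULT_PER_PAGE = 30  # page size when ?per_page is not given
MAX_PER_PAGE = 100     # largest page size the API accepts


def requests_needed(item_count, per_page):
    """Number of paginated requests required to list item_count items."""
    return math.ceil(item_count / per_page)


watchers = 230
print(requests_needed(watchers, DEFAULT_PER_PAGE))  # 8
print(requests_needed(watchers, MAX_PER_PAGE))      # 3
```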

To iterate over the pages you can look at the Link header. As an example, the following header is present in a response that has multiple pages:
```
Link: <https://api.github.com/repos?page=3&per_page=100>; rel="next", <https://api.github.com/repos?page=50&per_page=100>; rel="last"
```
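
A header like this one can be split into its rel names and URLs with a short regular expression. This is a sketch of one way to do it, not necessarily how the app handles it; `parse_link_header` is a name chosen for this example.

```python
import re

# Matches one '<url>; rel="name"' entry of a Link header.
LINK_PATTERN = re.compile(r'<([^>]+)>;\s*rel="([^"]+)"')


def parse_link_header(value):
    """Map each rel name ('next', 'last', ...) to its URL."""
    return {rel: url for url, rel in LINK_PATTERN.findall(value)}


header = ('<https://api.github.com/repos?page=3&per_page=100>; rel="next", '
          '<https://api.github.com/repos?page=50&per_page=100>; rel="last"')
links = parse_link_header(header)
print(links["next"])
print(links["last"])
```

A script can keep requesting `links["next"]` until the response no longer contains a "next" entry.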

Issues

So far we have been able to retrieve data about the watcher and fork counts. While this information is great for monitoring the popularity of the repository, what really makes this app useful as a tool is the github issue data. We poll the API for a list of open as well as closed issues. The splunked issue information is as follows:

Issue number
The unique identifier of an issue.
Issue State
Open or Closed.
Comment count
The number of comments on the issue.
Reporter
The github user name of the issue’s reporter.
Title
The title of the issue.
Creation, update and close times
One value for each of the three timestamps available.

The above information is later used to create dashboards for a fast overview:

Newest issues
Latest updated issues
Oldest unclosed issues

And more.
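
As an illustration of how the issue fields listed earlier could be turned into a splunkable line, here is a minimal sketch. The input keys (`number`, `state`, `comments`, `user.login`, `title`, `created_at`, `updated_at`, `closed_at`) are the ones the github API returns for an issue; the output key names and the function `issue_to_event` are assumptions made for this example, and the sample issue is invented.

```python
def issue_to_event(issue):
    """Flatten one issue object from the API into a key="value" line."""
    fields = [
        ("number", issue["number"]),
        ("state", issue["state"]),
        ("comments", issue["comments"]),
        ("reporter", issue["user"]["login"]),
        ("title", issue["title"]),
        ("created_at", issue["created_at"]),
        ("updated_at", issue["updated_at"]),
        ("closed_at", issue["closed_at"]),  # None while the issue is open
    ]
    # Escape double quotes so a title cannot break the key="value" pattern,
    # and leave out fields that have no value.
    return " ".join(
        '{}="{}"'.format(key, str(value).replace('"', '\\"'))
        for key, value in fields
        if value is not None
    )


# Invented sample data, shaped like an issue object from the API.
sample_issue = {
    "number": 7,
    "state": "open",
    "comments": 2,
    "user": {"login": "octocat"},
    "title": "Dashboard shows wrong count",
    "created_at": "2011-11-01T10:00:00Z",
    "updated_at": "2011-11-02T08:30:00Z",
    "closed_at": None,
}
print(issue_to_event(sample_issue))
```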

Where the API fell short

Retrieving all the forks originating from a specified repo is actually a hard task that requires lots of requests. The task requires iterating over all the forks at all levels and building a list of them. In the end, the size of this list is equal to the fork count that you can see on github.

A repository with 200 forks needs 200 requests to build this list. Executing the requests sequentially would take forever; that’s why we used a library called joblib. Joblib makes the requests in parallel using a specified number of jobs.

Example:

from joblib import Parallel, delayed

list_of_returned_values = Parallel(n_jobs=<Number of jobs>)(delayed(feth_list_of_forks)(repo) for repo in list_of_repos)

Result:

list_of_returned_values[0] = feth_list_of_forks(list_of_repos[0])
list_of_returned_values[1] = feth_list_of_forks(list_of_repos[1])
.
.
.
list_of_returned_values[i] = feth_list_of_forks(list_of_repos[i])
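
The traversal over all fork levels described above can be sketched as a level-by-level walk. To keep the example free of third-party dependencies it uses `ThreadPoolExecutor` from the standard library instead of joblib, so it is a stand-in for the approach rather than the app's code. The function `collect_all_forks`, the `fetch_forks` callback and the fake fork tree are all made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_all_forks(root_repo, fetch_forks, max_workers=8):
    """Return every fork below root_repo, at any depth.

    fetch_forks(repo) must return the list of direct forks of repo.
    """
    all_forks = []
    seen = {root_repo}
    current_level = [root_repo]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while current_level:
            next_level = []
            # One lookup per repository on this level, issued in parallel.
            for forks in pool.map(fetch_forks, current_level):
                for fork in forks:
                    if fork not in seen:
                        seen.add(fork)
                        all_forks.append(fork)
                        next_level.append(fork)
            current_level = next_level
    return all_forks


# Stand-in for the API: a small invented fork tree keyed by "owner/repo".
FAKE_FORKS = {
    "a/app": ["b/app", "c/app"],
    "b/app": ["d/app"],
}


def fake_fetch(repo):
    return FAKE_FORKS.get(repo, [])


print(collect_all_forks("a/app", fake_fetch))  # ['b/app', 'c/app', 'd/app']
```

Passing the fetch function in as a parameter keeps the traversal independent of how the forks are actually requested.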

Git log

The git log command is very powerful. Without any need for parsing, you can easily retrieve splunkable information about commits. To format the output of each commit on a single line following the key=value pattern, we used the --pretty=format argument for git log.

The format we used was:

--pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H" parrent_hash="%P" tree_hash="%T"'

As you can see, it’s very simple to use. There are also many more variables available than shown here.
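
To show that a line in this format really is machine-readable as it stands, here is a short sketch that reads one back into a dictionary. Splunk extracts the key=value pairs on its own, so this is only an illustration; the function name, the `commit_time` key and the sample line (including its hashes) are invented.

```python
import re

LINE_PATTERN = re.compile(r'^\[(?P<time>[^\]]+)\]\s+(?P<pairs>.*)$')
PAIR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_log_line(line):
    """Split one formatted git log line into a dict of its fields."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError("unexpected log line: {!r}".format(line))
    event = dict(PAIR_PATTERN.findall(match.group("pairs")))
    event["commit_time"] = match.group("time")  # the %ci timestamp
    return event


# Invented sample line in the format shown above.
sample = ('[2011-11-09 10:15:00 +0100] author_name="Jane Doe" '
          'author_mail="jane@example.com" commit_hash="{}" '
          'parrent_hash="{}" tree_hash="{}"').format("a" * 40, "b" * 40, "c" * 40)

event = parse_log_line(sample)
print(event["commit_time"], event["author_name"])
```

Note that an author name containing a double quote would break this simple pattern.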

Other necessary arguments used were:

--all
Generate the log using all the refs in the refs/ directory. This argument is necessary to retrieve commits from all branches.
--no-merges
Merge commits are ignored, since those commits have multiple parent hashes.
--skip
This argument skips the given number of commits from the log. It is used in conjunction with a splunk search to detect already splunked commits and skip them when retrieving new data. In a shell script, the following line retrieves the number of commits to skip:
```
NUMBER_OF_COMMITS_TO_SKIP=`splunk search "index=splunkgit | stats dc(commit_hash) as commitCount" -auth admin:changeme -app Splunkgit | grep -o -P '[0-9]+'`
```
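
Put together, the arguments described above form a single git log call. The sketch below builds that call as an argument list in python; the function names are hypothetical, the layout of the sample search output is an assumption, and `parse_commit_count` takes only the first number it finds.

```python
import re

PRETTY_FORMAT = ('[%ci] author_name="%an" author_mail="%ae" commit_hash="%H" '
                 'parrent_hash="%P" tree_hash="%T"')


def parse_commit_count(search_output):
    """Return the first integer in the search output, or 0 if there is none."""
    match = re.search(r"[0-9]+", search_output)
    return int(match.group()) if match else 0


def git_log_command(commits_to_skip):
    """Build the git log call as an argument list (no shell quoting needed)."""
    return [
        "git", "log",
        "--all",
        "--no-merges",
        "--skip={}".format(commits_to_skip),
        "--pretty=format:" + PRETTY_FORMAT,
    ]


# Assumed shape of the search output; the real layout may differ.
sample_output = "commitCount\n-----------\n        128\n"
print(git_log_command(parse_commit_count(sample_output)))

# To run it inside the local clone (needs git and an existing repository):
# import subprocess
# subprocess.run(git_log_command(128), cwd="<Directory to save repo>")
```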

Retrieving the git data

As you would expect, there is no way to execute git log on the remote repo itself. The way to go was to make a local clone of the repository:

git clone --mirror <Repo address> <Directory to save repo>

The --mirror argument ensures that all new refs are fetched from the remote. It also implies the --bare argument, that is, no working tree is created for this repository.

The clone operation is only performed when there isn’t a local repository present. Subsequent executions of our git log analyzing script start with a git fetch to receive all the new data.
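
The clone-once, fetch-afterwards logic can be sketched as follows. This is an illustration under the assumption that the presence of the local directory is what decides between the two; `sync_command` is a made-up name, and the function only builds the command instead of running it.

```python
import os


def sync_command(repo_address, local_dir):
    """Return the git command to run: clone on the first run, fetch afterwards."""
    if os.path.isdir(local_dir):
        # A mirror clone already exists, so only the new data is fetched.
        return ["git", "--git-dir", local_dir, "fetch"]
    return ["git", "clone", "--mirror", repo_address, local_dir]


# Prints a clone command as long as ./repo-mirror does not exist yet.
print(sync_command("<Repo address>", "repo-mirror"))
```

The returned list can be handed to `subprocess.run` before the git log step.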

Splunk MAX_DAYS_AGO property

Splunk has a property called MAX_DAYS_AGO. This property specifies the oldest date to accept when retrieving data. Events with dates older than MAX_DAYS_AGO will be shown/searched using the current date. To avoid this we put the following text in the props.conf file:


MAX_DAYS_AGO=10000

This tells Splunk that all data that has a name starting with git can be as old as 10000 days, which is about 27 years.
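
For reference, the conversion from days to years and the resulting cut-off date can be checked with a few lines of python. The function name `oldest_accepted` is made up; the date used is the publication date of this post.

```python
from datetime import date, timedelta

MAX_DAYS_AGO = 10000

# Average year length including leap years.
years = MAX_DAYS_AGO / 365.25
print(round(years, 1))  # 27.4


def oldest_accepted(today):
    """Oldest commit date that is still accepted on the given day."""
    return today - timedelta(days=MAX_DAYS_AGO)


print(oldest_accepted(date(2011, 11, 9)))  # 1984-06-23
```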

Next: Retrieving information about committed files.

----------------------------------------------------
Thanks!
Emre Berge Ergenekon
