Diagramming Splunk’s data-flow

This blog entry is not about how the framework works. It is about a semi-cool visualization that I created using Python and Graphviz. If you watched the video where I presented Splunk’s framework architecture from a high level, you know what pipelines and processors are. If you haven’t, here is a very quick overview.

  • A pipeline is a thread of execution that lives within the splunkd process. Each pipeline executes a series of processors, each of which operates on data. The data is created when the first processor on the pipeline reads it from some input (like tailing a file, or receiving it on a network port). Each processor then does something to the data. Eventually, the data gets indexed and execution returns to the first processor to get more data.
  • Pipelines are connected via queues. A queue output processor (the last processor in a pipeline) puts data onto a queue and blocks if the queue is full. A queue input processor (the first processor at the top of a pipeline) gets the data item from the bottom of the queue and sends it on down the pipeline. If there is no data, it blocks, waiting for some to be put on the queue.
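
If it helps to see the blocking behavior in code, here is a toy sketch of two pipelines joined by a bounded queue. This is not Splunk code. It just uses Python's standard `queue` and `threading` modules, and names like `run_pipeline`, `queue_input` and `SENTINEL` are made up for the illustration.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker so the demo can stop


def run_pipeline(source, processors, sink):
    """Pull each item from source, run it through every processor, hand it to sink."""
    for item in source():
        for proc in processors:
            item = proc(item)
        sink(item)


def queue_input(q):
    """Stand-in for a queue input processor: blocks until data is available."""
    def source():
        while True:
            item = q.get()  # blocks while the queue is empty
            if item is SENTINEL:
                return
            yield item
    return source


def demo():
    q = queue.Queue(maxsize=2)  # small bound so the upstream pipeline can block
    indexed = []                # stand-in for the index

    def file_input():
        # Stand-in for tailing a file or reading a network port.
        yield from ["event a", "event b", "event c"]

    upstream = threading.Thread(
        target=run_pipeline,
        args=(file_input, [str.upper], q.put),  # q.put blocks while the queue is full
    )
    downstream = threading.Thread(
        target=run_pipeline,
        args=(queue_input(q), [lambda s: s + "!"], indexed.append),
    )
    downstream.start()
    upstream.start()
    upstream.join()
    q.put(SENTINEL)
    downstream.join()
    return indexed
```

Calling `demo()` runs both threads to completion and returns the items that reached the fake index, in the order they were read.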

Enough already. Go watch the video. So, I got tired of drawing these diagrams by hand and wrote some code to produce them for me.

I implemented some Python code that takes the composite.xml file, parses it, and produces a .dot file. Composite.xml, for those of you who don’t know, is an amalgamation of all pipelines and processors in the system. It represents the current (or last) runtime environment for Splunk. It lives in $SPLUNK_HOME/var/run/splunk.
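
To give a feel for the XML-to-.dot step, here is a rough sketch. It is not the script from this post, and the XML layout in `SAMPLE` is a simplified guess for illustration only: the real composite.xml may use different element and attribute names. The function name `to_dot` is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified layout. The real composite.xml may differ.
SAMPLE = """
<pipelines>
  <pipeline name="parsing">
    <processor name="readerin"/>
    <processor name="utf8"/>
    <processor name="sendout"/>
  </pipeline>
  <pipeline name="indexing">
    <processor name="indexin"/>
    <processor name="indexer"/>
  </pipeline>
</pipelines>
"""


def to_dot(xml_text):
    """Return DOT text with one cluster per pipeline and an edge between
    each pair of consecutive processors."""
    root = ET.fromstring(xml_text)
    lines = ["digraph pipelines {", "  rankdir=LR;"]
    for i, pipeline in enumerate(root.iter("pipeline")):
        pname = pipeline.get("name")
        lines.append(f"  subgraph cluster_{i} {{")
        lines.append(f'    label="{pname}";')
        # Prefix node names with the pipeline name so they stay unique.
        nodes = [f'"{pname}:{p.get("name")}"' for p in pipeline.iter("processor")]
        for node in nodes:
            lines.append(f"    {node};")
        for src, dst in zip(nodes, nodes[1:]):
            lines.append(f"    {src} -> {dst};")
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)
```

Writing `to_dot(SAMPLE)` to a file gives something Graphviz can lay out. Drawing the queues between pipelines would need the queue names from the processor configuration, which this sketch leaves out.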

I then took the resulting .dot file and ran it through Graphviz. After lots of tweaking, here is what I came up with. Click on the image to see a larger version which is actually readable.
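
For the rendering step, something like the following would do, assuming the Graphviz `dot` executable is on your PATH. The file names and the helper names `dot_command` and `render` are placeholders, and the exact options I ended up with after all the tweaking are not shown here.

```python
import shutil
import subprocess


def dot_command(dot_path, out_path, fmt="png"):
    """Build the Graphviz command line: dot -T<fmt> <input> -o <output>."""
    return ["dot", f"-T{fmt}", dot_path, "-o", out_path]


def render(dot_path, out_path, fmt="png"):
    """Run Graphviz if it is installed. Returns False when `dot` is missing."""
    if shutil.which("dot") is None:
        return False
    subprocess.run(dot_command(dot_path, out_path, fmt), check=True)
    return True
```

For example, `render("pipelines.dot", "pipelines.png")` would write a PNG next to the .dot file. Passing `fmt="svg"` gives a version that stays readable when you zoom in.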

Results (click to enlarge)
Auto-generated pipeline graph

Python Transformation Code

Untar this. It’s only a single Python file, but this blogging software wouldn’t let me upload a .py file.


Future Work

  • Annotate the graph with runtime statistics like average per-processor timing, average queue size, max queue size, etc. This would require looking at the logs.
  • Launch this from Splunk, firing off the Python script along with the metrics data pre-sifted à la Splunk.

Got more ideas? Please post them here.
