I just went through a quick exercise to see how I might get data from a system using collectd into Splunk. In fact, it was so quick, I saved enough time to write a blog post about it. Anyone who’s messed with collectd will already be very familiar with RRDtool and RRD databases. If that’s where your thought process went when you saw the title of this blog post: stop. Once the data is in RRD, it’s already stale. Nothing wrong with RRD, it’s great stuff, but tossing things there and then going through a batch process to export from there into Splunk is wholly unnecessary. (On the other hand, if you’ve already got a lot of RRD files, there’s an app for that.)
So, let’s skip the RRD technique and move on to a way to stream data straight from collectd into Splunk. Today I came across a tool called graphite, which is what got me thinking about this collectd question in the first place. It turns out that collectd has a plugin to write data out to graphite. Guess what? You can use that same plugin to move data into Splunk without any additional customization, beyond creating a knowledge object or three in Splunk.
The process looks like this:
- Configure collectd to use the write_graphite output plugin
- Point the “host” field in the plugin to your Splunk indexer, and pick an open TCP port
- Create a new TCP input in Splunk, matching the port number to the one you chose in the previous step.
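Concretely, the steps above might look like the following sketch. The hostname, port number, and sourcetype name here are placeholders I chose for illustration; substitute your own. On the collectd side:

```
# collectd.conf -- stream metrics to Splunk using the graphite plaintext protocol
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "splunk">
    Host "splunk.example.com"   # your Splunk indexer
    Port "2003"                 # any open TCP port; must match the Splunk input below
    Protocol "tcp"
  </Node>
</Plugin>
```

And the matching TCP input on the Splunk side:

```
# inputs.conf -- listen on the same TCP port and assign a custom sourcetype
[tcp://2003]
sourcetype = graphite_plaintext
```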
Once you have that working, you’ll need to do a minimal amount of sourcetyping and field extraction. The graphite “plaintext format” looks like this:
<metric path> <metric value> <metric timestamp>
And some sample data from the collectd plugin page looks like this:
myhost_example_com.cpu-2.cpu-idle 98.6103 1329168255
myhost_example_com.cpu-2.cpu-nice 0 1329168255
myhost_example_com.cpu-2.cpu-user 0.800076 1329168255
I chose to put this data in a file and use the data preview feature to see if the timestamp would be picked up automatically, and it was. Other than creating a custom sourcetype name for this new data type, no props.conf settings were required.
Next, I took a few minutes to work up some field extractions using IFX, which produced this regex (I didn’t tune it for efficiency, but it works):
^(?P<collectd_host>[^\.]+)[^\.\n]*\.(?P<object>[^\.]+)\.(?P<metric>[^ ]+)\s+(?P<value>[^ ]+)
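If you want to sanity-check that regex outside of Splunk, a quick script like this (hypothetical, not part of the original post) runs it against a sample record from the collectd plugin page:

```python
import re

# Regex produced by Splunk's interactive field extractor (IFX)
PATTERN = re.compile(
    r"^(?P<collectd_host>[^\.]+)[^\.\n]*\.(?P<object>[^\.]+)\.(?P<metric>[^ ]+)\s+(?P<value>[^ ]+)"
)

# One line of graphite plaintext data: <metric path> <metric value> <metric timestamp>
sample = "myhost_example_com.cpu-2.cpu-idle 98.6103 1329168255"

m = PATTERN.match(sample)
print(m.group("collectd_host"))  # myhost_example_com
print(m.group("object"))         # cpu-2
print(m.group("metric"))         # cpu-idle
print(m.group("value"))          # 98.6103
```

Note that the first capture group stops at the first period, which is why the dots-to-underscores behavior discussed below matters for getting a usable hostname.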
That’s pretty much all it takes! However, there is one more thing I’d suggest if you were to do this in production. Periods are delimiters in this event format, so by default the dots in a hostname are replaced by underscores. It would be a good idea to use props/transforms to restore them so that there is a reliable hostname field to search on, rather than the eval command I used in the screenshot below. Additionally, you may decide that the hostname in each record should be “the” host field that Splunk ties to every event at index time. We explain how you can do this in the docs: Set host values based on event data.
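A minimal sketch of that index-time host override might look like this (the stanza name `graphite_set_host` and the sourcetype `graphite_plaintext` are placeholders; see the docs page above for the authoritative details). In transforms.conf:

```
# transforms.conf -- use the leading component of the metric path as the event's host
[graphite_set_host]
REGEX = ^([^\.\s]+)
FORMAT = host::$1
DEST_KEY = MetaData:Host
```

And wired up to the sourcetype in props.conf:

```
# props.conf
[graphite_plaintext]
TRANSFORMS-sethost = graphite_set_host
```

This sets host to the underscored form from the raw event; if you also want the dotted form as a searchable field, a search-time calculated field such as `EVAL-collectd_hostname = replace(collectd_host, "_", ".")` in the same props.conf stanza is one way to do it.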
The end result looks like this: