Hunk Preprocessors: How to DIY

In the previous blog post on image searching with Splunk, I showed you how you can preprocess data with Hunk to get the ability to Splunk any data. This blog post is all about how to do it yourself.


Before we start, here are links to the code for the image preprocessor demo:

The first link has all the preprocessor code and the second link has the code for making the sweet image UI. You can look at it before, while and/or after reading the rest of the blog post. Enjoy!


A Hunk preprocessor is basically just a Hadoop RecordReader<K, V>, where K is irrelevant and V is Text. We provide a base class that you should extend; implement a few methods on top of the RecordReader interface and you’re done! The base class’ full name is

Foreground – It’s just an iterator

Implementing your own preprocessor is very similar to writing a java.util.Iterator<E>:

  1. new Iterator<E>() { … – Initialize the iterator.
  2. boolean hasNext() – Return true if there are more values.
  3. E next() – Return the next value.
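
For comparison, here is a minimal sketch of those three steps as a plain Java iterator. LineIterator and LineIteratorDemo are names made up for this illustration; they are not part of Hunk or Hadoop.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical example class (not part of Hunk): iterates over the lines of a string.
class LineIterator implements Iterator<String> {
    private final String[] lines;
    private int index;

    // 1. Initialize the iterator.
    LineIterator(String text) {
        this.lines = text.split("\n");
        this.index = 0;
    }

    // 2. Return true if there are more values.
    @Override
    public boolean hasNext() {
        return index < lines.length;
    }

    // 3. Return the next value and move forward.
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return lines[index++];
    }
}

class LineIteratorDemo {
    public static void main(String[] args) {
        Iterator<String> it = new LineIterator("first\nsecond\nthird");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```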

Here’s the part of the BaseSplunkRecordReader that you’ll find similar to the Iterator:

  • boolean nextKeyValue() – Prepare your next value.
  • Text getCurrentValue() – Return the current value.
  • void vixInitialize – Like a constructor, where you set up your preprocessor.
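
To make the difference from an Iterator concrete, here is a rough sketch of that contract. It is not the real BaseSplunkRecordReader: SketchRecordReader, LineSketchReader and SketchReaderDemo are hypothetical stand-ins, the parameter of vixInitialize is invented, and plain Strings replace Hadoop's Text so the sketch runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the three methods listed above. The real base
// class works with Hadoop types; Strings are an assumption to keep this runnable.
abstract class SketchRecordReader {
    abstract void vixInitialize(String input); // set up, like a constructor
    abstract boolean nextKeyValue();           // prepare the next value
    abstract String getCurrentValue();         // return the prepared value
}

// Hypothetical reader that emits one value per line of its input.
class LineSketchReader extends SketchRecordReader {
    private String[] lines = new String[0];
    private int index = -1;
    private String current;

    @Override
    void vixInitialize(String input) {
        lines = input.isEmpty() ? new String[0] : input.split("\n");
        index = -1;
        current = null;
    }

    @Override
    boolean nextKeyValue() {
        if (index + 1 >= lines.length) {
            return false; // no more data
        }
        index++;
        current = lines[index];
        return true;
    }

    @Override
    String getCurrentValue() {
        return current; // safe to call repeatedly; it does not advance
    }
}

class SketchReaderDemo {
    // Drives a reader the way a caller would: advance first, then read.
    static List<String> readAll(SketchRecordReader reader, String input) {
        reader.vixInitialize(input);
        List<String> values = new ArrayList<>();
        while (reader.nextKeyValue()) {
            values.add(reader.getCurrentValue());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new LineSketchReader(), "event one\nevent two"));
    }
}
```

Unlike Iterator#next(), advancing and reading are split in two: nextKeyValue() both checks for more data and moves forward, while getCurrentValue() only hands back what was prepared.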

This interface is also described in Hadoop’s docs for org.apache.hadoop.mapreduce.RecordReader.

In addition, we have some other methods Hunk will call on your preprocessor:

Method | Description | Required?
------ | ----------- | ---------
getName | Name of the record reader. Needed if you want to reach your record reader from indexes.conf. | Yes
getFilePattern | Regex pattern that your record reader will accept. The return value can be overridden by configuration. | Ish
getCurrentKey | No-op right now. Does nothing. | No
close | Method for closing your resources. | No
getProgress | The current progress of the record reader through its data. Returns a number between 0.0 and 1.0 that is the fraction of the data read. | No
getOutputDataFormat | Return “json” or “xml” if you’ve encoded the value as such. | No
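
As an illustration of the getProgress contract, here is one way to compute the fraction, assuming you can count the bytes you have consumed. ProgressTracker is a hypothetical helper, not part of Hunk, and returning 0.0 for an empty input is just the choice made in this sketch.

```java
// Hypothetical helper (not part of Hunk): tracks how much of an input has been read.
class ProgressTracker {
    private final long totalBytes;
    private long bytesRead;

    ProgressTracker(long totalBytes) {
        this.totalBytes = totalBytes;
    }

    // Call this whenever the reader consumes more input.
    void advance(long bytes) {
        bytesRead += bytes;
    }

    // Fraction of the data read, never above 1.0.
    float getProgress() {
        if (totalBytes <= 0) {
            return 0.0f; // assumption: report no progress for empty input
        }
        return Math.min(1.0f, (float) bytesRead / totalBytes);
    }
}

class ProgressTrackerDemo {
    public static void main(String[] args) {
        ProgressTracker tracker = new ProgressTracker(200);
        tracker.advance(50);
        System.out.println(tracker.getProgress());
    }
}
```

A record reader could keep a tracker like this as a field, call advance from nextKeyValue, and delegate its own getProgress to it.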

Hunk configuration

To use your preprocessor with Hunk, you need to configure the virtual index to point to the .jar file that contains the preprocessor classes, along with all the jars it depends on. You also need to add the full class name (including all packages) to the list of configured record readers. You can also assign one or more regexes to your preprocessor; when a regex matches the path to a file, Hunk uses your preprocessor for that file. This might sound complicated at first, but here’s all you need to do:

    1. Create a jar with your custom record reader.
    2. Add your custom record reader to indexes.conf:

vix.splunk.jars = /absolute/path/to/your/jar,/other/path/to/library/jar #comma separated list with jars
= com.all.packages.and.ClassName #comma separated list with record readers
<name>.regex = #regexes to match your record reader

Where <name> is the String that's returned from your implementation of BaseSplunkRecordReader#getName().
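
Before putting a regex in the configuration, it can help to check it against a few sample paths. The sketch below does that with java.util.regex. ReaderRegexDemo, the regex and the paths are made up for illustration, and using find() (a match anywhere in the path) is an assumption; this post does not say whether Hunk requires the whole path to match.

```java
import java.util.regex.Pattern;

class ReaderRegexDemo {
    // Hypothetical helper: true if the regex matches somewhere in the path.
    // Assumption: partial matches count; Hunk's actual matching may differ.
    static boolean accepts(String regex, String path) {
        return Pattern.compile(regex).matcher(path).find();
    }

    public static void main(String[] args) {
        String imageRegex = "\\.(png|jpe?g)$"; // example regex, not from the post
        System.out.println(accepts(imageRegex, "/data/images/cat.jpeg"));
        System.out.println(accepts(imageRegex, "/data/logs/app.log"));
    }
}
```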

Reusing existing record readers

There are already a lot of Hadoop record readers written for a lot of different data types, and it’s usually really easy to wrap one of them in our BaseSplunkRecordReader. Here’s how you do it:

  • Plug vixInitialize, nextKeyValue and getCurrentValue into the pre-existing record reader.
  • Give the preprocessor a name in getName().
  • Configure the new preprocessor in indexes.conf.
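
The wrapping steps above boil down to plain delegation. The sketch below shows the shape of it with stand-in types: ExistingReader plays the role of a pre-existing record reader and WrappingPreprocessor the role of your BaseSplunkRecordReader subclass. Both classes, their signatures and the name "wrapped" are hypothetical, and Strings replace Hadoop's Text so the sketch runs on its own.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a pre-existing record reader.
class ExistingReader {
    private Iterator<String> records;
    private String current;

    void initialize(List<String> source) {
        records = source.iterator();
    }

    boolean nextKeyValue() {
        if (!records.hasNext()) {
            return false;
        }
        current = records.next();
        return true;
    }

    String getCurrentValue() {
        return current;
    }
}

// Hypothetical stand-in for a preprocessor; every call is forwarded to the delegate.
class WrappingPreprocessor {
    private final ExistingReader delegate = new ExistingReader();

    void vixInitialize(List<String> source) {
        delegate.initialize(source);
    }

    boolean nextKeyValue() {
        return delegate.nextKeyValue();
    }

    String getCurrentValue() {
        return delegate.getCurrentValue();
    }

    // The name used to refer to this preprocessor in the configuration.
    String getName() {
        return "wrapped";
    }
}

class WrappingDemo {
    public static void main(String[] args) {
        WrappingPreprocessor reader = new WrappingPreprocessor();
        reader.vixInitialize(Arrays.asList("record one", "record two"));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentValue());
        }
    }
}
```

In a real wrapper, getCurrentValue would probably also need to convert the wrapped reader's value type into Text.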

That’s it!

And that’s also it for this post on how to write your own preprocessor! I recommend that you take another glance at the image search blog post, now that you have a better understanding of what’s going on. Happy processing!

Posted by Petter Eriksson