
One of the sexy new features Hunk brings to the Splunk 6 smorgasbord is preprocessing data. Since Hunk is built on top of Hadoop’s MapReduce framework, we can utilize its preprocessing framework. Basically, you can now take any data, write a piece of code that turns it into text, and then search it right where it is stored!
Update: Code is open sourced here!
I’ve created a demo where you can select colors and get images that match the selection. It looks like this:
Image searching in Splunk? How is this possible? Indexing images?
Indexing images, no. Preprocessing at search time. There are no indexing costs.
I do this by searching a set of images stored on HDFS. My preprocessor extracts the color distribution of each image at search time, and my search returns the images sorted by how well they match my colors of choice.
How I did it
There are three parts to doing this:
- The Preprocessor
- Splunk Search
- Splunk 6 UI
The Preprocessor
Preprocessing is done by creating a custom Hadoop Record Reader in Java. Here’s a short description of the preprocessor I made for the image demo:
- Input: Image
- Output: Buckets of color ranges, each with the percentage of the image’s total pixels that fall within the bucket’s color range.
- Output example:
{"colors" : [[[0.12, 4.34, 8.23, ...] ...] ...], "image":"/path"}
You can imagine adding more data, such as geolocation, image size, camera type and faces. I just created this simple preprocessor for the sake of the demo.
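Here’s a minimal sketch of what the core of such a preprocessor could look like. This is not the code from the demo: the class name ColorBuckets, the choice of 8 bins per RGB channel, and the two-decimal formatting are all assumptions made for illustration, and the Hadoop record reader plumbing around it is left out. It only shows the image-to-text step, turning pixels into bucket percentages and printing them in the same shape as the output example above.

```java
import java.awt.image.BufferedImage;
import java.util.Locale;

class ColorBuckets {
    // Assumption: 8 bins per RGB channel (512 buckets). The demo's real layout is not stated.
    static final int BINS = 8;

    /** Returns pct[r][g][b]: the percentage (0-100) of pixels that fall in each RGB bucket. */
    static double[][][] histogram(BufferedImage img) {
        double[][][] pct = new double[BINS][BINS][BINS];
        int width = img.getWidth();
        int height = img.getHeight();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Map each 0-255 channel value to a bin index in 0..BINS-1.
                pct[r * BINS / 256][g * BINS / 256][b * BINS / 256] += 1;
            }
        }
        // Convert raw pixel counts to percentages of the whole image.
        double total = (double) width * height;
        for (double[][] plane : pct) {
            for (double[] row : plane) {
                for (int i = 0; i < row.length; i++) {
                    row[i] = 100.0 * row[i] / total;
                }
            }
        }
        return pct;
    }

    /** Renders one line of JSON text shaped like the output example. */
    static String toJson(double[][][] pct, String imagePath) {
        StringBuilder sb = new StringBuilder("{\"colors\" : [");
        for (int r = 0; r < pct.length; r++) {
            sb.append(r == 0 ? "[" : ", [");
            for (int g = 0; g < pct[r].length; g++) {
                sb.append(g == 0 ? "[" : ", [");
                for (int b = 0; b < pct[r][g].length; b++) {
                    if (b > 0) sb.append(", ");
                    sb.append(String.format(Locale.ROOT, "%.2f", pct[r][g][b]));
                }
                sb.append("]");
            }
            sb.append("]");
        }
        // Assumes imagePath contains no characters that need JSON escaping.
        return sb.append("], \"image\":\"").append(imagePath).append("\"}").toString();
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFF0000); // one red pixel
        img.setRGB(1, 0, 0x0000FF); // one blue pixel
        System.out.println(toJson(histogram(img), "/path"));
    }
}
```

Since the output is plain JSON text, the search side can pull out individual buckets with spath without the preprocessor knowing anything about how they will be scored.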
Splunk Search
I’ve created a search that scores images based on chosen colors and sorts the images by this score.
The first search I created was something like:
search index=images | eval score=color1+color2+…+colorN | sort -score by image
However, if an image has 100% of its pixels in any one of the colors 1 to N, it would get a perfect score of 100, and that’s not great. I want the images that contain many of the selected colors to score higher, so I need to multiply the color values instead of adding them together. This is what the search looks like when I’ve selected two colors:
search index=images AND colors | spath path=colors{1}{1}{7} output=color17 | spath path=colors{1}{2}{8} output=color28 | eval score=(1 + color17)*(1 + color28)*1 | eval score_pct=if(2<2, score, 100*(log(score,3))/log(pow((floor(100/3))+1,3),3)) | stats sum(score_pct) as relevance by image, source | sort -relevance
Notice that I didn’t do anything to my preprocessor when I wanted my search to change. This is great.
And remember, this will spawn a Hadoop MapReduce job. Writing MapReduce jobs in Splunk’s search language is a very pleasant experience.
If you paid close attention to the search, you might have noticed that some things are quite strange [like if(2<2,…)]. This is because I’m generating it with JavaScript, which I’ll briefly go through next.
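To see why multiplying beats adding, here’s a small sketch that redoes the scoring arithmetic from the two-color search outside of Splunk. The class and method names are made up for illustration. The constants 3 and 100 in scorePct are copied as-is from the generated search; how the generator picks them for other numbers of selected colors isn’t covered here, so treat that part as valid for the two-color case only.

```java
class ColorScore {
    /** First attempt: add the selected color percentages together. */
    static double additive(double... colors) {
        double sum = 0;
        for (double c : colors) sum += c;
        return sum;
    }

    /** Second attempt: multiply (1 + color) terms, as in eval score=(1 + color17)*(1 + color28)*1. */
    static double multiplicative(double... colors) {
        double product = 1;
        for (double c : colors) product *= (1 + c);
        return product;
    }

    /** Base-3 logarithm, matching Splunk's log(X,3). */
    static double log3(double x) {
        return Math.log(x) / Math.log(3);
    }

    /**
     * Mirrors the else branch of the score_pct eval in the two-color search:
     * 100*(log(score,3))/log(pow((floor(100/3))+1,3),3).
     * The if(2<2, ...) condition is always false there, so only this branch runs.
     */
    static double scorePct(double score) {
        double denominator = log3(Math.pow(Math.floor(100.0 / 3) + 1, 3));
        return 100 * log3(score) / denominator;
    }

    public static void main(String[] args) {
        // Image A: all pixels in the first color. Image B: evenly split between both colors.
        System.out.println("additive A=" + additive(100, 0) + " B=" + additive(50, 50));
        // Multiplying gives (1+100)*(1+0) = 101 for A and (1+50)*(1+50) = 2601 for B.
        System.out.println("multiplicative A=" + multiplicative(100, 0)
                + " B=" + multiplicative(50, 50));
        System.out.println("score_pct A=" + scorePct(multiplicative(100, 0))
                + " B=" + scorePct(multiplicative(50, 50)));
    }
}
```

With addition, the one-color image and the evenly split image tie at 100. With multiplication, the split image wins by a wide margin, which is the behavior I wanted.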
Splunk 6 UI
I won’t write much about this, because it’s not the main focus of this post and I’m not a UI guy at all, but it has to be mentioned. Creating custom UIs in Splunk 6 is pretty sweet!
Here’s how to do it:
- Create a dashboard
- Convert to HTML/JavaScript
- Write code with your HTML and JavaScript skills.
If you created a dashboard with your search and added one or more charts, you already have a working UI up and running. All you need to do now is extend it, and there are a lot of classes and other UI guck that really help with doing that.
I generated the grid of colors with checkboxes and created a 1×3 grid of divs to hold my images. I then fetch the images from a web service that goes to Hadoop and display them in my divs.
I’ve also implemented infinite scrolling and some front-end caching for performance. I had a ton of fun making this.
The code for this view can be found here: https://gist.github.com/petterik/5f6d95fdd3138188ae16
A lot of it is auto-generated, as you can see on line 269. I did, however, edit some of the lines above it inline without removing the comment.
More Preprocessing
I want to leave you with some more ideas about what you can do with this feature. Here are some other data types you might want to preprocess:
- Packet capture (pcap)
- Music
- Voice
- Video
- Your company’s proprietary file format that’s already on Hadoop
So get hyped and stay tuned for more blog posts on Hunk and preprocessing: how to write a preprocessor yourself, how to make your preprocessor smarter, code examples, and other tips and tricks.
We’re just getting started 😉
----------------------------------------------------
Thanks!
Petter Eriksson