Hunk: Splunk Analytics for Hadoop Intro – Part 2

Now that you know the basic technology behind Hunk, let’s take a look at some of its features and how they unlock the value of the data resting in Hadoop.

Defining the problem
More and more enterprises are storing massive amounts of data in Hadoop, hoping that someday they will be able to analyze it, gain insight from it, and ultimately see a positive ROI. Since HDFS is a generic filesystem, it can easily store all kinds of data: machine data, images, videos, documents, etc. If you can put it in a file, it can reside in HDFS. However, while storing data in HDFS is relatively straightforward, getting value out of that data is proving to be a daunting task for many. Unlocking the value of the data resting in Hadoop is the primary goal of Hunk.

What customers love about Splunk
So, let’s start with a few things that people love about Splunk. I don’t claim to have a complete list, but here are a few that our customers boast about (in no particular order):

  • Immediate search feedback
  • Ability to process all kinds of data, i.e. late binding schema
  • Ease of setup and rapid time to value

When designing Hunk, we wanted to preserve as many of the things that people love about Splunk as possible, and even add a few more. So, let’s take a look at how we were able to achieve each of those goals.

    Immediate feedback

    Hadoop was designed to be a batch job processing system, i.e. you start a job and do not expect to see any results (except maybe some status reports) for a long time, ranging from tens of minutes to days. I am not going to argue the merits of immediate feedback, but we knew for a fact that anything “batch” was not going to fly with customers already accustomed to Splunk’s immediate feedback. Our first challenge: how can we provide immediate feedback to users when building on top of a system that was designed for the exact opposite?

    Data processing modes

    There are two widely used computation models for data processing:

    1. Move data to the computation unit – yes, this goes completely against what Hadoop stands for, but bear with me. The major disadvantage of this model is low throughput, because of the large network bandwidth required. However, this model also has a very important property: low latency.

    2. Move computation to the data – this model is at the core of MapReduce and is almost the only computation model used on Hadoop. Its major advantage is data locality, leading to high throughput. However, the increase in throughput comes at the cost of latency, hence the batch nature of Hadoop. A MapReduce job (and anything built on top of MapReduce: Pig jobs, Hive queries, etc.) could take tens of seconds to minutes just to set up, let alone get scheduled and executed.

    So, above I’ve described the two ends of the spectrum: low latency with low throughput, and high latency with high throughput. To solve our challenge we need low latency, but we don’t want to give up throughput, i.e. we need low latency and high throughput.

    Now there’s nothing that says that one and only one of the above models can be used at a time. Do you see where I am going? Maybe you’ve already thought of a solution, but here’s ours. In order to give users immediate feedback, we start moving data to the compute unit, also known as a Search Head (we call this streaming mode), and concurrently we start moving computation to the data (a MapReduce job). While the MR job is setting up, starting, and finally producing some results, we display results from the streaming component. Then, as soon as some MapReduce tasks complete, we stop streaming and consume the MR job results. We thus achieve low latency (via streaming) and high throughput (via MapReduce). Who said you can’t have it all? (I’ll leave the costs of this method as an exercise for the reader.)
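
To make the hand-off between the two modes concrete, here is a minimal Python sketch. It is a toy simulation, not Hunk's implementation: the function name, the abstract "tick" time unit, the one-block-per-tick streaming rate, and the choice to replace the streamed preview with the batch output are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def hybrid_search(
    blocks: List[int],
    batch_setup_ticks: int,
    batch_blocks_per_tick: int,
) -> Iterator[Tuple[int, str, int]]:
    """Yield (tick, source, running_total) previews for a sum over `blocks`.

    Streaming handles one block per tick from tick 1 (low latency, low
    throughput). The batch job produces nothing for `batch_setup_ticks`
    ticks, then handles `batch_blocks_per_tick` blocks per tick (high
    latency, high throughput). Both rates are hypothetical.
    """
    tick = 0
    streamed = 0    # blocks pulled to the search head so far
    batch_done = 0  # blocks finished by the batch job so far
    while max(streamed, batch_done) < len(blocks):
        tick += 1
        if tick > batch_setup_ticks:
            batch_done = min(len(blocks), batch_done + batch_blocks_per_tick)
        if batch_done == 0:
            # Batch job is still setting up: keep streaming so the user
            # sees partial results right away.
            streamed = min(len(blocks), streamed + 1)
            yield tick, "streaming", sum(blocks[:streamed])
        else:
            # First batch results are in: stop streaming and show the
            # batch output instead (assumption: the preview is replaced).
            yield tick, "batch", sum(blocks[:batch_done])


if __name__ == "__main__":
    for preview in hybrid_search([1] * 10, batch_setup_ticks=3, batch_blocks_per_tick=4):
        print(preview)
```

In this toy model the user sees a first result at tick 1 and the full answer at tick 6. Streaming alone would need 10 ticks to finish, and the batch job alone would show nothing until tick 4.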

    Late binding schema

    Splunk uses a combination of early and late binding schema. Even though most users care about the flexibility of our search time schema binding, they’re usually unaware that some minimal index time schema is also applied to the data: when Splunk ingests data, it first breaks the data stream into events, performs timestamp extraction, source typing, etc. Both of these schema applications are important and necessary to allow maximum flexibility in the type of data that can be processed by Hunk. However, Hunk could be asked to analyze data that did not end up in Hadoop via Splunk (or Hunk), i.e. data that is either already resting in HDFS or getting there via some other mechanism, e.g. Flume, a custom application, etc. So, in Hunk we’ve implemented truly late binding schema, i.e. all the index time processing as well as all the search time processing is applied at search time. This does not mean that we are creating an index in HDFS; we only apply the index time processing steps. We treat the HDFS data placed in virtual indexes in Hunk as a read-only data source. For those already familiar with Splunk’s index time processing pipeline, the following picture depicts the data flows in Hunk:

    I mentioned that we wanted to preserve all the things that people love about Splunk and maybe even add more. The data processing pipeline is one place where we’ve added something: before data is even processed by Hunk, we allow you to plug in your own data preprocessor. Preprocessors have to be written in Java and get a chance to transform the data before Hunk does. They can vary in complexity from simple translators (say Avro to JSON) to full image/video/document processing.
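
The idea of applying everything at search time, with an optional preprocessing step in front, can be sketched roughly as follows. This is only an illustration in Python: real Hunk preprocessors are Java classes, and the function names, the pipe-delimited input format, the sample records, and the regular expressions below are all hypothetical.

```python
import re
from datetime import datetime
from typing import Callable, Dict, Iterable, Iterator, Optional

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
KV_RE = re.compile(r"(\w+)=(\S+)")

Preprocessor = Callable[[Iterable[str]], Iterator[str]]


def pipe_to_kv(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical preprocessor: 'time|status|uri' records to key=value text."""
    for line in lines:
        ts, status, uri = line.strip().split("|")
        yield f"{ts} status={status} uri={uri}"


def search(
    lines: Iterable[str],
    preprocess: Optional[Preprocessor] = None,
) -> Iterator[Dict[str, object]]:
    """Apply 'index time' and 'search time' steps while reading raw data.

    Nothing is written back: the input is treated as a read-only source.
    """
    records = preprocess(lines) if preprocess else lines
    for raw in records:
        raw = raw.rstrip("\n")
        if not raw:
            continue  # event breaking: one event per non-empty line
        match = TS_RE.search(raw)  # timestamp extraction
        ts = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S") if match else None
        event: Dict[str, object] = {"_raw": raw, "_time": ts}
        event.update(KV_RE.findall(raw))  # field extraction
        yield event


if __name__ == "__main__":
    sample = [
        "2013-07-01 12:00:01|200|/index.html\n",
        "2013-07-01 12:00:05|404|/missing\n",
    ]
    for event in search(sample, preprocess=pipe_to_kv):
        print(event["_time"], event["status"], event["uri"])
```

Because the schema is bound only when a search runs, changing the extraction rules or swapping the preprocessor requires no re-ingestion of the underlying files.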

    Ease of setup and rapid time to value
    As I mentioned at the beginning of this post, most enterprises are having a hard time getting value out of the data stored in Hadoop. So in Hunk we aimed to make setup, installation, and getting started as easy as possible. To this end, setup is as simple as (a) telling us some key info about the Hadoop cluster, such as NameNode/JobTracker host and port, Hadoop client libraries to use when submitting MR jobs, Kerberos credentials, etc., and (b) creating virtual indexes that correspond to data in Hadoop.

    To provide a fast time to value, we chose to allow users to run analytics searches against the data that rests in Hadoop without Hunk ever having seen or preprocessed the data! We don’t want you to have to wait, potentially for days, until Hunk preprocesses the data before you can execute your first search. Some Hunk Beta customers were able to set up Hunk and start running analytics searches against their Hadoop data within minutes of starting the setup process. Yes, I said minutes!

    Continue reading the next post in this series, Hunk: Raw data to analytics in < 60 minutes

    Posted by Ledion Bitincka