Update: now with UI setup instructions
Summary of what we’ll do
1. Set up the environment
2. Configure Hunk
3. Analyze some data
So let’s get started ..
Minutes 0 – 20: Set up the environment
In order to get up and running with Hunk you’ll need the following software packages available/installed on the server running Hunk:
1. Hunk bits – download Hunk and you can play with it free for 60 days
2. JAVA – at least version 1.6 (or whatever is required by the Hadoop client libraries)
3. Hadoop client libraries – you can get these from the Hadoop vendor that you’re using or if you’re using the Apache distro you can fetch them from here
Installing the Hunk bits is pretty straightforward:
# 1. untar the package
> tar -xvf splunk-6.0-<BUILD#>-Linux-x86_64.tgz
# 2. start Splunk
> ./splunk/bin/splunk start
Download Java and the Hadoop client libraries, follow their instructions for installing/updating them, and make sure you keep note of JAVA_HOME and HADOOP_HOME as we’ll need them in the next section.
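Before moving on, it can help to confirm that both locations look sane. The following is a minimal sketch, not something Hunk ships: has_executable is a hypothetical helper, and the two fallback paths are only examples (they match the ones used in the provider stanza later in this post). It assumes a standard layout where Java lives at bin/java and the Hadoop client at bin/hadoop under their respective home directories.

```shell
# Hypothetical helper: succeeds if "$1" is a directory that contains
# the executable "$2" (given relative to that directory).
has_executable() {
    [ -d "$1" ] && [ -x "$1/$2" ]
}

# Example fallback paths only; substitute the locations from your own install.
JAVA_HOME="${JAVA_HOME:-/opt/java/latest}"
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop/hadoop-dev01}"

has_executable "$JAVA_HOME" bin/java || echo "JAVA_HOME looks wrong: $JAVA_HOME"
has_executable "$HADOOP_HOME" bin/hadoop || echo "HADOOP_HOME looks wrong: $HADOOP_HOME"
```

If neither message is printed, both directories contain the binaries this sketch looks for.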
Minutes 20 – 40: Configure Hunk using the UI
Configuring Hunk can be done either (a) through our Manager UI, by going to Settings > Virtual Indexes, or (b) by editing a conf file, indexes.conf. Here we’ll cover both methods, starting with the UI (which is why Minutes 20 – 40 appears twice).
6. The main configuration requirement for a virtual index is a path that points to the data you want the virtual index to represent. You can optionally specify a whitelist regex that matches only the files you want to be part of the index. Also, if the data is partitioned by time, as in my case, you can tell Hunk how the time partitioning is implemented (read this section if you’re interested in how time partitioning works).
You can skip the next section if you’re not interested in learning how to configure Hunk using the conf files.
Minutes 20 – 40: Configure Hunk using the conf files
In this section I’ll walk you through configuring Hunk using the configuration files. We are going to work with the indexes.conf file.
First: we need to tell Hunk about the Hadoop cluster where the data resides and how to communicate with it. In Hunk terminology this is an “External Results Provider” (ERP). The following stanza shows an example of how we define a Hunk ERP.
[provider:hadoop-dev01]
# this exact setting is required
vix.family = hadoop
# location of the Hadoop client libraries and Java
vix.env.HADOOP_HOME = /opt/hadoop/hadoop-dev01
vix.env.JAVA_HOME = /opt/java/latest/
# job tracker and default file system
vix.fs.default.name = hdfs://hadoop-dev01-nn.splunk.com:8020
vix.mapred.job.tracker = hadoop-dev01-jt.splunk.com:8021
# uncomment this line if you're running Hadoop 2.0 with MRv1
#vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-s6.0-h2.0.jar
vix.splunk.home.hdfs = /home/ledion/hunk
vix.splunk.setup.package = /opt/splunkbeta-6.0-171187-Linux-x86_64.tgz
Most of the above configs are self-explanatory, but I will take a few lines to explain some of them:
[provider:hadoop-dev01]: The stanza name must start with “provider:” in order for Hunk to treat it as an ERP. The rest of the string is the name of the provider, so feel free to get more creative than me.
vix.splunk.home.hdfs: This is a path in HDFS (or whatever the default file system is) that you want this Hunk instance to use as its working directory (scratch space).
vix.splunk.setup.package: This is a path on the Hunk server where Hunk can find a Linux x86_64 Hunk package, which will be shipped to and used on the TaskTracker/DataNodes.
Second: we need to define a virtual index which will contain the data that we want to analyze. For this post I’m going to use Apache access log data which is partitioned by date and is stored in HDFS in a directory structure that looks like this:
/home/ledion/data/weblogs/20130628/access.log.gz
/home/ledion/data/weblogs/20130627/access.log.gz
/home/ledion/data/weblogs/20130626/access.log.gz
....
Now, let’s configure a virtual index (in the same indexes.conf file as above) that encapsulates this data:
[hunk]
# name of the provider stanza we defined above
# without the "provider:" prefix
vix.provider = hadoop-dev01
# path to data that this virtual index encapsulates
vix.input.1.path = /home/ledion/data/weblogs/...
vix.input.1.accept = /access\.log\.gz$
vix.input.1.ignore = ^$
# (optional) time range extraction from paths
vix.input.1.et.regex = /home/ledion/data/weblogs/(\d+)
vix.input.1.et.format = yyyyMMdd
vix.input.1.et.offset = 0
vix.input.1.lt.regex = /home/ledion/data/weblogs/(\d+)
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400
There are a number of things to note in the virtual index stanza definition:
vix.input.1.path: Points to a directory under the default file system (e.g. HDFS) of the provider where the data of this virtual index lives. NOTE: the “...” at the end of the path denotes that Hunk should recursively include the content of subdirectories.
vix.input.1.accept and vix.input.1.ignore allow you to specify regular expressions to filter in/out files (based on the full path) that should/not be considered part of this virtual index. Note that ignore takes precedence over accept. In the above example vix.input.1.ignore is not needed, but I included it to illustrate its availability. A common use case for using it is to ignore temporary files, or files that are currently being written to.
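To make the accept/ignore interplay concrete, here is a small sketch that mimics the rule on a list of paths. It is an approximation, not how Hunk itself is implemented: filter_paths is a hypothetical helper, it uses grep -E (POSIX extended regular expressions) rather than Hunk's own regex engine, and the patterns in the usage lines are chosen only for illustration. Dropping ignored paths before testing accept is one way to express that ignore takes precedence.

```shell
# Hypothetical helper: reads paths on stdin and prints the ones that
# would be kept. $1 = accept regex, $2 = ignore regex.
# Paths matching the ignore regex are dropped first, so ignore wins
# even when a path also matches accept.
filter_paths() {
    accept_re=$1
    ignore_re=$2
    grep -Ev -- "$ignore_re" | grep -E -- "$accept_re"
}

# Illustrative patterns: accept anything under an access.log name,
# ignore files that still carry a .tmp suffix.
printf '%s\n' \
    /home/ledion/data/weblogs/20130628/access.log.gz \
    /home/ledion/data/weblogs/20130628/access.log.gz.tmp \
    | filter_paths '/access\.log' '\.tmp$'
```

Only the first path is printed: the .tmp file matches the accept pattern too, but the ignore pattern removes it.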
So far so good, but what the heck is all that “.et/lt” stuff?
Glad you asked! In case you are not familiar with Splunk, time is a first-class concept in Splunk and thus, by extension, in Hunk too. Given that the data is organized in a directory structure using date partitioning (a very common practice), the “.et/lt” settings tell Hunk the time range of data that it can expect to find under a directory. The logic goes like this: match the regular expression against the path, concatenate all the capturing groups, interpret the resulting string using the given format string, and finally add a number of seconds (the offset, which can be negative) to the resulting time. The offset comes in handy when you want to extend the extracted time range to build in some safety, e.g. when a few minutes of a given day end up in the next/previous day’s directory, or when the directory structure and the Hunk server use different timezones. We do the whole time extraction routine twice in order to come up with a time range, i.e. we extract an earliest time (et) and a latest time (lt). When time range extraction is configured, Hunk is able to skip directories/files that fall outside of the search’s time range. In Hunk speak this is known as time-based partition pruning.
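The extraction steps above can be sketched for a single path as follows. This is a rough illustration under stated assumptions, not Hunk's implementation: extract_time is a hypothetical helper, the date is interpreted in UTC, the -d option requires GNU date, and [0-9] stands in for \d because sed's extended regexes do not support \d. With a single capturing group the "concatenate" step is trivial.

```shell
# Sketch of the et/lt extraction for one path (assumes GNU date and UTC).
# $1 = path, $2 = offset in seconds; prints the result as epoch seconds.
extract_time() {
    # Step 1: match the regex and take the capturing group.
    day=$(printf '%s\n' "$1" |
        sed -n -E 's#^.*/home/ledion/data/weblogs/([0-9]+).*$#\1#p')
    [ -n "$day" ] || return 1
    # Step 2: interpret the capture as yyyyMMdd (rewritten to yyyy-MM-dd for date).
    iso=$(printf '%s\n' "$day" | sed -E 's/^(.{4})(.{2})(.{2})$/\1-\2-\3/')
    base=$(date -u -d "$iso" +%s) || return 1
    # Step 3: apply the offset.
    echo $((base + $2))
}

path=/home/ledion/data/weblogs/20130628/access.log.gz
echo "earliest: $(extract_time "$path" 0)"      # et.offset = 0
echo "latest:   $(extract_time "$path" 86400)"  # lt.offset = 86400
```

For this path the two values are exactly one day apart, covering midnight on 2013-06-28 up to midnight on 2013-06-29. A search whose time range does not overlap that window would have no reason to read the 20130628 directory.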
Third: we need to tell Hunk how to schematize the data at search time. At this point we’re entering classic Splunk setup and configuration. In order for Hunk to bind a schema to the data we need to edit another configuration file.
We are going to work with the props.conf file:
priority = 100
sourcetype = access_combined
This stanza tells Hunk to assign the sourcetype access_combined to all the data in our virtual index (i.e. all the data under /home/ledion/data/weblogs/). The access_combined sourcetype is defined in $SPLUNK_HOME/etc/system/default/props.conf and specifies how access log data should be processed (e.g. each event is a single line, where to find the timestamp, and how to extract fields from the raw event).
Minutes 40 – 59: Analyze your Hadoop data
Now we’re ready to start exploring and analyzing our data. We simply run searches against the virtual index as if it were a native Splunk index. I’m going to show two examples, highlighting data exploration and analytics:
1. explore the raw data
2. get a chart showing the status codes over a 30 day window, using daily buckets
index=hunk | timechart span=1d count by status
Minutes 59 – ∞: Keep on Hunking !
There’s an unlimited number of ways to slice, dice and analyze your data with Hunk. Take the latest Hunk bits for a spin – we’d love to get your feedback on how to make Hunk better for you.
Learn more about how you can use Hunk to search images stored in your Hadoop cluster
Stay tuned for another post where I’ll walk you through how to extend the data formats supported by Hunk as well as how to add your own UDFs …