Update 9/27/16: As of Sept. 27, 2016, Hunk functionality has been incorporated into the Splunk Analytics for Hadoop Add-On and Splunk Enterprise versions 6.5 and later.
This year was my first .conf, and it was an amazingly fun experience! During the keynote, we announced a number of new Hunk features, one of which was the Splunk Archive Bucket Reader. This tool lets you read Splunk raw data journal files from any Hadoop application that allows the user to configure which InputFormat implementation is used. In particular, if you are using Hunk archiving to copy your indexes onto HDFS, you can now query and analyze the archived data from those indexes using whatever your organization’s favorite Hadoop applications are (e.g., Hive, Pig, Spark). This will hopefully be the first in a series of posts showing in detail how to integrate with these systems. This post covers some general information about using Archive Bucket Reader, and then discusses how to use it with Hive.
Getting to Know the Splunk Archive Bucket Reader
The Archive Bucket Reader is packaged as a Splunk app, and is available for free here.
It provides implementations of Hadoop classes that read Splunk raw data journal files, and make the data available to Hadoop jobs. In particular, it implements an InputFormat and a RecordReader. These will make available any index-time fields contained in a journal file. This usually includes, at a minimum, the original raw text of the event, the host, source, and sourcetype fields, the event timestamp, and the time the event was indexed. It cannot make available search-time fields, as these are not kept in the journal file. More details are available in the online documentation.
Now let’s get started. If you haven’t already, install the app from the link above. If your Hunk user does not have adequate permissions, you may need the assistance of a Hunk administrator for that step.
Log onto Hunk, and look at your home screen. You should see a “Bucket Reader” icon on the left side of the screen. Click on it to open a page of documentation.
Take some time and look around this page. There is lots of good information, including how to configure Archive Bucket Reader to get the fields you want.
Click on the Downloads tab at the top of the page.
There are two links for downloading the jar file you will need. If you are using Hadoop version 2.0 or greater (including any version of YARN), click the second link. Otherwise, click the first link. Either way, your browser will begin downloading the corresponding jar to your computer.
Using Hive with Splunk Archive Bucket Reader
We’ll assume that you already have a working Hive installation. If not, you can find more information about installing and configuring Hive here.
We need to take the jar we downloaded in the last section and make it available to Hive. It needs to be available both to the local client and on the Hadoop cluster where our commands will be executed. The easiest way to do this is to use the “auxpath” argument when starting Hive, passing the path to the jar file. For example:
hive --auxpath /home/hive/splunk-bucket-reader-2.0beta.jar
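If restarting the Hive command line with --auxpath is inconvenient, Hive also supports registering a jar from within a running session with its ADD JAR command. A minimal sketch, assuming the same jar path as in the example above (adjust it to wherever you saved the download):

```sql
-- Register the Archive Bucket Reader jar for the current Hive session.
-- The path is the same example path used above; yours will differ.
ADD JAR /home/hive/splunk-bucket-reader-2.0beta.jar;

-- Confirm the jar is on the session's resource list.
LIST JARS;
```

Note that ADD JAR applies only to the current session, whereas --auxpath (or the hive.aux.jars.path setting) applies every time you start the client.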
If you forget this step, you may get class-not-found errors in the following steps. Now let’s create a Hive table backed by a journal.gz file. Enter the following into your Hive command line:
CREATE EXTERNAL TABLE splunk_event_table (
  Time DATE,
  Host STRING,
  Source STRING,
  date_wday STRING,
  date_mday INT
)
ROW FORMAT SERDE 'com.splunk.journal.hive.JournalSerDe'
WITH SERDEPROPERTIES (
  "com.splunk.journal.hadoop.value_format" = "_time,host,source,date_wday,date_mday"
)
STORED AS
  INPUTFORMAT 'com.splunk.journal.hadoop.mapred.JournalInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/hive/user_data';
If this succeeded, you should see something like this:
OK
Time taken: 0.595 seconds
Let’s look at a few features of this “create table” statement.
STORED AS INPUTFORMAT 'com.splunk.journal.hadoop.mapred.JournalInputFormat'
tells Hive that we want to use the JournalInputFormat class to read the data files. This class is located in the jar file that we told Hive about when we started the command-line. Note the use of “mapred” instead of “mapreduce”—Hive requires “old-style” Hadoop InputFormat classes, instead of new-style. Both are available in the jar.
ROW FORMAT SERDE 'com.splunk.journal.hive.JournalSerDe'
WITH SERDEPROPERTIES (
  "com.splunk.journal.hadoop.value_format" = "_time,host,source,date_wday,date_mday"
)
tells Hive which fields we want to pull from the journal files for use in the table. See the app documentation for more detail about which fields are available. Note that we are invoking another class from the Archive Bucket Reader jar, JournalSerDe. “SerDe” stands for serializer-deserializer.
(Time DATE, Host STRING, Source STRING, date_wday STRING, date_mday INT)
tells Hive how we want the columns to be presented to the user. Note that there are the same number of columns here as in the SERDEPROPERTIES clause. This section could be left out altogether, in which case each field would be treated as a string and would keep the name it has in the journal file, e.g. _time as a string instead of Time as a date.
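To make that last point concrete, here is a sketch of what the minimal form might look like with the column list omitted. The table name splunk_event_table_raw is hypothetical; the SerDe, InputFormat, and location are the same as in the statement above. Per the behavior described above, every field would then arrive as a STRING under its journal-file name (e.g. _time rather than Time):

```sql
-- Minimal variant: no explicit column list, so each field keeps its
-- journal-file name and is presented as a STRING.
CREATE EXTERNAL TABLE splunk_event_table_raw
ROW FORMAT SERDE 'com.splunk.journal.hive.JournalSerDe'
WITH SERDEPROPERTIES (
  "com.splunk.journal.hadoop.value_format" = "_time,host,source,date_wday,date_mday"
)
STORED AS
  INPUTFORMAT 'com.splunk.journal.hadoop.mapred.JournalInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/hive/user_data';
```

The explicit column list in the earlier statement is usually preferable, since it gives you friendlier names and real types (DATE, INT) instead of all strings.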
Now that you have a Hive table backed by a Splunk journal file, let’s practice using it. Try the following queries:
select * from splunk_event_table limit 10;
select host, count(*) from splunk_event_table group by host;
select min(time) from splunk_event_table;
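Once those work, you can combine the columns in ordinary HiveQL. A hedged example, assuming your archived events include weekday values in Splunk's usual lowercase form (e.g. 'monday'):

```sql
-- Count events per host for one weekday, using the columns
-- defined in the CREATE TABLE statement above.
SELECT host, COUNT(*) AS events
FROM splunk_event_table
WHERE date_wday = 'monday'
GROUP BY host
ORDER BY events DESC;
```

From Hive's point of view this is just another table, so joins, views, and partition-style filtering all work the same way they would on any other external table.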
Hopefully that’s enough to get you started. Happy analyzing!
----------------------------------------------------
Thanks!
Keith Schon