Simple Splunking of HDFS Files

There’s something to be said about the power of command line interfaces. For simple things, they are simple. For complex things, well, maybe not so simple. Fortunately, I have a simple problem: I want to index a single file from a Hadoop Distributed File System (HDFS). To do this, I’ll use the CLI for both Splunk and Hadoop.

There are a few things we want to take into account when we index a file. Normally, indexing a log file in Splunk means creating an input to “monitor” that file. This enables you to not only index the file’s current contents, but also index subsequent appends. However, the contents of an HDFS are typically historical files, so in this case, I don’t want to “monitor” for updates. Instead, I want it indexed as a one-time operation. Additionally, if the file is not transferred entirely (for example, there is some interruption), I want the transfer to restart/retry. If at all possible, I want to avoid duplicate events and extraneous retransmissions. Finally, I want to ensure that Splunk saves the associated HDFS location information into the “source” field (and sets the other relevant metadata information).

We’ll Splunk the HDFS files in two steps:

  1. First, we’ll copy the file from HDFS to a local tmp directory. In case of failure, this lets us retry without leaving partial results in Splunk (as could happen if we streamed it directly into the index).
  2. Second, we’ll use “add oneshot” to index the local tmp file, since we don’t want to define a monitor file input.

Getting Data Out of HDFS

Let’s say I have a file access_log.1 on HDFS in the /tmp directory.

[boris@hiro bin]$ ./hadoop fs -ls /tmp
Found 4 items
drwxr-xr-x   - boris supergroup          0 2012-03-01 23:29 /tmp/_distcp_logs_csf7kx
-rw-r--r--   1 boris supergroup  235749070 2012-03-07 20:25 /tmp/access_log.1
drwxr-xr-x   - boris supergroup          0 2012-03-01 14:55 /tmp/hadoop-boris
-rw-r--r--   1 boris supergroup      64286 2012-03-01 23:29 /tmp/splunkeventdata1330645953427

To copy the file, use distcp with the -update flag, which has rsync-like characteristics.

  • distcp allows parallel data transfer with assurance of transfer.
  • -update ensures that if the file has been copied already, it will not be copied again.
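Because -update makes re-copies cheap and idempotent, the restart/retry behavior I asked for above can be a simple wrapper loop. Here’s a hedged sketch: retry and the stand-in flaky command are mine, and the real hadoop invocation appears only as a comment.

```shell
# Minimal sketch: a retry wrapper around an idempotent copy. Since
# `distcp -update` skips files that already transferred intact, simply
# re-running it on failure is safe. retry is a hypothetical helper.
retry() {
	tries=$1; shift
	n=0
	until "$@"; do
		n=$((n + 1))
		if [ $n -ge $tries ]; then
			echo "giving up after $tries tries"
			return 1
		fi
		echo "retrying ($n)..."
	done
}

# Real usage (the distcp command appears later in this post):
#   retry 3 ./hadoop-1.0.0/bin/hadoop distcp -update \
#       hdfs://localhost:9000//tmp/access_log.1 file:///tmp/hdfs2splunk

# Demo with a stand-in command that fails twice, then succeeds
# (state kept in a temp file so it survives subshells):
STATE=/tmp/retry_demo.$$
echo 0 > $STATE
flaky() {
	c=$((`cat $STATE` + 1))
	echo $c > $STATE
	[ $c -ge 3 ]
}
retry 5 flaky && echo "copy succeeded"
rm -f $STATE
```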

Run the following:

Note: This assumes that your name node is localhost. If it isn’t, specify your name node. And, note those triple slashes after “file:”.

./hadoop-1.0.0/bin/hadoop distcp -update hdfs://localhost:9000//tmp/access_log.1 file:///tmp/hdfs2splunk

This will copy the HDFS file to /tmp/hdfs2splunk/access_log.1.
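Since I want the file transferred entirely before indexing, it’s worth a quick sanity check that the local copy matches the source. A hedged sketch: compare_sizes is a helper of mine, and the commented commands assume the example paths above.

```shell
# Minimal sketch: compare the byte count reported by HDFS with the local
# copy. compare_sizes is a hypothetical helper; the commented lines show
# how it would be fed in a real run.
compare_sizes() {
	if [ "$1" -eq "$2" ]; then
		echo "sizes match"
	else
		echo "size mismatch: $1 vs $2"
	fi
}

# In a real run (requires hadoop and the file copied above; the size is
# the 5th column of the `hadoop fs -ls` listing):
#   HDFS_SIZE=`./hadoop fs -ls /tmp/access_log.1 | awk '/^-/ {print $5}'`
#   LOCAL_SIZE=`wc -c < /tmp/hdfs2splunk/access_log.1`
#   compare_sizes "$HDFS_SIZE" "$LOCAL_SIZE"

# Demo with the byte count from the listing shown earlier:
compare_sizes 235749070 235749070
```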


Now that the data is stored locally, you can use the Splunk CLI to add it to the index as a one-shot input, using parameters to set various attributes of the event data.

% ./splunk/bin/splunk add oneshot /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -host localhost -rename-source hdfs://localhost:9000//tmp/access_log.1 -index testindex

This will:

  • One shot the file /tmp/hdfs2splunk/access_log.1.
  • Set the sourcetype to “hdfsfile”.
  • Set the host to “localhost”, or whatever you specified as your name node.
  • Set the source to the hdfs url.
  • Add the input to the “testindex” index (assuming that the index exists).

That’s it! If you use the search bar in Splunk and select “real-time search (all time)”, you can see the data streaming in.
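You can also spot-check the results from the command line. A hedged sketch that just assembles the query string; the actual invocation is commented out since it needs a running Splunk and credentials, and the index/source values are taken from the example above.

```shell
# Build a search that pulls back a few events from the one-shotted file.
INDEX=testindex
SOURCE='hdfs://localhost:9000//tmp/access_log.1'
QUERY="index=${INDEX} source=\"${SOURCE}\" | head 5"
echo "${QUERY}"

# Real invocation (hedged; assumes default admin credentials):
#   ./splunk/bin/splunk search "${QUERY}" -auth admin:changeme
```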

Also, if you want to check the status, you can use the REST API.


Nom on

One issue I find irritating about “add oneshot” is that it is an asynchronous operation. While this might be good for some things, it’s less preferable for others. Alternatively, there is an (unsupported) option called “nom on” that returns only when Splunk has completed reading the file (though it is still not a transactional operation: when it returns, the data may still be in buffers and not yet on disk).
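If you’d rather stay on supported ground, one way to approximate blocking behavior is to poll until the event count for the source stops growing. A rough sketch, with caveats: wait_until_stable and the stand-in counter are mine, and the real counter would be a splunk search piped through stats count.

```shell
# Poll a counting command until two consecutive readings agree.
wait_until_stable() {
	prev=-1
	while :; do
		cur=`"$@"`
		if [ "$cur" = "$prev" ]; then
			break
		fi
		prev=$cur
		sleep 1	 # polling interval
	done
	echo "stable at $cur events"
}

# In practice the counter would be something like (hedged):
#   count_events() {
#     ./splunk/bin/splunk search 'index=testindex | stats count' -auth admin:changeme
#   }

# Demo with a stand-in counter that grows, then plateaus (state kept in
# a temp file so it survives command substitution):
STATE=/tmp/nom_demo.$$
echo 0 > $STATE
fake_count() {
	c=`cat $STATE`
	c=$((c + 25))
	if [ $c -gt 50 ]; then c=50; fi
	echo $c > $STATE
	echo $c
}
wait_until_stable fake_count
rm -f $STATE
```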

% ./splunk/bin/splunk nom on /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -host localhost -rename-source hdfs://localhost:9000//tmp/access_log.1 -index testindex

Try it out. But, if it doesn’t work, or fails in the future, do not call support (again, it’s unsupported). If it works for you, just be happy and enjoy your few minutes of entertainment. (However, on a serious note, if you like it in concept, or feel a supported call-back mechanism would be useful, I’d be interested in hearing from you.)

Sample Script

The following sample script will accomplish the steps mentioned above. The script takes a file in HDFS as a parameter and indexes it into Splunk. This script requires that you customize several configuration variables in the “UPDATE THIS SECTION” area. Also, this script needs to be run on the same host as Splunk.

#!/bin/sh
# set -x

# This will transfer a single file from Hadoop HDFS to a Splunk index.
# It has several config params. Before you use this, be sure to update them.

# ---------------- UPDATE THIS SECTION ----------------
# hostname of the name node
NAMENODE=localhost
# port of the name node
NAMENODE_PORT=9000
# your splunk home dir
SPLUNK_HOME=/opt/splunk
# your hadoop home dir
HADOOP_HOME=/opt/hadoop-1.0.0
# tmp dir where the files from hdfs will be copied first before oneshotting
TMPDIR=/tmp/hdfs2splunk
# this is the index to put the data in, make sure it exists in Splunk
SPL_INDEX=testindex
# -----------------------------------------------------

SCRIPTNAME=`basename $0`

if [ -z "$1" ]; then
	echo "usage: ${SCRIPTNAME} <file in HDFS>"
	exit 1
fi

# the src file you want to copy
SRCFILE=$1
# the tmp local file that will be "one shotted" into splunk
LOCALFILENAME=${TMPDIR}/`basename "${SRCFILE}"`
# full url to the file in hdfs
HDFSURL=hdfs://${NAMENODE}:${NAMENODE_PORT}/${SRCFILE}

# the following are splunk metadata values for the data being indexed:
# the sourcetype name
SPL_SOURCETYPE=hdfsfile
# host value is being set to the name node host
SPL_HOST=${NAMENODE}
# source is the hdfs url to the file (set via -rename-source below)

if [ ! -d ${TMPDIR} ]; then
	echo "${SCRIPTNAME}: making ${TMPDIR}"
	mkdir -p ${TMPDIR}
fi

echo "${SCRIPTNAME}: Copying file from ${HDFSURL}..."
${HADOOP_HOME}/bin/hadoop distcp -update ${HDFSURL} file://${TMPDIR}
echo "${SCRIPTNAME}: Copied data to ${LOCALFILENAME}"
if [ ! -f ${LOCALFILENAME} ]; then
	echo "${SCRIPTNAME}: ${LOCALFILENAME} not found"
	exit 1
fi
echo "${SCRIPTNAME}: One Shotting to Splunk..."

${SPLUNK_HOME}/bin/splunk add oneshot ${LOCALFILENAME} -sourcetype "${SPL_SOURCETYPE}" -host "${SPL_HOST}" -rename-source "${HDFSURL}" -index "${SPL_INDEX}"

# undocumented "nom on" command
#${SPLUNK_HOME}/bin/splunk nom on ${LOCALFILENAME} -sourcetype "${SPL_SOURCETYPE}" -host "${SPL_HOST}" -rename-source "${HDFSURL}" -index "${SPL_INDEX}"

echo "${SCRIPTNAME}: done"


Here is an example of how to run the script with the access_log.1 file mentioned above.

[boris@hiro hadoopish]$ ./ /tmp/access_log.1

The output will look something like this:

[boris@hiro hadoopish]$ ./ /tmp/access_log.1
Copying file from hdfs://localhost:9000//tmp/access_log.1...
12/03/07 21:14:29 INFO tools.DistCp: srcPaths=[hdfs://localhost:9000/tmp/access_log.1]
12/03/07 21:14:29 INFO tools.DistCp: destPath=file:/tmp/hdfs2splunk
12/03/07 21:14:30 INFO tools.DistCp: sourcePathsCount=1
12/03/07 21:14:30 INFO tools.DistCp: filesToCopyCount=1
12/03/07 21:14:30 INFO tools.DistCp: bytesToCopyCount=224.8m
12/03/07 21:14:30 INFO mapred.JobClient: Running job: job_201203011455_0011
12/03/07 21:14:31 INFO mapred.JobClient:  map 0% reduce 0%
12/03/07 21:14:47 INFO mapred.JobClient:  map 100% reduce 0%
12/03/07 21:14:52 INFO mapred.JobClient: Job complete: job_201203011455_0011
12/03/07 21:14:52 INFO mapred.JobClient: Counters: 21
12/03/07 21:14:52 INFO mapred.JobClient:   Job Counters
12/03/07 21:14:52 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15145
12/03/07 21:14:52 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/07 21:14:52 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/03/07 21:14:52 INFO mapred.JobClient:     Launched map tasks=1
12/03/07 21:14:52 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/03/07 21:14:52 INFO mapred.JobClient:   File Input Format Counters
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes Read=222
12/03/07 21:14:52 INFO mapred.JobClient:   File Output Format Counters
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes Written=8
12/03/07 21:14:52 INFO mapred.JobClient:   FileSystemCounters
12/03/07 21:14:52 INFO mapred.JobClient:     HDFS_BYTES_READ=235749450
12/03/07 21:14:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237613194
12/03/07 21:14:52 INFO mapred.JobClient:   distcp
12/03/07 21:14:52 INFO mapred.JobClient:     Files copied=1
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes copied=235749070
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes expected=235749070
12/03/07 21:14:52 INFO mapred.JobClient:   Map-Reduce Framework
12/03/07 21:14:52 INFO mapred.JobClient:     Map input records=1
12/03/07 21:14:52 INFO mapred.JobClient:     Physical memory (bytes) snapshot=83595264
12/03/07 21:14:52 INFO mapred.JobClient:     Spilled Records=0
12/03/07 21:14:52 INFO mapred.JobClient:     CPU time spent (ms)=3120
12/03/07 21:14:52 INFO mapred.JobClient:     Total committed heap usage (bytes)=124452864
12/03/07 21:14:52 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=558563328
12/03/07 21:14:52 INFO mapred.JobClient:     Map input bytes=122
12/03/07 21:14:52 INFO mapred.JobClient:     Map output records=0
12/03/07 21:14:52 INFO mapred.JobClient:     SPLIT_RAW_BYTES=158
Copied data to /tmp/hdfs2splunk/access_log.1
One Shotting to Splunk...
done
[boris@hiro hadoopish]$

Follow-on items

This script leaves room for improvement, which I’ll leave as an exercise for the reader. Some examples:

  • Copy multiple files or directory trees.
  • Invoke distcp with -update again (just to be sure) before indexing.
  • Clean the tmp area after indexing is complete.
  • Parameterize the Splunk event metadata on the script’s command line.
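For the first item, here is a hedged sketch of the listing-parsing half: hdfs_files is a hypothetical helper, and the loop wiring it into the script above appears only as comments.

```shell
# Keep only regular files (mode string starts with "-") from
# `hadoop fs -ls` output and print the last field, which is the path.
hdfs_files() {
	awk '/^-/ {print $NF}'
}

# Real usage (hedged; assumes the variables from the script above):
#   ${HADOOP_HOME}/bin/hadoop fs -ls /tmp | hdfs_files | while read f; do
#     ${HADOOP_HOME}/bin/hadoop distcp -update \
#         hdfs://${NAMENODE}:${NAMENODE_PORT}$f file://${TMPDIR}
#     # ...then oneshot ${TMPDIR}/`basename $f` as in the script...
#   done

# Demo against the listing shown earlier in this post:
hdfs_files <<'EOF'
drwxr-xr-x   - boris supergroup          0 2012-03-01 23:29 /tmp/_distcp_logs_csf7kx
-rw-r--r--   1 boris supergroup  235749070 2012-03-07 20:25 /tmp/access_log.1
EOF
```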

Another cool thing: if you want to distribute this data across a set of indexers, you can run this operation on a forwarder that is configured to auto load balance across the indexers.
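In outputs.conf terms, that forwarder setup looks roughly like this (a hedged sketch; the group name and indexer hosts are placeholders, not values from this post):

```ini
# Forwarder-side outputs.conf (placeholder hosts; adjust to your indexers)
[tcpout]
defaultGroup = hdfs_indexers

[tcpout:hdfs_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
autoLB = true
```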

Talk to me

I’d like to hear from you on ideas for improvements, errors/problems you see, etc.


Thanks to Amrit and Petter for answering my Splunk questions, and Sophy for editing.

I based my use of distcp on this blog post.

Happy Hadooping. Happy Splunking.

Boris Chen
