
There’s something to be said for the power of command line interfaces. For simple things, they are simple. For complex things, well, maybe not so simple. Fortunately, I have a simple problem: I want to index a single file from the Hadoop Distributed File System (HDFS). To do this, I’ll use the CLI for both Splunk and Hadoop.
There are a few things we want to take into account when we index a file. Normally, indexing a log file in Splunk means creating an input to “monitor” that file. This lets you index not only the file’s current contents but also subsequent appends. However, files in HDFS are typically historical, so in this case I don’t want to “monitor” for updates; I want the file indexed as a one-time operation. Additionally, if the file is not transferred entirely (for example, because of some interruption), I want the transfer to restart or retry, and, as far as possible, I want to avoid duplicate events and extraneous retransmissions. Finally, I want to ensure that Splunk saves the HDFS location into the “source” field (and sets the other relevant metadata fields).
We’ll Splunk the HDFS files in two steps:
- First, we’ll copy the file from HDFS to a local tmp directory. If the copy fails, this lets us retry without leaving partial results in Splunk (which could happen if we streamed the file straight into the index).
- Second, we’ll use “add oneshot” to index the local tmp file, since we don’t want to define a monitor input. (Both commands are sketched right after this list.)
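Here’s the whole flow at a glance, using the example paths from the rest of this post (name node on localhost:9000, local staging directory /tmp/hdfs2splunk, Splunk index “testindex”); each command is explained in the sections that follow:

./hadoop-1.0.0/bin/hadoop distcp -update hdfs://localhost:9000//tmp/access_log.1 file:///tmp/hdfs2splunk
./splunk/bin/splunk add oneshot /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -host localhost -rename-source hdfs://localhost:9000//tmp/access_log.1 -index testindex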
Getting Data Out of HDFS
Let’s say I have a file access_log.1 on HDFS in the /tmp directory.
[boris@hiro bin]$ ./hadoop fs -ls /tmp
Found 4 items
drwxr-xr-x   - boris supergroup          0 2012-03-01 23:29 /tmp/_distcp_logs_csf7kx
-rw-r--r--   1 boris supergroup  235749070 2012-03-07 20:25 /tmp/access_log.1
drwxr-xr-x   - boris supergroup          0 2012-03-01 14:55 /tmp/hadoop-boris
-rw-r--r--   1 boris supergroup      64286 2012-03-01 23:29 /tmp/splunkeventdata1330645953427
To copy the file, use distcp with the -update flag, which has rsync-like characteristics.
- distcp copies data in parallel (as a MapReduce job) and verifies that the transfer completed.
- -update ensures that a file that has already been copied (and is unchanged at the destination) will not be copied again.
Run the following:
Note: This assumes that your name node is localhost. If it isn’t, specify your name node. And, note those triple slashes after “file:”.
./hadoop-1.0.0/bin/hadoop distcp -update hdfs://localhost:9000//tmp/access_log.1 file:///tmp/hdfs2splunk
This will copy the HDFS file to /tmp/hdfs2splunk/access_log.1.
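If you want a quick sanity check before indexing (entirely optional), compare the size of the local copy against what HDFS reports; the paths below are the ones from this example:

# local copy staged by distcp
ls -l /tmp/hdfs2splunk/access_log.1
# original file in HDFS -- the sizes should match
./hadoop-1.0.0/bin/hadoop fs -ls /tmp/access_log.1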
Oneshotting
Now that the data is stored locally, you can use the Splunk CLI to add it to the index as a one-shot input and, via parameters, set various attributes of the event data.
% ./splunk/bin/splunk add oneshot /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -host localhost -rename-source hdfs://localhost:9000//tmp/access_log.1 -index testindex
This will:
- One shot the file /tmp/hdfs2splunk/access_log.1.
- Set the sourcetype to “hdfsfile”.
- Set the host to “localhost”, or whatever you specified as your name node.
- Set the source to the hdfs url.
- Add the input to the “testindex” index (assuming that the index exists; see the note just below if it doesn’t).
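If the index doesn’t exist yet, a quick way to create it from the same CLI is shown below (depending on your Splunk version, a restart may be needed before it’s usable):

# create the target index if it isn't there already
./splunk/bin/splunk add index testindex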
That’s it! If you use the search bar in Splunk and select “real-time search (all time)”, you can see the data streaming in.
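You can also check from the command line. This is just a sketch; the splunk CLI will prompt for credentials if you’re not already authenticated:

# count what has arrived in the index so far
./splunk/bin/splunk search 'index=testindex sourcetype=hdfsfile | stats count'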
Also, if you want to use the REST API to check the status, just go to:
https://localhost:8089/services/data/inputs/oneshot/%252Ftmp%252Fhdfs2splunk%252Faccess_log.1
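For example, with curl (the -k flag skips certificate verification for Splunk’s self-signed cert; substitute your own admin credentials):

curl -k -u admin:yourpassword https://localhost:8089/services/data/inputs/oneshot/%252Ftmp%252Fhdfs2splunk%252Faccess_log.1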
Nom on
One issue I find irritating about “add oneshot” inputs is that the operation is asynchronous. While that’s fine for some uses, it’s less so for others. Alternatively, there is an (unsupported) option called “nom on” that returns only when Splunk has finished reading the file (though it is still not a transactional operation: when it returns, the data may still be in buffers and not on disk yet).
% ./splunk/bin/splunk nom on /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -host localhost -rename-source hdfs://localhost:9000//tmp/access_log.1 -index testindex
Try it out. But if it doesn’t work, or fails in the future, do not call support (again, it’s unsupported). If it works for you, just be happy and enjoy your few minutes of entertainment. (On a more serious note, if you like the concept, or feel a supported call-back mechanism would be useful, I’d be interested in hearing from you.)
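One rough way to see the difference is simply to time the two commands; “add oneshot” should return almost immediately, while “nom on” should block until the file has been read. (Note that running both will index the same data twice, so this is strictly for illustration.)

time ./splunk/bin/splunk add oneshot /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -index testindex
time ./splunk/bin/splunk nom on /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -index testindex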
Sample Script: hdfs2splunk.sh
The following sample script accomplishes the steps described above: it takes a path to a file in HDFS as a parameter and indexes that file into Splunk. Before using it, customize the configuration variables in the “UPDATE THIS SECTION” area. Also, the script needs to run on the same host as Splunk.
#!/bin/sh
# set -x
# This will transfer a single file from Hadoop HDFS to a Splunk index.
# This has several config params. Before you use this, be sure to update them.

# UPDATE THIS SECTION (begin)
# hostname of the name node
HADOOP_NN_HOST=localhost
# port of the name node
HADOOP_NN_PORT=9000
# your splunk home dir
SPLUNK_HOME=./splunk
# your hadoop home dir
HADOOP_HOME=./hadoop-1.0.0
# tmp dir where the files from hdfs will be copied first before oneshotting
TMPDIR=/tmp/hdfs2splunk
# this is the index to put the data in, make sure it exists in Splunk
SPL_INDEX="testindex"
# UPDATE THIS SECTION (end)

# the src file you want to copy
HDFSFILENAME=$1
SCRIPTNAME=`basename $0`
# the tmp local file that will be "one shotted" into splunk
LOCALFILENAME=${TMPDIR}/`basename ${HDFSFILENAME}`
# full url to the file in hdfs
HDFSURL=hdfs://${HADOOP_NN_HOST}:${HADOOP_NN_PORT}/${HDFSFILENAME}

# the following are splunk metadata values for the data being indexed
# the sourcetype name
SPL_SOURCETYPE="hdfsfile"
# host value is being set to the name node host
SPL_HOST=${HADOOP_NN_HOST}
# source is the hdfs url to the file
SPL_SOURCE=${HDFSURL}

if [ -z "$1" ]; then
    echo "usage: ${SCRIPTNAME} <hdfs file path>"
    exit 1
fi

if [ ! -d ${TMPDIR} ]; then
    echo "${SCRIPTNAME}: making ${TMPDIR}"
    mkdir -p ${TMPDIR}
fi

echo "${SCRIPTNAME}: Copying file from ${HDFSURL}..."
${HADOOP_HOME}/bin/hadoop distcp -update ${HDFSURL} file://${TMPDIR}
echo "${SCRIPTNAME}: Copied data to ${LOCALFILENAME}"

if [ ! -f ${LOCALFILENAME} ]; then
    echo "${SCRIPTNAME}: ${LOCALFILENAME} not found"
    exit 1
fi

echo "${SCRIPTNAME}: One Shotting to Splunk..."
${SPLUNK_HOME}/bin/splunk add oneshot ${LOCALFILENAME} -sourcetype "${SPL_SOURCETYPE}" -host "${SPL_HOST}" -rename-source "${HDFSURL}" -index "${SPL_INDEX}"

# undocumented "nom on" command, if you prefer the call to block until Splunk has read the file:
#${SPLUNK_HOME}/bin/splunk nom on ${LOCALFILENAME} -sourcetype "${SPL_SOURCETYPE}" -host "${SPL_HOST}" -rename-source "${HDFSURL}" -index "${SPL_INDEX}"

echo "${SCRIPTNAME}: done"
Run hdfs2splunk.sh
Here is an example of how to run the script with the access_log.1 file mentioned above.
[boris@hiro hadoopish]$ ./hdfs2splunk.sh /tmp/access_log.1
The output will look something like this:
[boris@hiro hadoopish]$ ./hdfs2splunk.sh /tmp/access_log.1
hdfs2splunk.sh: Copying file from hdfs://localhost:9000//tmp/access_log.1...
12/03/07 21:14:29 INFO tools.DistCp: srcPaths=[hdfs://localhost:9000/tmp/access_log.1]
12/03/07 21:14:29 INFO tools.DistCp: destPath=file:/tmp/hdfs2splunk
12/03/07 21:14:30 INFO tools.DistCp: sourcePathsCount=1
12/03/07 21:14:30 INFO tools.DistCp: filesToCopyCount=1
12/03/07 21:14:30 INFO tools.DistCp: bytesToCopyCount=224.8m
12/03/07 21:14:30 INFO mapred.JobClient: Running job: job_201203011455_0011
12/03/07 21:14:31 INFO mapred.JobClient:  map 0% reduce 0%
12/03/07 21:14:47 INFO mapred.JobClient:  map 100% reduce 0%
12/03/07 21:14:52 INFO mapred.JobClient: Job complete: job_201203011455_0011
12/03/07 21:14:52 INFO mapred.JobClient: Counters: 21
12/03/07 21:14:52 INFO mapred.JobClient:   Job Counters
12/03/07 21:14:52 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15145
12/03/07 21:14:52 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/07 21:14:52 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/03/07 21:14:52 INFO mapred.JobClient:     Launched map tasks=1
12/03/07 21:14:52 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/03/07 21:14:52 INFO mapred.JobClient:   File Input Format Counters
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes Read=222
12/03/07 21:14:52 INFO mapred.JobClient:   File Output Format Counters
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes Written=8
12/03/07 21:14:52 INFO mapred.JobClient:   FileSystemCounters
12/03/07 21:14:52 INFO mapred.JobClient:     HDFS_BYTES_READ=235749450
12/03/07 21:14:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237613194
12/03/07 21:14:52 INFO mapred.JobClient:   distcp
12/03/07 21:14:52 INFO mapred.JobClient:     Files copied=1
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes copied=235749070
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes expected=235749070
12/03/07 21:14:52 INFO mapred.JobClient:   Map-Reduce Framework
12/03/07 21:14:52 INFO mapred.JobClient:     Map input records=1
12/03/07 21:14:52 INFO mapred.JobClient:     Physical memory (bytes) snapshot=83595264
12/03/07 21:14:52 INFO mapred.JobClient:     Spilled Records=0
12/03/07 21:14:52 INFO mapred.JobClient:     CPU time spent (ms)=3120
12/03/07 21:14:52 INFO mapred.JobClient:     Total committed heap usage (bytes)=124452864
12/03/07 21:14:52 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=558563328
12/03/07 21:14:52 INFO mapred.JobClient:     Map input bytes=122
12/03/07 21:14:52 INFO mapred.JobClient:     Map output records=0
12/03/07 21:14:52 INFO mapred.JobClient:     SPLIT_RAW_BYTES=158
hdfs2splunk.sh: Copied data to /tmp/hdfs2splunk/access_log.1
hdfs2splunk.sh: One Shotting to Splunk...
hdfs2splunk.sh: done
[boris@hiro hadoopish]$
Follow-on items
This script leaves room for improvement, which I’ll leave as an exercise for the reader. Some examples:
- Copy multiple files or directory trees (a quick sketch follows this list).
- Invoke distcp with -update, again (just to be sure), before indexing.
- Clean the tmp area after indexing is complete.
- Parameterize the Splunk metadata (sourcetype, host, index) on the script’s command line.
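For the first item, here is a minimal sketch of a wrapper that indexes every regular file under an HDFS directory, assuming hadoop and hdfs2splunk.sh live where the examples above put them (HDFS_DIR is a hypothetical source directory):

#!/bin/sh
HADOOP_HOME=./hadoop-1.0.0
HDFS_DIR=/tmp/logs
# lines for regular files in "hadoop fs -ls" output start with "-";
# the last field is the full HDFS path
${HADOOP_HOME}/bin/hadoop fs -ls ${HDFS_DIR} | awk '/^-/ {print $NF}' | while read f; do
    ./hdfs2splunk.sh "$f"
done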
Another cool thing: if you want to distribute this data across a set of indexers, you can run this operation on a forwarder that is configured to auto load balance across the indexers.
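A minimal sketch of what that forwarder configuration could look like, assuming two indexers (placeholder hostnames) listening on a typical receiving port (9997); it’s written as a shell heredoc so the whole thing can be pasted on the forwarder, followed by a restart:

cat >> ./splunk/etc/system/local/outputs.conf <<'EOF'
[tcpout:hdfs_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
autoLB = true
EOF
./splunk/bin/splunk restart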
Talk to me
I’d like to hear from you on ideas for improvements, errors/problems you see, etc.
Credits
Thanks to Amrit and Petter for answering my Splunk questions, and Sophy for editing.
The idea of using distcp for this came from this blog post: http://blog.rapleaf.com/dev/2009/06/11/multiple-ways-of-copying-data-out-of-hdfs/
Happy Hadooping. Happy Splunking.
----------------------------------------------------
Thanks!
Boris Chen