There’s something to be said about the power of command line interfaces. For simple things, they are simple. For complex things, well, maybe not so simple. Fortunately, I have a simple problem: I want to index a single file from a Hadoop Distributed File System (HDFS). To do this, I’ll use the CLIs for both Splunk and Hadoop.
There are a few things we want to take into account when we index a file. Normally, indexing a log file in Splunk means creating an input to “monitor” that file. This enables you to not only index the file’s current contents, but also index subsequent appends. However, the contents of an HDFS are typically historical files, so in this case, I don’t want to “monitor” for updates. Instead, I want it indexed as a one-time operation. Additionally, if the file is not transferred entirely (for example, there is some interruption), I want the transfer to restart/retry. If at all possible, I want to avoid duplicate events and extraneous retransmissions. Finally, I want to ensure that Splunk saves the associated HDFS location information into the “source” field (and sets the other relevant metadata information).
We’ll Splunk the HDFS files in two steps:

1. Copy the file from HDFS to the local file system using hadoop distcp.
2. Index the local copy as a one-shot input using the Splunk CLI.
Let’s say I have a file access_log.1 on HDFS in the /tmp directory.
[boris@hiro bin]$ ./hadoop fs -ls /tmp
Found 4 items
drwxr-xr-x   - boris supergroup          0 2012-03-01 23:29 /tmp/_distcp_logs_csf7kx
-rw-r--r--   1 boris supergroup  235749070 2012-03-07 20:25 /tmp/access_log.1
drwxr-xr-x   - boris supergroup          0 2012-03-01 14:55 /tmp/hadoop-boris
-rw-r--r--   1 boris supergroup      64286 2012-03-01 23:29 /tmp/splunkeventdata1330645953427
To copy the file, use distcp with the -update flag, which has rsync-like characteristics.
Run the following:
Note: This assumes that your name node is localhost. If it isn’t, specify your name node. And, note those triple slashes after “file:”.
./hadoop-1.0.0/bin/hadoop distcp -update hdfs://localhost:9000//tmp/access_log.1 file:///tmp/hdfs2splunk
This will copy the HDFS file to /tmp/hdfs2splunk/access_log.1.
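If you want to double-check that the transfer completed, you can compare byte counts between the HDFS source and the local copy. Here is a minimal sketch; the helper function name and the commented-out hadoop invocation are my own, and the paths assume the example above.

```shell
#!/bin/sh
# Sketch: verify a distcp copy by comparing byte counts.

# Compare two byte counts and report whether the copy looks complete.
check_copy_size() {
    hdfs_bytes=$1
    local_bytes=$2
    if [ "${hdfs_bytes}" = "${local_bytes}" ]; then
        echo "copy OK (${local_bytes} bytes)"
        return 0
    else
        echo "size mismatch: hdfs=${hdfs_bytes} local=${local_bytes}" >&2
        return 1
    fi
}

# Usage (commented out; requires a running HDFS):
# the 5th column of "hadoop fs -ls" output is the file size in bytes
# HDFS_BYTES=$(./hadoop-1.0.0/bin/hadoop fs -ls /tmp/access_log.1 | awk '{print $5}')
# LOCAL_BYTES=$(wc -c < /tmp/hdfs2splunk/access_log.1)
# check_copy_size "${HDFS_BYTES}" "${LOCAL_BYTES}"
```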
Now that the data is stored locally, you can use the Splunk CLI to add it as a one-shot input to the index, using parameters to set various attributes of the event data.
% ./splunk/bin/splunk add oneshot /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -host localhost -rename-source hdfs://localhost:9000//tmp/access_log.1 -index testindex
This will:

- index the contents of /tmp/hdfs2splunk/access_log.1 once, without monitoring it for updates
- set the sourcetype to hdfsfile and the host to localhost
- set the source field to the original HDFS location, hdfs://localhost:9000//tmp/access_log.1
- put the events into the testindex index
That’s it! If you use the search bar in Splunk and select “real-time search (all time)”, you can see the data streaming in.
Also, if you want to use the REST API to check the status, just go to:
https://localhost:8089/services/data/inputs/oneshot/%252Ftmp%252Fhdfs2splunk%252Faccess_log.1
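Those %252F sequences are the slashes in the local path, double-URL-encoded (/ becomes %2F, which becomes %252F). A small sketch for building that URL from an arbitrary path, plus a commented-out curl call (the credentials are placeholders, and this only handles slashes, not other characters that would need encoding):

```shell
#!/bin/sh
# Sketch: build the oneshot status URL for a local file path.

# Double-URL-encode the slashes in a file path: / -> %2F -> %252F.
oneshot_url_path() {
    printf '%s' "$1" | sed -e 's|/|%252F|g'
}

# Usage (commented out; requires a running Splunk instance):
# ENCODED=$(oneshot_url_path /tmp/hdfs2splunk/access_log.1)
# curl -k -u admin:changeme \
#     "https://localhost:8089/services/data/inputs/oneshot/${ENCODED}"
```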
One issue I find irritating about "add oneshot" inputs is that the operation is asynchronous. While this might be fine for some things, it’s less desirable for others. Alternatively, there is an (unsupported) option called "nom on" that returns only when Splunk has finished reading the file. (Though it is still not a transactional operation: when it returns, the data can still be in buffers and not yet on disk.)
% ./splunk/bin/splunk nom on /tmp/hdfs2splunk/access_log.1 -sourcetype hdfsfile -host localhost -rename-source hdfs://localhost:9000//tmp/access_log.1 -index testindex
Try it out. But, if it doesn’t work, or fails in the future, do not call support (again, it’s unsupported). If it works for you, just be happy and enjoy your few minutes of entertainment. (However, on a serious note, if you like it in concept, or feel a supported call-back mechanism would be useful, I’d be interested in hearing from you.)
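If you’d rather not rely on the unsupported command, one workaround I can sketch is to poll until the event count for the new source stops growing. The polling helper below is generic; the commented-out count_events function is my assumption about how you’d wire it to "splunk search" in your environment.

```shell
#!/bin/sh
# Sketch: poll a count-producing command until two consecutive
# readings match, i.e. the count has stopped growing.

wait_until_stable() {
    count_cmd=$1
    interval=${2:-5}
    prev=-1
    while :; do
        cur=$(${count_cmd})
        if [ "${cur}" = "${prev}" ]; then
            # count unchanged since last poll; assume indexing is done
            echo "${cur}"
            return 0
        fi
        prev=${cur}
        sleep "${interval}"
    done
}

# Usage (commented out; requires a running Splunk instance):
# count_events() {
#     ./splunk/bin/splunk search \
#         'index=testindex source="hdfs://localhost:9000//tmp/access_log.1" | stats count' \
#         -auth admin:changeme
# }
# wait_until_stable count_events 10
```

Note this only tells you the count has stopped changing between polls, not that every event is on disk, so it carries the same caveat as "nom on".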
The following sample script will accomplish the steps mentioned above. The script takes a file in HDFS as a parameter and indexes it into Splunk. This script requires that you customize several configuration variables in the “UPDATE THIS SECTION” area. Also, this script needs to be run on the same host as Splunk.
#!/bin/sh
# set -x

# This will transfer a single file from Hadoop HDFS to a Splunk index
# This has several config params. Before you use this, be sure to update them.

# UPDATE THIS SECTION (begin)
# hostname of the name node
HADOOP_NN_HOST=localhost
# port of the name node
HADOOP_NN_PORT=9000
# your splunk home dir
SPLUNK_HOME=./splunk
# your hadoop home dir
HADOOP_HOME=./hadoop-1.0.0
# tmp dir where the files from hdfs will be copied first before oneshotting
TMPDIR=/tmp/hdfs2splunk
# this is the index to put the data in, make sure it exists in Splunk
SPL_INDEX="testindex"
# UPDATE THIS SECTION (end)

# the src file you want to copy
HDFSFILENAME=$1
SCRIPTNAME=`basename $0`
# the tmp local file that will be "one shotted" into splunk
LOCALFILENAME=${TMPDIR}/`basename ${HDFSFILENAME}`
# full url to the file in hdfs
HDFSURL=hdfs://${HADOOP_NN_HOST}:${HADOOP_NN_PORT}/${HDFSFILENAME}

# the following are splunk metadata values for the data being indexed
# the sourcetype name
SPL_SOURCETYPE="hdfsfile"
# host value is being set to the name node host
SPL_HOST=${HADOOP_NN_HOST}
# source is the hdfs url to the file
SPL_SOURCE=${HDFSURL}

if [ -z "$1" ]; then
    echo "usage: ${SCRIPTNAME} <hdfs-file-path>"
    exit 1
fi

if [ ! -d ${TMPDIR} ]; then
    echo "${SCRIPTNAME}: making ${TMPDIR}"
    mkdir -p ${TMPDIR}
fi

echo "${SCRIPTNAME}: Copying file from ${HDFSURL}..."
${HADOOP_HOME}/bin/hadoop distcp -update ${HDFSURL} file://${TMPDIR}
echo "${SCRIPTNAME}: Copied data to ${LOCALFILENAME}"

if [ ! -f ${LOCALFILENAME} ]; then
    echo "${SCRIPTNAME}: ${LOCALFILENAME} not found"
    exit 1
fi

echo "${SCRIPTNAME}: One Shotting to Splunk..."
${SPLUNK_HOME}/bin/splunk add oneshot ${LOCALFILENAME} -sourcetype "${SPL_SOURCETYPE}" -host "${SPL_HOST}" -rename-source "${HDFSURL}" -index "${SPL_INDEX}"

# undocumented "nom on" command
#${SPLUNK_HOME}/bin/splunk nom on ${LOCALFILENAME} -sourcetype "${SPL_SOURCETYPE}" -host "${SPL_HOST}" -rename-source "${HDFSURL}" -index "${SPL_INDEX}"

echo "${SCRIPTNAME}: done"
Here is an example of how to run the script with the access_log.1 file mentioned above.
[boris@hiro hadoopish]$ ./hdfs2splunk.sh /tmp/access_log.1
The output will look something like this:
[boris@hiro hadoopish]$ ./hdfs2splunk.sh /tmp/access_log.1
hdfs2splunk.sh: Copying file from hdfs://localhost:9000//tmp/access_log.1...
12/03/07 21:14:29 INFO tools.DistCp: srcPaths=[hdfs://localhost:9000/tmp/access_log.1]
12/03/07 21:14:29 INFO tools.DistCp: destPath=file:/tmp/hdfs2splunk
12/03/07 21:14:30 INFO tools.DistCp: sourcePathsCount=1
12/03/07 21:14:30 INFO tools.DistCp: filesToCopyCount=1
12/03/07 21:14:30 INFO tools.DistCp: bytesToCopyCount=224.8m
12/03/07 21:14:30 INFO mapred.JobClient: Running job: job_201203011455_0011
12/03/07 21:14:31 INFO mapred.JobClient:  map 0% reduce 0%
12/03/07 21:14:47 INFO mapred.JobClient:  map 100% reduce 0%
12/03/07 21:14:52 INFO mapred.JobClient: Job complete: job_201203011455_0011
12/03/07 21:14:52 INFO mapred.JobClient: Counters: 21
12/03/07 21:14:52 INFO mapred.JobClient:   Job Counters
12/03/07 21:14:52 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15145
12/03/07 21:14:52 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/07 21:14:52 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/03/07 21:14:52 INFO mapred.JobClient:     Launched map tasks=1
12/03/07 21:14:52 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/03/07 21:14:52 INFO mapred.JobClient:   File Input Format Counters
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes Read=222
12/03/07 21:14:52 INFO mapred.JobClient:   File Output Format Counters
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes Written=8
12/03/07 21:14:52 INFO mapred.JobClient:   FileSystemCounters
12/03/07 21:14:52 INFO mapred.JobClient:     HDFS_BYTES_READ=235749450
12/03/07 21:14:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237613194
12/03/07 21:14:52 INFO mapred.JobClient:   distcp
12/03/07 21:14:52 INFO mapred.JobClient:     Files copied=1
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes copied=235749070
12/03/07 21:14:52 INFO mapred.JobClient:     Bytes expected=235749070
12/03/07 21:14:52 INFO mapred.JobClient:   Map-Reduce Framework
12/03/07 21:14:52 INFO mapred.JobClient:     Map input records=1
12/03/07 21:14:52 INFO mapred.JobClient:     Physical memory (bytes) snapshot=83595264
12/03/07 21:14:52 INFO mapred.JobClient:     Spilled Records=0
12/03/07 21:14:52 INFO mapred.JobClient:     CPU time spent (ms)=3120
12/03/07 21:14:52 INFO mapred.JobClient:     Total committed heap usage (bytes)=124452864
12/03/07 21:14:52 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=558563328
12/03/07 21:14:52 INFO mapred.JobClient:     Map input bytes=122
12/03/07 21:14:52 INFO mapred.JobClient:     Map output records=0
12/03/07 21:14:52 INFO mapred.JobClient:     SPLIT_RAW_BYTES=158
hdfs2splunk.sh: Copied data to /tmp/hdfs2splunk/access_log.1
hdfs2splunk.sh: One Shotting to Splunk...
hdfs2splunk.sh: done
[boris@hiro hadoopish]$
This script leaves room for improvement, which I’ll leave as an exercise to the reader.
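As one example of such an improvement: the script never removes the temporary local copy, so indexing many large files will fill up /tmp. A minimal cleanup sketch, assuming the same TMPDIR and LOCALFILENAME variables as the script above:

```shell
#!/bin/sh
# Sketch: remove the temporary local copy on any exit, normal or
# otherwise, by registering a cleanup handler with trap.

TMPDIR=/tmp/hdfs2splunk
LOCALFILENAME=${TMPDIR}/access_log.1

# Remove the temp copy; also registered to run when the script exits.
cleanup() {
    rm -f "${LOCALFILENAME}"
}
trap cleanup EXIT
```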
Another cool thing is that if you want to distribute this data to a set of indexers, you can run this operation on a forwarder that is configured to auto load balance across the indexers.
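For reference, a minimal outputs.conf sketch for such a forwarder; the group name, indexer hostnames, and port are placeholders you would replace with your own:

```ini
[tcpout]
defaultGroup = hdfs_indexers

[tcpout:hdfs_indexers]
autoLB = true
server = indexer1.example.com:9997,indexer2.example.com:9997
```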
I’d like to hear from you on ideas for improvements, errors/problems you see, etc.
Thanks to Amrit and Petter for answering my Splunk questions, and Sophy for editing.
My use of distcp here is based on this blog post: http://blog.rapleaf.com/dev/2009/06/11/multiple-ways-of-copying-data-out-of-hdfs/
Happy Hadooping. Happy Splunking.
----------------------------------------------------
Thanks!
Boris Chen