Hunk, HDFS, and Indexes

Update 9/27/16: As of Sept. 27, 2016, Hunk functionality has been incorporated into the Splunk Analytics for Hadoop Add-On and Splunk Enterprise versions 6.5 and later.

I’ve been asked a number of times why Hunk does not create a physical index like Splunk.

First, let me point out that your Hunk instance can search both physical and virtual indexes, allowing you to correlate data from disparate sources and stores within your farm without incurring the cost of duplication.

Now back to the question, which should really be: why can’t a physical index be created in HDFS?

HDFS is a non-POSIX filesystem. In layman’s terms, a POSIX filesystem is one that can be written to and read from in real time. One of HDFS’s shortcomings is that data is not persisted until the file is closed. Therefore, you cannot read data that “you think” has been written until the file is closed. To get around this limitation, some applications are designed to write in short bursts, close the file, then reopen it and append to it. But should there be some interruption (network glitch, power outage, …) before the file is closed, any data thought to be written would be lost forever unless you can replay it from the source. This is primarily why other solutions that create physical indexes from HDFS data leverage the local filesystem.
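To make the burst-write workaround and its failure mode concrete, here is a minimal sketch in Python. It does not talk to HDFS at all: `ToyStore` and `write_in_bursts` are hypothetical names invented for this illustration, and the store is a toy in-memory model that assumes the behavior described above, namely that records only become durable when the file is closed.

```python
class ToyStore:
    """Hypothetical stand-in for a store where data is durable only after close()."""

    def __init__(self):
        self.persisted = []    # records that readers can see
        self._pending = None   # records written to the currently open file

    def open_append(self):
        self._pending = []

    def write(self, record):
        if self._pending is None:
            raise RuntimeError("file is not open")
        self._pending.append(record)

    def close(self):
        # Only at close time do the pending records become durable.
        self.persisted.extend(self._pending)
        self._pending = None

    def crash(self):
        # Interruption before close: everything pending is lost.
        self._pending = None


def write_in_bursts(store, records, burst_size, crash_after=None):
    """Write records in short bursts, closing and reopening between bursts.

    crash_after is the number of records written before a simulated
    interruption (None means no interruption).
    """
    written = 0
    for start in range(0, len(records), burst_size):
        store.open_append()
        for record in records[start:start + burst_size]:
            if crash_after is not None and written == crash_after:
                store.crash()
                return
            store.write(record)
            written += 1
        store.close()


events = ["e1", "e2", "e3", "e4", "e5"]

store = ToyStore()
write_in_bursts(store, events, burst_size=2, crash_after=3)
print(store.persisted)  # ['e1', 'e2']
```

In this run the first burst (`e1`, `e2`) was closed and survives. The third event was written but its burst was never closed, so it is gone unless it can be replayed from the source. Smaller bursts shrink the window of loss but mean more close and reopen cycles.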

To support continuous writes, a complete rewrite of HDFS would be required. So far, there has been no significant effort in that direction by the community.

Hence, if you require real-time alerting and monitoring with Splunk, use Splunk Enterprise; if you wish to speed up your HDFS reports, use our report acceleration feature.

Julian Andre
