TIPS & TRICKS

Hunk troubleshoots Hadoop

It’s that time again, watching customers use Hunk 6.1!! Most users are able to get Hunk up and analyzing their Hadoop data in less than an hour. However, some users run into issues while trying to connect Hunk to their Hadoop cluster. In this post I will focus on how you can use Hunk to troubleshoot itself and other apps running on your Hadoop 2.x cluster.

The problem
Most first-time users download Hunk and try to get it up and running against one of the VMs provided by Hadoop vendors. While these VMs are extremely convenient, they are very limited in resources, and their Hadoop configurations are tuned accordingly. The low resource allocations can sometimes cause issues for Hunk while it tries to execute Yarn apps in Hadoop. In many cases the error messages exposed by Hadoop (and thus Hunk) are very generic and don’t necessarily lead to the root cause of the problem. You can look at some examples here, here and here. In many of those user issues the vast majority of the time was spent hunting down logs to find the correct stacktrace.

Some background info
Hadoop 2.x introduced log aggregation. This is a great feature intended to help application developers, Hadoop admins and Hadoop users get access to the application logs directly on HDFS without having to go digging into Hadoop nodes. While the promise of this feature is great, there are quite a number of hurdles for users to get to those logs. One of the big drawbacks of the aggregated logs is that they’re not in a plain text/original format; rather, they’re wrapped into an archive-like file. You can read more about other drawbacks on my Hadoop 2.0 Rant.

The solution
Splunk & Hunk are two great products for dealing with machine data, so it is quite natural to use either of them to look into these files and perform root cause analysis. So, wouldn’t it be cool if you could use Hunk to troubleshoot the environment where you’re trying to set it up (without having it fully set up yet)? While that might sound self-contradictory, not only is it cool but also very much possible. The solution is made possible by two components of Hunk that have been present since its inception: mixed mode search and the data preprocessing framework.

Mixed mode search enables Hunk to stream data from HDFS and process it locally (similarly to a non-local Map task), while the data preprocessing framework allows us to plug in classes that handle parsing of data that Hunk does not natively understand. Therefore, the two requirements for this solution to work are:

  • Hunk has been properly configured to at least access HDFS and
  • log aggregation is enabled
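
One way to check the second requirement is to look at the cluster's yarn-site.xml. The snippet below is a minimal Python sketch and not part of Hunk. It assumes the standard Hadoop property names yarn.log-aggregation-enable and yarn.nodemanager.remote-app-log-dir; the sample XML is made up for illustration, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative yarn-site.xml content (an assumption for this sketch); on a
# real cluster, read the file from your Hadoop configuration directory.
SAMPLE_YARN_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return the <name>/<value> pairs of a Hadoop config file as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def log_aggregation_settings(xml_text):
    """Return (enabled, remote_log_dir) from yarn-site.xml text.

    Missing properties fall back to the stock Hadoop defaults:
    aggregation disabled, logs under /tmp/logs.
    """
    props = read_properties(xml_text)
    enabled = props.get("yarn.log-aggregation-enable", "false").lower() == "true"
    log_dir = props.get("yarn.nodemanager.remote-app-log-dir", "/tmp/logs")
    return enabled, log_dir


enabled, log_dir = log_aggregation_settings(SAMPLE_YARN_SITE)
print("log aggregation enabled:", enabled)
print("aggregated logs directory:", log_dir)
```

The directory reported here is the one the virtual index's input path should point at.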

Configure a virtual index
To access the Yarn logs through Hunk (while setting it up or afterwards), you need to set up a virtual index that points to the location where the logs are being aggregated and specifies the appropriate data preprocessor. Below is an example of what such a virtual index should look like:

$SPLUNK_HOME/etc/apps/search/local/indexes.conf
...
[yarn-logs]
vix.provider = <name-of-your-provider>
vix.input.1.path = /tmp/logs/...
vix.input.1.recordreader = com.splunk.mr.proto.YarnLogsRecordReader

Start troubleshooting
Immediately after creating the virtual index you should be able to search through the Yarn logs. For example, while trying to set up Hunk against Hadoop 2.x, whenever I tried to run searches that would spawn an MR job, the search would fail with the following error message:

05-14-2014 15:53:07.875 ERROR ERP.yarn-ha -  java.io.IOException: Error while waiting for MapReduce job to complete, job_id=job_1389379145170_0321, state=FAILED, reason=Application application_1389379145170_0321 failed 1 times due to AM Container for appattempt_1389379145170_0321_000001 exited with  exitCode: 1 due to: 
05-14-2014 15:53:07.875 ERROR ERP.yarn-ha -  .Failing this attempt.. Failing the application.
05-14-2014 15:53:07.875 ERROR ERP.yarn-ha -  	at 
....

That error message is pretty generic. It simply tells us that we failed to bring up the AM container, but why did the container exit with exitCode 1?

Enter the magic of Hunk: search the yarn-logs index we just created for the particular application that failed, and it immediately becomes apparent that this is a classpath issue. Root cause found!!

[Screenshot: yarn-logs-search]

Note: experienced SPL users might ask why we don't simply run “index=yarn-logs source=*/application_1389379145170_0321/*” instead of using a filtering search. The reason is that adding any filtering predicates to the first search would cause Hunk to spawn an MR job, which is our original problem.
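
To make the filtering-search idea concrete, here is a small Python sketch that pulls the application ID out of an error line like the one above and builds a search string. The helper names are hypothetical, and the exact search from the screenshot is not reproduced in the text, so the "index=... | search source=..." form is an assumption based on the note above: the first search carries no predicates, and the filtering happens in a second pipeline stage.

```python
import re

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def extract_application_id(error_text):
    """Return the first YARN application ID found in error_text, or None."""
    match = APP_ID_PATTERN.search(error_text)
    return match.group(0) if match else None


def build_yarn_logs_search(app_id, index="yarn-logs"):
    """Build an SPL string that filters on source in a second pipeline stage.

    The first search stays free of predicates; the exact search used in
    the post is not shown, so this form is an assumption.
    """
    return "index={0} | search source=*/{1}/*".format(index, app_id)


# Shortened copy of the error message from the search log.
error_line = (
    "ERROR ERP.yarn-ha -  java.io.IOException: Error while waiting for "
    "MapReduce job to complete, job_id=job_1389379145170_0321, state=FAILED, "
    "reason=Application application_1389379145170_0321 failed 1 times"
)

app_id = extract_application_id(error_line)
print(build_yarn_logs_search(app_id))
```

The printed string can be pasted into the search bar, adjusting the index name if yours differs from yarn-logs.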

After troubleshooting Hunk and setting it up, you can continue to use the yarn-logs virtual index to troubleshoot issues with any other application running in your cluster – thus taking full advantage of Hadoop’s log aggregation feature!

----------------------------------------------------
Thanks!
Ledion Bitincka

Posted by Splunk