Connecting Splunk and Hadoop

I finally have some time to write about some cool features of one of the projects that I’ve been working on: Splunk Hadoop Connect. This app is our first step in integrating Splunk and Hadoop. In this post I will cover three tips on how the app can help you, all of them based on the new search command included in the app: hdfs. Before diving into the tips, I encourage you to download, install and configure the app. I’ve also put together two screencast videos to walk you through the installation process:

Installation and Configuration for Hadoop Connect
Kerberos Configuration
You can also find the full documentation for the app here

The search command: hdfs
This new search command is capable of reading data and metadata from HDFS at search time. Because the data is read at search time rather than indexed, it incurs no license costs: FREE!!! The general syntax is as follows:
hdfs <directive> <directive-options>

Currently we provide three directives: ls, lsr and read, for listing a directory/file, recursively listing a directory and reading the contents of a file respectively.

With the basics out of the way, let’s move on to the interesting stuff.

Hadoop based lookups
You’ve probably heard that Hadoop is a batch processing system and, as such, is not well suited for low-latency or interactive applications. Splunk lookups are one such application; after all, a lookup is nothing more than a join of Splunk results with some external information. For lookups to be efficient, they need low-latency responses from the external source of information. So, how can we marry these two? Answer: the hdfs search command! Imagine we have a process that generates a lookup file in HDFS at a particular path, say /home/webapp/user_info.tsv, and that the file contains the following fields in a tab-delimited format: user, first_time, last_time, purchase_count. We can use the following Splunk search to bring this lookup into Splunk, where it can be used to efficiently enrich Splunk results:

| hdfs read hdfs:// delim="\t" fields="user,first_time,last_time,purchase_count" | outputlookup userinfo

Given that this is just a regular Splunk search, you can schedule it and it will periodically pull the lookup into Splunk; just make sure the schedule matches that of the process that updates the lookup info in HDFS. Also, remember that after the hdfs command you can use any Splunk command to update/filter/enrich the lookup information before outputting it (read on for some examples).
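For readers who want to see what "a lookup is just a join" means outside of Splunk, here is a minimal Python sketch of the same idea. It is not how Splunk or the app implements lookups; it only mirrors the delim and fields options of the search above. The sample rows, the event data, and the function names load_lookup and enrich are all made up for illustration.

```python
import csv
import io

# Hypothetical sample of the tab-delimited lookup file; the rows are invented.
SAMPLE_TSV = (
    "alice\t2012-01-03\t2012-11-20\t7\n"
    "bob\t2012-05-14\t2012-10-02\t2\n"
)

# Field names taken from the post; the file itself has no header row.
FIELDS = ["user", "first_time", "last_time", "purchase_count"]


def load_lookup(text, fields=FIELDS, delim="\t"):
    """Parse headerless delimited text into a dict keyed by the user field."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=fields, delimiter=delim)
    return {row["user"]: row for row in reader}


def enrich(events, lookup):
    """Left-join each event with its lookup row, matching on the user field."""
    enriched = []
    for event in events:
        extra = lookup.get(event.get("user"), {})
        # Event fields win if a name appears on both sides.
        enriched.append({**extra, **event})
    return enriched


lookup = load_lookup(SAMPLE_TSV)
events = [
    {"user": "alice", "action": "checkout"},  # has a lookup row
    {"user": "carol", "action": "browse"},    # no lookup row, passes through unchanged
]
for row in enrich(events, lookup):
    print(row)
```

The join is a dictionary access per event, which is why having the lookup local matters: each enrichment needs an answer immediately, not after a batch job finishes.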

MapReduce result visualization
Imagine this scenario: you have some data in HDFS, and you’ve written some magical MapReduce jobs that crunch this data and produce some result files in HDFS. You also have some data in Splunk, you’ve written some searches that process this data and visualize the results as charts/tables/etc., and you’ve managed to create some dashboards that everyone loves :)

Now, wouldn’t it be cool if you were able to visualize your MapReduce results in the same way as your Splunk searches?

Good news! You can! The hdfs search command makes it trivial! To continue the earlier example, imagine you have done some analysis that computes, for each of your users, the likelihood of that user returning to your web store, and you want to see the probability profile of the 100 users that are most likely to return. Here’s the Splunk search that visualizes your data:

| hdfs read hdfs:// delim="\t" fields="user,return_prob" | sort - return_prob | head 100 | chart sum(return_prob) as return_prob by user

As I mentioned in the previous section, after the initial hdfs search command you can use any Splunk search command to manipulate the results as you wish. And because this is just a regular Splunk search, you can show its results in a dashboard panel just like you can with any other Splunk search. You can also create a dashboard that contains some panels powered by Splunk searches and some populated with data from HDFS. Isn’t that cool?

HDFS usage accounting
You’ve got Hadoop, you’ve got HDFS and everyone in your company is trying to get a piece of it :), but some users/groups can be greedy and consume a lot more resources than you want. So, how do you account for disk usage by user/group in your Hadoop cluster(s)? You guessed it: the hdfs search command. Here are some cool statistics that you can easily obtain by using the hdfs search command and its lsr directive:

Show the 10 users that are consuming the most disk space:

| hdfs lsr hdfs:// | stats sum(size) as total by user | sort 10 -total

Show the 10 users with the largest number of files:

| hdfs lsr hdfs:// | search type=file | stats count as total by user | sort 10 -total

Show the 10 largest files and their owners:

| hdfs lsr hdfs:// | search type=file | sort 10 -size | table size, user, path
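If SPL is new to you, the three searches above boil down to a group-by, a filtered count, and a filtered sort. The Python sketch below reproduces that logic over a small in-memory listing. The field names (user, size, type, path) come from the searches; the sample rows, the "dir" type value, and the function names are assumptions made up for illustration, not the actual output of lsr.

```python
from collections import Counter, defaultdict

# Hypothetical listing rows standing in for lsr output; values are invented.
LISTING = [
    {"user": "etl", "type": "dir", "size": 0, "path": "/data"},
    {"user": "etl", "type": "file", "size": 500, "path": "/data/a.log"},
    {"user": "etl", "type": "file", "size": 300, "path": "/data/b.log"},
    {"user": "web", "type": "file", "size": 900, "path": "/home/web/c.tsv"},
]


def top_disk_users(rows, n=10):
    """Like: stats sum(size) as total by user | sort 10 -total"""
    totals = defaultdict(int)
    for row in rows:
        totals[row["user"]] += row["size"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def top_file_owners(rows, n=10):
    """Like: search type=file | stats count as total by user | sort 10 -total"""
    counts = Counter(row["user"] for row in rows if row["type"] == "file")
    return counts.most_common(n)


def largest_files(rows, n=10):
    """Like: search type=file | sort 10 -size | table size, user, path"""
    files = sorted(
        (row for row in rows if row["type"] == "file"),
        key=lambda row: row["size"],
        reverse=True,
    )
    return [(row["size"], row["user"], row["path"]) for row in files[:n]]


print(top_disk_users(LISTING))
print(top_file_owners(LISTING))
print(largest_files(LISTING))
```

Note that the first search sums size over every listed entry, directories included, while the other two filter on type=file first; the sketch keeps that difference.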

I hope you’ve enjoyed these tips, and I encourage you to stay tuned for more. Feel free to comment below with your own use cases.

Posted by Ledion Bitincka