Analyze Data with Hunk on Amazon EMR

In this post you will learn how to use Hunk to process data with an Amazon EMR cluster. We will go through the following steps:

  1. Creating a Hunk EC2 instance
  2. Creating an Amazon EMR cluster
  3. Configuring Hunk with EMR to analyze data in an S3 bucket

**SECURITY NOTE** Before we start, a quick but very important note about network security: you need to make sure that the Hunk instance can communicate freely (i.e., traffic is allowed to and from all ports) with ALL EMR cluster nodes, master and slaves alike. Edit the Security Groups in the EC2 Management page to meet this requirement.

Create a Hunk instance on AWS EC2

The most convenient way to create an EC2 instance with Hunk is to use the Hunk AMI directly from AWS Marketplace (https://aws.amazon.com/marketplace/pp/B00GIZK2QI). The AMI is public and free to use, although typical EC2 hourly fees apply. It includes Hunk installed, the Hunk installer package (which will be needed later to distribute to DataNodes), Hadoop libraries, as well as Java – all in a Linux x64 base.


Choose an instance type with enough resources for your needs. I would recommend a minimum of m1.xlarge.

Proceed through the remaining setup screens until the EC2 instance is up and running. Extra storage is not required, but feel free to add some if you need it. Also, make sure you select a key pair and note its name, since you will need it to connect to the instance.

Connect to the Hunk instance

To connect to the Hunk instance you just provisioned, open a terminal and SSH to its Public DNS address:

$ ssh -i my_key.pem ec2-user@<public-dns-address>

Navigate to /opt and note the directory layout. A brief description of each directory:

/opt/hadoop contains the Hadoop libraries. For now, only vanilla Apache Hadoop 1.0.3 and 2.2.0 are located here.

/opt/java is where a modern version of Java resides. The latest installed in the AMI is Java 1.7 U45.

/opt/splunk contains the actual Hunk installation.

/opt/splunk_packages is where tar.gz Hunk install bits reside. The current package is: splunk-6.0-184175-Linux-x86_64.tgz

To make it easy to interact with EMR (i.e., read from HDFS/S3n and run MR jobs), all of the above directories are recursively owned by user and group hadoop.

Start Hunk

To start Hunk, run the following commands from your SSH window:

[Screenshot: commands to start Hunk]
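
The exact commands appear only in the screenshot. As a minimal sketch, assuming the /opt/splunk install path listed above and that you run it as the hadoop user that owns the installation (for example after `sudo su - hadoop`, which is my assumption rather than something stated in the text), starting Hunk might look like this:

```shell
# Assumption: Hunk lives in /opt/splunk, as in the directory list above.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
splunk_bin="$SPLUNK_HOME/bin/splunk"

if [ -x "$splunk_bin" ]; then
    # The first start prompts you to accept the license agreement.
    "$splunk_bin" start
else
    # Nothing is started if the binary is not where we expect it.
    echo "No executable found at $splunk_bin; adjust SPLUNK_HOME" >&2
fi
```

The guard keeps the snippet harmless on a machine where Hunk is installed elsewhere; set SPLUNK_HOME beforehand to point it at a different location.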

Go through the agreement steps and note the port where Hunk is running; default is 8000.

Point your browser of choice to the Public DNS URL and log in:

http://<public-dns-address>:8000

Default credentials are admin/changeme. You will be asked to change the password, and afterwards you will be presented with the classic Splunk 6 interface.

Create an EMR cluster

There are at least two ways to create an AWS EMR cluster: via the AWS Console, or from the command line using the EMR tools. Since I have the tools installed, I will launch the cluster using the latter method. To provide analytics and insights on data in Hadoop, Hunk does not need or use other applications such as Hive or Pig. So if you are creating the cluster from the AWS Console, you can simply de-select them, as we only need an interactive EMR cluster. Enter a cluster name and select your desired logging and debugging options.

Software: Choose the Amazon Hadoop distribution, with the latest AMI version: 2.4.2 (Hadoop 1.0.3) – latest

Hardware: Select an m1.medium for Master, count=1, and m1.xlarge for Core/Slaves, count=3.

Security and Access: Make your appropriate selections here. I chose to select the same key-pair as my EC2 instance above.

Bootstrap Actions and Steps: Make your own selections here. I chose not to have any.

The equivalent of this from the command line is:

$ ./elastic-mapreduce --create --alive --name "my_hunk_emr_cluster" --ami-version latest --master-instance-type m1.medium --slave-instance-type m1.xlarge --num-instances 4 --key-pair my_key

This command creates an EMR cluster named “my_hunk_emr_cluster” off of the “latest” AMI with three slave m1.xlarge nodes and one m1.medium master node.
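
Note that --num-instances counts the master together with the slaves, which is why three slaves require a value of 4. The following dry-run sketch spells out that arithmetic; the variable names are illustrative, and the script only prints the command rather than launching anything:

```shell
# Dry run: prints the elastic-mapreduce command instead of executing it.
master_count=1      # one master node
slave_count=3       # core/slave nodes wanted

# --num-instances is the total node count, master included.
num_instances=$((master_count + slave_count))

emr_cmd="./elastic-mapreduce --create --alive --name \"my_hunk_emr_cluster\" \
--ami-version latest --master-instance-type m1.medium \
--slave-instance-type m1.xlarge --num-instances ${num_instances} --key-pair my_key"

echo "$emr_cmd"
```

Changing slave_count is then the only edit needed to size the cluster differently.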


Configure Hunk with EMR cluster and S3n bucket

Hunk is able to work with data in both HDFS and S3. In this case we’re working with the assumption that data resides in S3n (native), although HDFS local to the cluster performs much better.

Let’s now configure Hunk with our freshly created EMR cluster. While logged in to Hunk, go to Settings, Virtual Indexes and click on New Provider. Enter a Name of your liking. For Java Home and Hadoop Home you can use the values below. Modify the Job Tracker, File System, and HDFS Working Directory to correspond to your Master address and your S3 bucket respectively:

Name:              my-emr-provider
Java Home:         /opt/java/latest
Hadoop Home:       /opt/hadoop/apache/hadoop-1.0.3
Hadoop Version:    Hadoop 1.x (MRv1)
Job Tracker:       <internal master ip>:9001
File System:       s3n://<AWS Access Key>:<AWS Secret>@<bucket name>
HDFS working dir:  /working-dir (in my case, this is a folder at the root of the bucket above)

Add a new setting, at the bottom, to tell Hunk what package to distribute to DataNodes:

vix.splunk.setup.package: /opt/splunk_packages/splunk-6.0-184175-Linux-x86_64.tgz
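
The same provider can also be described in indexes.conf. The stanza below is a sketch that mirrors the fields entered above. The vix.* key names other than vix.splunk.setup.package are written as I recall them from the Hunk documentation, so check them against your version before relying on them. The UI may also write additional keys (for example, for the Hadoop version) that are not shown here.

```ini
# indexes.conf (sketch; placeholders in <...> must be replaced with your values)
[provider:my-emr-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /opt/java/latest
vix.env.HADOOP_HOME = /opt/hadoop/apache/hadoop-1.0.3
vix.mapred.job.tracker = <internal master ip>:9001
vix.fs.default.name = s3n://<AWS Access Key>:<AWS Secret>@<bucket name>
vix.splunk.home.hdfs = /working-dir
vix.splunk.setup.package = /opt/splunk_packages/splunk-6.0-184175-Linux-x86_64.tgz
```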


Save the configuration, and proceed to create a new virtual index. In our case, we’re naming the index emr-index and configuring it to read the Apache web server logs.


Logs reside in a folder called logs in the base of our bucket and they are compressed in .tar.gz format.
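
Expressed as an indexes.conf stanza, the virtual index might look like the sketch below. It assumes the provider name my-emr-provider used earlier and the logs folder at the base of the bucket; the vix.input.1.* key names are as I recall them from the Hunk documentation, and the accept pattern is my own guess at a regular expression matching the .tar.gz files.

```ini
# indexes.conf (sketch)
[emr-index]
vix.provider = my-emr-provider
# The trailing ... asks Hunk to recurse into subdirectories of /logs
vix.input.1.path = /logs/...
# Assumption: only pick up the compressed archives
vix.input.1.accept = \.tar\.gz$
```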


Click Save and return to the Search app.

In the search bar, enter “index=emr-index | head 10” and observe events streaming from our bucket.

Not many interesting fields are extracted from our events by default, so let’s add access-extractions to our source. This configuration applies field extractions for logs in Apache access combined format. Go to Settings, Fields and create a new extraction.

Note that the named source field reads exactly /logs/access... where the three dots (...) are a wildcard that matches recursively through subdirectories, as described in the props.conf documentation: http://docs.splunk.com/Documentation/Splunk/latest/Admin/propsconf. Change the path to fit your own, then save.
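
For reference, the same extraction can be expressed directly in props.conf. This is a minimal sketch that assumes the source path shown above. The class name "access" after REPORT- is an arbitrary label of my choosing, while access-extractions is the transform named in the text.

```ini
# props.conf (sketch)
[source::/logs/access...]
REPORT-access = access-extractions
```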

Return to the search bar and run the same search again. Note the additional fields on the left-hand side of the screen.

Create a Dashboard with two panels

Let’s assume we need to see the top clients and an overall chart of traffic over time. For this we will need two searches to power the two panels in our dashboard:

(1) Top clientip search: index=emr-index | top clientip


(2) Timechart Search: index=emr-index | timechart count


Adding these two searches to a dashboard (by clicking Save As > Dashboard Panel) produces a simple, two-panel dashboard.

For more information on Hunk, and Splunk in general, please visit their corresponding Docs sites:

Hunk:    http://docs.splunk.com/Documentation/Hunk/latest/Hunk/MeetHunk
Splunk:  http://docs.splunk.com/Documentation/Splunk

----------------------------------------------------
Thanks!
Dritan Bitincka

Posted by Splunk
