At the end of October, Splunk announced the release of a new product called Hunk: Splunk Analytics for Hadoop. Once you get over the awesome name, you realize how much of a game-changer it is to give individuals across the organization the ability to interactively explore, analyze and visualize data stored natively in Hadoop.
(Admittedly that sounds like marketing, but in this case it’s also true.)
At the recent Strata Hadoop World conference, the Cisco Labs team showcased best practices for rapidly deploying Big Data clusters with predictable performance and massive scale using Cisco Nexus, UCS and other tools – including Splunk Enterprise and Hunk.
The Cisco Labs team is tasked with evaluating the infrastructure challenges associated with Hadoop deployments and establishing industry best practices to address them. The network environment supporting Hadoop is particularly complex. For example, you may see that one port has no utilization while others are over-utilized during a job. To prioritize and schedule jobs efficiently, you need to know which jobs are running during peak buffer usage and integrate that usage data with CPU data and network utilization data, all in real time.
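The core of that integration step is lining up metrics from different sources on a common timeline. A minimal sketch of the idea in Python, assuming hypothetical sample tuples of (timestamp, metric name, value) since the actual collection format isn't shown in the post:

```python
from collections import defaultdict

def correlate(samples):
    """Group metric samples into per-minute buckets keyed by timestamp.

    Each sample is (timestamp_seconds, metric_name, value). The field
    names and one-minute bucket width are illustrative assumptions.
    """
    buckets = defaultdict(dict)
    for ts, metric, value in samples:
        buckets[ts // 60 * 60][metric] = value  # align to 1-minute buckets
    return dict(buckets)

samples = [
    (1005, "buffer_usage", 0.82),
    (1010, "cpu_load", 7.5),
    (1015, "net_util", 0.64),
]
print(correlate(samples))
# all three samples land together in the 960-second bucket
```

In practice Splunk does this kind of time-based correlation for you at search time; the sketch just shows why a shared timestamp is the key that makes buffer, CPU and network data comparable.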
Originally the team manually compiled and charted this information, which took hours to gather and wasn’t anything close to real-time. To streamline and automate this process, the Cisco team thought of Splunk. They invited me to help them integrate Splunk Enterprise and the beta version of Hunk within their infrastructure.
We started off by installing the Splunk Hadoop Ops App to ingest data from their 16-node Hadoop cluster. The cluster was running Cloudera CDH, so the app worked right out of the box.
As you can see below, we used Splunk to view the progress of the jobs as well as the CPU load they produced. We collected performance data from the new network switches by running custom Python scripts that reported buffer usage at regular intervals. Individual port utilization was collected via another script, the output of which was the cumulative bytes transferred.
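Since the port-utilization script emits cumulative byte counters, turning them into per-interval throughput just means differencing consecutive readings. A small sketch of that step, with the (timestamp, total_bytes) reading format being an assumption on my part:

```python
def interval_deltas(cumulative):
    """Convert cumulative byte counters into per-interval byte counts.

    `cumulative` is a list of (timestamp, total_bytes) readings, the
    kind of output a counter-style collection script might produce.
    """
    deltas = []
    for (t0, b0), (t1, b1) in zip(cumulative, cumulative[1:]):
        deltas.append((t1, b1 - b0))
    return deltas

readings = [(0, 0), (10, 1_200_000), (20, 3_600_000)]
print(interval_deltas(readings))  # [(10, 1200000), (20, 2400000)]
```

A real collector would also need to handle counter wraparound and switch reboots, which reset the cumulative total; the sketch leaves that out for clarity.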
The end goal was to see how different input variables would affect the runtime of jobs. The variables we could change were disk speed (SSD vs spinning disks) and network speed (10Gb vs 40Gb).
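With two variables at two levels each, the test plan is a simple cross product of configurations. A sketch of enumerating the runs, with the option labels being my own shorthand for the hardware choices named above:

```python
from itertools import product

# Illustrative labels for the two variables the team could change.
disk_options = ["ssd", "spinning"]
network_options = ["10Gb", "40Gb"]

# Enumerate every disk/network combination to benchmark.
configs = list(product(disk_options, network_options))
for disk, net in configs:
    print(f"run job with disk={disk}, network={net}")
# 4 runs total: each disk type paired with each network speed
```

Running the same job under each of the four combinations is what lets you attribute runtime differences to a single input variable.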
All of the collected data can be written back to the Hadoop cluster, enabling us to use Hunk to create a series of reports showing the results.
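For Hunk to search the results in place, the collected metrics just need to land on the cluster in a self-describing format such as newline-delimited JSON. A minimal sketch of that write step; the path layout and field names are assumptions, and in the actual setup the files would go to HDFS (for example via `hdfs dfs -put`) rather than a local directory:

```python
import json
import os
import tempfile

def write_results(records, out_dir):
    """Write collected metrics as newline-delimited JSON.

    One JSON object per line is a format Hunk (and Splunk generally)
    can index without any custom parsing.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "run_results.json")
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path

# Hypothetical record shape for one benchmark run.
records = [{"disk": "ssd", "network": "40Gb", "runtime_s": 412}]
out_path = write_results(records, tempfile.mkdtemp())
print(out_path)
```

The point of writing results back to the cluster, rather than exporting them elsewhere, is that Hunk can then query them natively alongside the rest of the Hadoop data.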
The team not only eliminated hours of manual effort and gained real-time visibility into their environment, but also delivered an impressive overall presentation at Strata Hadoop World that got a lot of people thinking about streamlined Hadoop infrastructures.
You can check out the scripts used and other examples at Cisco's Datacenter GitHub repository.