Introducing Shep

These are exciting times at Splunk, and for Big Data. During the 2011 Hadoop World, we announced our initiative to combine Splunk and Hadoop in a new offering. The heart of this new offering is an open source component called Shep. Shep will enable seamless two-way data flow between the two systems, and will open up two-way compute operations on data residing in both.

Use Cases

The thing that intrigues us most is the synergy between Splunk and Hadoop. The ways to integrate are numerous, and as the field evolves and the project progresses, we can see more and more opportunities to provide powerful solutions to common problems.

Many of our customers are indexing terabytes per day, and have also spun up Hadoop initiatives in other parts of the business. Splunk integration with Hadoop is part of a broader goal at Splunk to break down data silos and make data available across the enterprise, no matter what the source. Here are some key use cases we're focused on:

  • Query both Splunk and Hadoop data, using Splunk as a “single-pane-of-glass”
  • Data transformation utilizing Splunk search commands
  • Real-time analytics of data streams going to multiple destinations
  • Splunk as a data warehouse/mart for targeted exploration of HDFS data
  • Data acquisition from logs and APIs via the Splunk Universal Forwarder

From these, we derive some key features.


Several building blocks are needed to make the above happen. We're starting with basic API and connectivity integration as a foundation for seamless data flow and co-compute scenarios, so that one can use Hadoop when necessary, but turn to Splunk where speed, simplicity, and power are needed.

Specific features that are currently in beta:

  1. Input/output format classes, so that MapReduce jobs in Hadoop can operate on data in Splunk
  2. Splunk “bucket” mover, for configurable rolling of data from Splunk to HDFS
  3. Real-time streaming from Splunk to HDFS
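To give a feel for the bucket-mover idea, here is one way such rolling could be wired up. This is a hypothetical sketch, not Shep's actual configuration: Splunk already exposes a `coldToFrozenScript` hook in indexes.conf that hands an aging bucket directory to a script, which could in turn copy it into HDFS (for example via `hadoop fs -put`). The app path and script name below are illustrative assumptions.

```ini
# indexes.conf -- hypothetical example; the app path and script name
# are illustrative, not Shep's actual configuration.
[main]
# Instead of deleting buckets when they roll to frozen, pass each
# bucket directory to a script that copies it into HDFS.
coldToFrozenScript = "$SPLUNK_HOME/etc/apps/shep/bin/roll_to_hdfs.sh"
```

The appeal of this kind of hook is that the data lands in HDFS in Splunk's own bucket format, ready for downstream MapReduce jobs, without any change to how the indexer operates day to day.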

Features under active development:

  1. Search language integration with Hadoop
  2. Monitoring of files and directories in HDFS for Splunk Indexing
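Until native HDFS monitoring lands, a Splunk scripted input is one way to approximate the idea today. The sketch below is an assumption for illustration, not a Shep feature: the script name, paths, sourcetype, and index are all hypothetical, and the script itself would wrap standard Hadoop shell commands such as `hadoop fs -ls` and `hadoop fs -cat`, which assumes the Hadoop CLI is available on the forwarder.

```ini
# inputs.conf -- hypothetical example; script name, paths, sourcetype,
# and index are illustrative.
[script://./bin/hdfs_cat.sh]
# Periodically emit the contents of new files under an HDFS directory,
# e.g. by wrapping `hadoop fs -ls` and `hadoop fs -cat`.
interval = 60
sourcetype = hdfs_data
index = main
```

Native monitoring would improve on this by tracking which files and byte ranges have already been indexed, the way Splunk's file monitor does for local directories.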

Hadoop users know that MapReduce and HDFS are only the foundation of Hadoop, not the whole picture; complete solutions require other components as well. Even with ecosystem components such as Hive, HBase, Flume, and Pig, significant gaps remain, and Splunk can bridge many of them. Splunk has the benefit of being battle-proven on the front lines of the datacenter, and our customers have been asking how they can use Splunk alongside their Hadoop projects to simplify the current potpourri of open source components of varying maturity, complexity, and stability. We've taken the first step in that direction.

Hadoop Components

Shep is Beta!

Shep is currently in private beta, available on both Splunkbase and GitHub. To sign up, go here: . Since our announcement, we've received a tremendous response, and we're adding people incrementally, so if you've registered, please be patient (we'll get to you soon). Once we reach critical mass and Shep progresses far enough, we'll move to a public beta, so stay tuned!


Splunk Enterprise with Hadoop Architecture

Posted by Boris Chen