TIPS & TRICKS

Hadoop and Splunk Use Cases

Customer Examples – Using both Splunk and Hadoop

The Splunk and Hadoop communities can benefit from each other’s strengths. Below are several examples of customers that use both environments.

Use Case | Description
1 – Splunk then Hadoop | Splunk collects, visualizes, and analyzes the data, then passes it to Hadoop for ETL and other batch processing
2 – Hadoop then Splunk | Hadoop collects the data and passes the results to Splunk for visualization
3 – Data flows in both directions | Splunk and Hadoop collect different artifacts and share the data that Hadoop needs for ETL or batch analytics and that Splunk needs for real-time analysis and visualization
4 – Side-by-Side | Both Splunk and Hadoop are used by the organization, but for different use cases and with no integration
5 – Splunk monitors Hadoop | Splunk's core capabilities around monitoring IT infrastructure are extended to Hadoop
 

1 – Splunk then Hadoop

[Figure: flowchart of data sources, Hadoop, and Splunk]

A Discounted Gift Certificates Company:

Splunk is a primary tool this company uses to work with big data and gain real-time operational intelligence from its infrastructure. The company uses Splunk for application management, security, performance management, analytics for its public APIs, and funnel analysis of click-through rates in order to understand and optimize ad placements.

Teradata also uses part of the data that Splunk collects and visualizes. To prepare the unstructured data for Teradata, this discounted gift certificates company uses Hadoop Hive for Extract, Transform, and Load (ETL).

The discounted gift certificates company collects and indexes massive streams of machine data in real time using Splunk forwarders and indexers. Splunk then lets multiple users across the organization search, analyze, and visualize the data. The company then sends a subset of the raw events to HDFS in a reliable, predictable way. In Hadoop, it runs Hive queries to transform the data into a format that Teradata can consume.
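The hand-off described above, selecting a subset of raw events and laying them out so Hive can query them, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event format (a timestamp followed by key=value pairs), the column names, and the helper functions are all hypothetical, since the customer's actual schema and export mechanism are not described here. The output is tab-delimited text, a layout that a Hive table declared with tab-delimited fields could read.

```python
import re

# Hypothetical column layout; the article does not specify the event schema.
COLUMNS = ["timestamp", "api", "status", "latency_ms"]

# Matches key=value pairs where the value is either quoted or a bare token.
PAIR = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_event(raw):
    """Split one raw event into its leading timestamp plus key=value fields."""
    timestamp, _, rest = raw.partition(" ")
    fields = {"timestamp": timestamp}
    for key, value, quoted in PAIR.findall(rest):
        # Use the unquoted inner text when the value was wrapped in quotes.
        fields[key] = quoted if value.startswith('"') else value
    return fields


def to_delimited_row(fields, columns=COLUMNS):
    """Render the selected columns as one tab-delimited row.

    Missing fields become empty strings, and embedded tabs or newlines are
    replaced with spaces so that each event stays on a single row.
    """
    cleaned = (
        fields.get(col, "").replace("\t", " ").replace("\n", " ")
        for col in columns
    )
    return "\t".join(cleaned)


def export_subset(raw_events, keep):
    """Yield delimited rows for only the events the `keep` predicate accepts."""
    for raw in raw_events:
        fields = parse_event(raw)
        if keep(fields):
            yield to_delimited_row(fields)
```

For example, `export_subset(events, lambda f: f.get("status") == "500")` would yield rows only for events whose (hypothetical) `status` field is 500. Writing those rows to HDFS and defining the Hive table are separate steps that this sketch leaves out.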

[Figure: flowchart of Splunk, Hadoop Hive, and Teradata]

 

2 – Hadoop then Splunk

 

[Figure: flowchart of data sources, Splunk, and Hadoop]

Hadoop collects the data and passes the results to Splunk for visualization

Large Cable Company:

The large cable company has many different Splunk use cases. One of them involves taking data from set-top boxes to gain insight into customer interaction with the content served by the set-top box. Each set-top box has a unique media access control (MAC) address that is associated with a specific customer. The set-top box captures all customer interaction with the device, including what content the customer searched for, the date of the search, what search results were displayed (this information is recorded as unique identifiers called IDA numbers), and what content was purchased.

Some of the use cases generated from this company's devices are caller ID, metadata distribution, STB menus, and menu entitlements.

Hadoop consumes this high volume of data from many systems. After collecting and refining the data into readable logs, the company imports the data from Hadoop into Splunk indexers. Splunk addresses Hadoop limitations such as the lack of visualizations and the need for data scientists and specialists to analyze data or write MapReduce code. Splunk allows the large cable company to visualize the data and expose it to many users. As a result, Splunk drives operational intelligence, improves user experience, supports troubleshooting and root-cause analysis, tracks and measures success, and generates reports and alarms based on the data collected by Hadoop.
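As a rough illustration of what "refining the data into readable logs" might look like, the Python sketch below renders one refined record as a single timestamp-first line of key=value pairs, a layout Splunk can extract fields from automatically at search time. The record shape and the field names (`mac`, `action`, `query`) are assumptions made for the example; the cable company's actual log format is not described here.

```python
from datetime import datetime, timezone


def format_value(value):
    """Quote values containing spaces, quotes, or '=' so each pair stays unambiguous."""
    text = str(value)
    if text == "" or any(ch in text for ch in ' "='):
        # Swap embedded double quotes for single quotes to keep the quoting simple.
        return '"' + text.replace('"', "'") + '"'
    return text


def to_log_line(record):
    """Render one refined record as a timestamp-first key=value log line.

    `record` is a dict with a timezone-aware datetime under "time";
    every other key becomes a key=value pair, in insertion order.
    """
    stamp = record["time"].astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    pairs = [
        f"{key}={format_value(value)}"
        for key, value in record.items()
        if key != "time"
    ]
    return " ".join([stamp] + pairs)


if __name__ == "__main__":
    # Hypothetical set-top box search event.
    example = {
        "time": datetime(2013, 2, 1, 10, 0, 0, tzinfo=timezone.utc),
        "mac": "00:1A:2B:3C:4D:5E",
        "action": "search",
        "query": "action movies",
    }
    print(to_log_line(example))
```

Putting the timestamp first and keeping one event per line is a design choice that makes event boundaries and times easy to recognize on import; how the lines actually reach the Splunk indexers is outside the scope of this sketch.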

[Figure: diagram of Hadoop data feeding Splunk visualizations and reports]

 

3 – Data flows in both directions

 

[Figure: diagram of Hadoop and Splunk integrated]

Splunk and Hadoop collect different artifacts and share the data that Hadoop needs for ETL or batch analytics and Splunk needs for real-time analysis and visualization

Online Travel Company:

Splunk monitors over 98% of this online travel company's infrastructure, which includes over 11,000 servers sending data to Splunk. Splunk is used for application monitoring, infrastructure management, and web analytics. Over 2,700 users at this organization use Splunk to gain real-time insight into not only their IT infrastructure but also online bookings, the performance of air-travel coupons, and SEM optimization. To handle an additional 90 TB of application transaction data per month, the online travel company integrated Splunk with Cassandra and Hadoop. This bi-directional integration allows both exporting events from Splunk to Hadoop and Cassandra for storage and importing data from these systems back into Splunk. With Splunk at the center of this deep integration, over 2,700 users have easy, visual access to massive amounts of data.

[Figure: flowchart of bidirectional data flow with Splunk]

 

4 – Side-by-Side

 

[Figure: diagram of Hadoop and Splunk running separately]

Both Splunk and Hadoop are used by the organization, but are used for different use cases and there is no integration

A Customer Relationship Management (CRM) Company:

A CRM company is using Splunk to gain insight into customer usage of its Collaboration Cloud, including Chatter. The company takes all relevant application and web logs for Chatter usage and indexes them in Splunk. Using these insights, the product managers gain visibility into every customer interaction on the site. They can work with operations to identify problems with specific features and take proactive action to resolve them. They can also understand customer usage of new product features and capabilities. Dashboards and graphs illustrate in real time how feature usage is trending relative to the baseline.

This CRM company is also using Hadoop for use cases like Extract, Transform, and Load (ETL) jobs and batch processing. However, there is currently no integration between the Hadoop jobs and the Splunk environment.

 

5 – Splunk monitors Hadoop

 

[Figure: diagram of Splunk monitoring Hadoop operations]

Splunk's ability to monitor IT infrastructure is extended to Hadoop

A Specialized Online Resource for Automotive Information Company:

The automotive information company is using Splunk for operations management, security, and application troubleshooting. The data sources it ingests include HTTP logs, Apache, WebLogic, F5, syslog-ng, NFS infrastructure, Netscreen, Sourcefire (IPS), Cisco, Access Control Systems, Oracle RDBMS, and Hadoop. Splunk's Hadoop monitoring gives this company a single interface to search, monitor, and analyze the full Hadoop environment, including cluster resources beyond Hadoop itself, such as the network, operating system, and database.

----------------------------------------------------
Thanks!
Raanan Dagan

Splunk
Posted by Splunk