Shuttl is being featured at Splunk’s Worldwide Users’ Conference 2012. I’ve talked about the benefits of Shuttl for efficiently and scalably bulk-moving Splunk data to HDFS for archiving in a past blog post announcing its availability; here I’ll expand on how it enables the emerging theme of Big Data Integration.
Big Data Integration
In the big data space, the diversity of technologies is not only huge, but fast changing. Every time I hear about a new technology, the first thing I think of is, “How will it integrate with other data technologies?”
Although much of the discussion about big data concerns volume, latency, scalability, availability, consistency, flexibility, and so on, it seems that only once real projects are underway do people focus in earnest on how to stitch all these new technologies together, or, more often than not, how to stitch the new technologies to the old.
Good ol’ CSV
This isn’t a new problem of course. Many data integration categories exist, such as EII, EAI, and most prominently, ETL. In addition, many standards for data formats have come and gone in the decades since databases first came into use, and through it all, it’s amazing how one format endures. That format is CSV.
Splunk is no different. For years, Splunk has supported CSV exports of data via the “exporttool,” and it turns out that this opens up the data for many other uses in the era of Big Data. That’s why Shuttl currently supports two different export formats:
- Splunk Bucket – Native Indexed and Splunk-Searchable Data.
- Splunk Interchange Format (SpIF) – Annotated Raw Event Data as CSV.
The former is optimized to restore data into a Splunk instance for analysis, and the latter is a generic export format that other systems can consume (both new generation and legacy).
(Note: I’m trying out the term “SpIF” for size; it’s not an official part of the Splexicon, so let me know what you think of it.)
Splunk Data Unleashed
This modest feature lets Splunk users unlock data stored in Splunk for a plethora of additional use cases, since any data system that can read CSV can now consume Splunked data for its own needs.
For instance, in the Hadoop world, a prominent data-warehousing tool is Hive. Simply by creating a SQL table backed by the CSV files Splunk has exported, you open the data Splunk collects to a whole new class of use cases, such as a warehouse that analyzes Splunk-collected data alongside exports from other systems like OLTP databases and web applications. You can also use ETL tools such as Pentaho’s Kettle to transform and merge the datasets for traditional OLAP analytics.
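To make that cross-system join concrete, here is a minimal sketch in plain Python, standing in for what a Hive table or a Kettle job would do at scale. The file contents, field names (beyond the SpIF columns), and values are all invented for the example:

```python
import csv
import io

# Hypothetical sample of a Splunk CSV export (SpIF-style columns; the
# events themselves are made up for illustration).
splunk_csv = io.StringIO(
    "_time,source,host,sourcetype,_raw\n"
    "1346790964,/var/log/app.log,web01,access_combined,GET /checkout 200\n"
    "1346790970,/var/log/app.log,web02,access_combined,GET /cart 500\n"
)

# Hypothetical export from another system, e.g. an OLTP inventory table.
inventory_csv = io.StringIO(
    "host,datacenter\n"
    "web01,us-east\n"
    "web02,eu-west\n"
)

# Build a lookup from the OLTP export, then enrich each Splunk event with
# it: the same kind of join a warehouse query would express in SQL.
datacenter_by_host = {row["host"]: row["datacenter"]
                      for row in csv.DictReader(inventory_csv)}

enriched = [dict(row, datacenter=datacenter_by_host.get(row["host"], "unknown"))
            for row in csv.DictReader(splunk_csv)]

for event in enriched:
    print(event["host"], event["datacenter"], event["_raw"])
```

Once the data is sitting in CSV, this kind of merge needs no Splunk-specific tooling at all, which is precisely the point.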
The way forward in big data turns out not to be throwing out all the old, for the sake of the new, but rather by synthesizing the legacy systems and techniques (ETL, CSV), with the new breed of big data technologies, such as Splunk and Hadoop.
Value Added Data
In addition, the value of the data indexed by Splunk and exported as SpIF is not just the data itself. Any Splunk user knows that Splunk processes the data to make it useful for analysis. This includes:
- Event Separation
- Timestamp Extraction
- Auditing Information
- Source Information
- Sourcetype Information
- …and more I won’t list here
Several other bits of information are automatically extracted during the indexing process; these are just the most commonly known. Much of this exists only in the native Splunk Bucket format and is not accessible other than via Search (aka SPL), but plenty survives in SpIF and proves to be highly useful!
Anatomy of SpIF
Let’s dive into some specifics. SpIF has the following fields:
- _time – The timestamp of the event, in seconds since 1/1/1970 UTC
- source – the name of the file, stream, or other input from which a particular event originates
- host – An event’s host value is typically the hostname, IP address, or fully qualified domain name of the network host from which the event originated
- sourcetype – The format of the event. Splunk uses this field to determine how to process the incoming data stream into individual events
- _raw – The raw text of the event
- _meta – A space-separated list of metadata for an event
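Because SpIF is just CSV with these six columns, any CSV reader can consume it directly. Here’s a minimal sketch using Python’s standard csv module; the record below is invented, with the field names taken from the list above:

```python
import csv
import io
from datetime import datetime, timezone

# One hypothetical SpIF-style CSV record, using the six field names above.
spif_csv = io.StringIO(
    "_time,source,host,sourcetype,_raw,_meta\n"
    '1346790964,/var/log/app.log,web01,syslog,'
    '"Sep  4 20:36:04 web01 sshd[123]: session opened",'
    '"_indextime::1346790964 date_wday::tuesday"\n'
)

events = list(csv.DictReader(spif_csv))
for event in events:
    # _time is seconds since 1/1/1970 UTC, so it converts directly.
    event["timestamp"] = datetime.fromtimestamp(int(event["_time"]),
                                                tz=timezone.utc)
    print(event["timestamp"].isoformat(), event["host"], event["_raw"])
```

Note that the timestamp arrives pre-extracted in `_time`: the consumer never has to guess at the date format buried in `_raw`.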
The _meta field is a topic of its own, but here’s an example of its contents:
"_indextime::1346790964 timestartpos::7 timeendpos::16 date_second::50 date_hour::18 date_minute::40 date_year::2012 date_month::may date_mday::10 date_wday::thursday date_zone::local punct::""___::_..._:_=\""//_:_\"",=\"".\"",=\"""""
You can see from this little snippet, there’s a rich set of metadata (in addition to standard fields) attached to each event that solves significant problems people encounter when they try to analyze raw log files directly without Splunk. All the log files have been normalized and annotated into a format that allows valuable analysis to occur without layer upon layer of regex and parsing hacks.
More info on these fields can be found here:
Big Data Big Integration
The world of Big Data is not just about data size, and not even about the data itself. It’s equally about the diversity of technologies being brought to bear on the problem, and how those technologies can work in conjunction to provide end-value. Splunk has a huge fan base for operational data analytics, and enterprises can now extend that data to other lines of business, potentially providing 360-degree views into the entire enterprise.
Shuttl is but a minuscule step in the right direction (along with the upcoming Hadoop Connect product), but it’s illustrative of the emergence of Big Data Integration tools to meet a rising challenge as Big Data hits the mainstream.
SpIF output is highly useful, but not optimal for every need. There’s no reason other export formats, such as Avro, XML, or JSON, can’t be supported in Shuttl, and compressed formats could be as well. Which would be most interesting to you? We’d like to hear. Email the Shuttl dev team at shuttl-dev at splunk.com.
For more information on Shuttl, see the project on GitHub: https://github.com/splunk/splunk-shuttl