Data, Best Used By…

To state the obvious, “Big Data” is big. The deluge of data has people talking about volume, which is understandable, but far less attention has been paid to how the value of data ages. Value is not just about volume. Data can also be thought of as perishable.


When we think about the perishability of data, we find all sorts of everyday examples around us. When we pick up a daily newspaper, the headlines catch our attention. The value is in the recency of the data; that is why we call it “news.” A year-old newspaper, in comparison, is usually useful only for starting a fire or lining a bird cage. That value is baked into the price. News websites have a similar issue. How often do you get recommended a related article, click it, and realize that it’s months old and the situation it covered has changed dramatically? The information still has value, but the value has diminished. In news, the value comes from being new. Similarly, even for general web search, Google has changed its search algorithms to weight recency more heavily.

Lost value can be made up for by volume, which is why, if data is old (historical), the “bigger” it is, the better. From that volume, you derive new value. It’s why historians spend so much time in libraries and archives, to continue the example. However, that’s not to say that the value derived from old data and the value derived from new data are one and the same. They serve different purposes, and the combination of the two is the best of both worlds. The most interesting news articles are often those that set current events in the context of history. Similarly, history books are often prized for their timeliness to current events. Different types of value can be derived from data, based on recency.
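To make the trade-off between recency and volume concrete, here is a minimal sketch of a toy model. It assumes, purely for illustration, that a single record’s value halves after a fixed half-life and that the values of records simply add up. The function names and the one-day half-life are hypothetical, not something taken from Splunk or Hadoop.

```python
def record_value(age_days: float, half_life_days: float = 1.0) -> float:
    """Toy assumption: a record is worth 1.0 when new and halves every half-life."""
    return 0.5 ** (age_days / half_life_days)


def total_value(ages_days, half_life_days: float = 1.0) -> float:
    """Toy assumption: the value of a data set is the sum of its records' values."""
    return sum(record_value(age, half_life_days) for age in ages_days)


fresh = [0.0] * 10    # ten records that just arrived
archive = [3.0] * 80  # eighty records, each three days old

fresh_total = total_value(fresh)      # 10 records * 1.0 each
archive_total = total_value(archive)  # 80 records * 0.125 each

print(fresh_total, archive_total)
```

Under these assumptions, eighty three-day-old records carry the same total as ten fresh ones, which is the sense in which old data has to be “bigger” to compete. The model collapses value into a single number, so it says nothing about the different purposes that old and new data serve.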

That is the real situation companies are facing. The danger is building out solutions that don’t take perishability into account.

Splunk + Hadoop

That’s why the marriage of Splunk and Hadoop makes for such a powerful combination. Hadoop’s architecture is fundamentally batch-oriented, meant for high-throughput, high-latency operations. Splunk is pipeline-oriented and designed for high throughput and low latency. That explains why Splunk has such a fanbase in the datacenter: it excels at processing parallel streams of data at high velocity. Ops people need to know. When? Now!

Operational data is highly perishable.

Latency is a trade-off with Hadoop. People use it for other reasons, whether storage, flexibility, scale, etc., and it is great where perishability is not an issue. It’s one reason why there’s often so much attention at data conferences on reducing latency in Hadoop. There are solutions for making Hadoop “real-time”, and indeed there seems to be an emerging industry doing just that, but the ability to do so is often a reflection of people’s skill and brainpower in using (or working around) the technology, rather than something intrinsic to the technology itself.

Successful Deployments

Rather than working around Hadoop’s limitations, there’s an obvious opportunity to mesh Splunk technology into big data deployments to handle all aspects of data perishability. Splunk has a proven track record of gathering unstructured time-series data from logs and APIs, and of performing real-time analytics and complex event processing on large streams. On top of that, the UI is optimized for rapid-fire iterative data analysis over real-time, near-time, and historical data.

Shep is the seamless integration for data flow between Splunk and Hadoop, which allows people to use the right tool for the right job at all stages of data processing. Rather than every problem looking like a nail to be whacked by a Hadoop hammer, you can in fact have at hand a complete solution that takes data perishability into account.
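As a rough illustration of “the right tool for the right job,” the sketch below picks a processing tier for an event based on its age. Everything in it is hypothetical: the `choose_tier` function, the tier names, and the one-hour cutoff are assumptions made for this example and are not part of Shep, Splunk, or Hadoop.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cutoff: events younger than this are treated as perishable.
HOT_WINDOW = timedelta(hours=1)


def choose_tier(event_time: datetime, now: datetime,
                hot_window: timedelta = HOT_WINDOW) -> str:
    """Return "low-latency" for fresh events and "batch" for older ones."""
    age = now - event_time
    return "low-latency" if age <= hot_window else "batch"


now = datetime(2012, 1, 1, 12, 0, tzinfo=timezone.utc)
recent_event = now - timedelta(minutes=5)
old_event = now - timedelta(days=30)

print(choose_tier(recent_event, now))
print(choose_tier(old_event, now))
```

A real deployment would likely send the same data to both tiers at different points in its life rather than choosing just one; the point here is only that the age of the data drives the decision.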

Stay tuned!

Splunk Hadoop Beta Program Sign-up:

Posted by Boris Chen