Calling all data hoarders: Splunk is collecting lots of data for you. What are you doing with it all? Some of you are letting it “roll” to oblivion: “Who needs it? It’s just taking up space!” Some are keeping it all in the live system: “I want all my data, all the time!” Many are rolling it out to another backend system for storage: “I might need this later!”
If the last one is your response, then congratulations, you chose wisely. Here are four reasons you really want to keep that data somewhere, just not in your live system:
- Data value is perishable, and keeping unnecessary data on the live system will slow you down.
- Deleting data limits your ability to derive value from it later.
- Different teams want to have access to the same data but for different purposes.
- Data needs to be retained for compliance, but also for forensic analysis.
Let’s explore this in more detail. (Note: if you are unfamiliar with how Splunk does archiving, please see “Archive Indexed Data” in the Splunk Docs.)
Splunk excels in analytics for real-time and near-time data. When you need to know now, the right solution to keep up with high velocity data streams is Splunk.
You do not want to slow down or hinder that processing and analysis by accumulating older data on your live Splunk instance. If it’s not needed, it’s only going to get in the way. A few people running an all-time search (just because they were too lazy to restrict the time range) can impact users who are relying on the responsiveness of the system.
So, if you don’t need it, move it!
If you delete your data, you are missing out on what that data can do for you. I am probably preaching to the choir, but value can be derived from large amounts of historical data by subsequent analysis. It’s the gift that keeps giving.
The key is how much you want to pay to keep it around. Think of family photos: some can go on the coffee table, but the rest are fine in the attic for later generations to appreciate (or whenever the desire arises). The same goes for corporate data. You want to keep only selected data readily available, with the bulk of it stored out of the way but still accessible.
One team may be collecting the data, but things should not stop there. Companies now realize that their data assets are not necessarily useful just for that specific team, but can also be extremely useful across other parts of the enterprise.
You can see this in Splunk’s bread and butter: log files. Originally only the concern of systems administrators, now companies use that log data for many things, including web site optimization, recommendation systems, capacity planning, customer retention, etc.
When you archive data, you not only keep it for the team that collected it, but also to multiply the value of that data via different teams deriving new insights from it. The data can be used for things that the original people collecting it never envisioned.
Compliance and Forensics/Auditing
Sometimes you need to do what “The Man” says. Often, you can’t delete the data, and must retain it for set periods. In this case, there’s no way to avoid it, and what you want is the most economical way to do this. However, as described earlier in this article, this is one of the cases where “The Man” might be helping you: the data being retained may well translate to new value.
In addition to the things mentioned above, it’s useful to retain your data so you can figure out what has happened at a given time. This is a different class of analytics: investigating an event in the past and using the data to determine what occurred.
But It’s Hard!
As mentioned in a past blog post, you always need to take into account the perishability of your data (see: http://blogs.splunk.com/2011/12/14/data-best-used-by). People using Splunk will define their archiving process based on that consideration. However, it’s not easy! You can set configuration files to roll buckets to “frozen” by size and time, but how the data is handled after that is entirely up to you. Let’s consider some of the pitfalls:
One is data transfer: it is up to you to transfer the data to a new location. When people implement archiving solutions, this is where much of the effort goes. However, one fact is sometimes overlooked: if the copy fails, guess what happens to the original data on a cold-to-frozen roll event? The data is deleted. The data goes poof.
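To make the data-loss risk concrete, here is a minimal sketch of a copy step that refuses to report success unless the archived copy checks out. It assumes the script is handed the bucket directory and an archive root as arguments; the names `archive_bucket`, `main`, and the `.partial` staging suffix are mine, purely for illustration, and how your Splunk version invokes archiving scripts and reacts to their exit codes should be checked against its documentation.

```python
import shutil
import sys
from pathlib import Path


def _file_sizes(root: Path) -> dict:
    """Map each file's path (relative to root) to its size in bytes."""
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def archive_bucket(bucket_dir: str, archive_root: str) -> Path:
    """Copy a bucket under archive_root; raise if the copy cannot be verified."""
    src = Path(bucket_dir)
    final = Path(archive_root) / src.name
    staging = Path(archive_root) / (src.name + ".partial")  # hypothetical suffix

    if final.exists():
        # Never overwrite an earlier archive silently.
        raise FileExistsError(f"{final} already exists")
    if staging.exists():
        shutil.rmtree(staging)  # leftover from an earlier failed attempt

    Path(archive_root).mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, staging)

    # Cheap verification: same files, same sizes. Checksums would be stricter.
    if _file_sizes(src) != _file_sizes(staging):
        raise IOError(f"archive copy of {src} does not match the source")

    # Only a fully verified copy gets the final name.
    staging.rename(final)
    return final


def main(argv: list) -> int:
    """Return 0 only when the bucket is safely archived, non-zero otherwise."""
    if len(argv) != 3:
        print("usage: archive.py <bucket_dir> <archive_root>", file=sys.stderr)
        return 2
    try:
        archive_bucket(argv[1], argv[2])
    except Exception as exc:
        print(f"archive failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

Wired up as a script (for example with `sys.exit(main(sys.argv))`), the point of the design is that a half-written copy never carries the final name and a failure is never reported as success.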
Another is taxonomy and organization: when you put the data at the destination, how do you organize it? Everyone has a different scheme, and if you choose poorly, it can be a management headache to sort out.
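As one illustration of a scheme (not a recommendation), the sketch below lays buckets out by index, year, and month. It assumes bucket directories are named `db_<newest_time>_<oldest_time>_<id>` with epoch-second timestamps, and the `<index>/<YYYY>/<MM>/` layout and the `archive_path` function are hypothetical choices of mine.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def archive_path(index: str, bucket_name: str) -> PurePosixPath:
    """Return <index>/<YYYY>/<MM>/<bucket_name>, keyed on the bucket's oldest event."""
    parts = bucket_name.split("_")
    if len(parts) < 4 or parts[0] != "db":
        raise ValueError(f"unexpected bucket name: {bucket_name!r}")
    # parts[1] is the newest event time, parts[2] the oldest (assumed format).
    oldest = datetime.fromtimestamp(int(parts[2]), tz=timezone.utc)
    return PurePosixPath(index) / f"{oldest:%Y}" / f"{oldest:%m}" / bucket_name


# Example: a bucket whose oldest event is from 14 December 2011 (UTC).
print(archive_path("main", "db_1325376000_1323820800_7"))
# main/2011/12/db_1325376000_1323820800_7
```

Whatever scheme you pick, deriving the path from the bucket name alone means the same function can later tell you where to look.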
In addition, when you want to bring the data back into the thawed state, you face two problems: first, how to find what you want, and second, how to get rid of it once you are done. Both involve similarly manual steps. Everything is left as an exercise for you.
The final thing is that there are many options for where to archive the data. Some will use attached storage, some will use HDFS, some will use S3, some will use scp operations to another device, etc. Scripts are often custom built for a specific IT environment, so a script used by one customer may not fit your needs, and a change in archive destination requires rewriting the script.
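For the "find what you want" half, one possible approach is to pick archived buckets by time range. The sketch below again assumes bucket names of the form `db_<newest_time>_<oldest_time>_<id>` in epoch seconds; `bucket_time_range` and `buckets_to_thaw` are hypothetical helpers, not part of Splunk.

```python
from typing import Iterable, List, Tuple


def bucket_time_range(bucket_name: str) -> Tuple[int, int]:
    """Return (oldest, newest) epoch seconds parsed from a bucket directory name."""
    parts = bucket_name.split("_")
    if len(parts) < 4 or parts[0] != "db":
        raise ValueError(f"unexpected bucket name: {bucket_name!r}")
    newest, oldest = int(parts[1]), int(parts[2])
    return oldest, newest


def buckets_to_thaw(bucket_names: Iterable[str], start: int, end: int) -> List[str]:
    """Select buckets whose event time range overlaps the window [start, end]."""
    selected = []
    for name in bucket_names:
        oldest, newest = bucket_time_range(name)
        # Two ranges overlap when each one starts before the other ends.
        if oldest <= end and newest >= start:
            selected.append(name)
    return selected


archived = ["db_200_100_1", "db_400_300_2", "db_600_500_3"]
print(buckets_to_thaw(archived, 250, 450))
# ['db_400_300_2']
```

Keeping the returned list around is also one way to know exactly which thawed buckets to remove when the investigation is over.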
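One way to avoid rewriting the script for every destination is to hide the destination behind a small interface. This is only a sketch under my own assumptions: `ArchiveBackend`, `LocalDirBackend`, and `archive` are hypothetical names, and only a local-directory backend is shown; an HDFS or S3 backend would implement the same two methods with its own client library.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class ArchiveBackend(ABC):
    """Minimal contract every archive destination has to satisfy."""

    @abstractmethod
    def put(self, bucket_dir: Path, key: str) -> None:
        """Store the bucket directory under the given key."""

    @abstractmethod
    def exists(self, key: str) -> bool:
        """Report whether a bucket is stored under the given key."""


class LocalDirBackend(ArchiveBackend):
    """Backend for attached storage: keys are paths below a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def put(self, bucket_dir: Path, key: str) -> None:
        dest = self.root / key
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(bucket_dir, dest)

    def exists(self, key: str) -> bool:
        return (self.root / key).is_dir()


def archive(bucket_dir: Path, key: str, backend: ArchiveBackend) -> None:
    """Archive through whichever backend is configured, then confirm it landed."""
    backend.put(bucket_dir, key)
    if not backend.exists(key):
        raise IOError(f"backend did not confirm {key}")
```

With this shape, the archiving logic calls `archive(...)` and never needs to know which storage system sits behind it; switching destinations means swapping the backend object.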
To summarize, some common pitfalls are:
- Data loss (a write failure can cause data to be deleted without being archived)
- Haphazard organization of archived data
- No pluggable backend support
- No easy way to search what has been archived
- No selective “thaw” of frozen buckets
- No easy way to flush thawed buckets once you are done with them
The challenges can be daunting or, at the very least, less than confidence-inspiring. With so many people scripting their own thing to solve this problem, can there be a widely available solution so that everyone can handle archiving in a reliable, flexible, and open manner? This problem inspired a new solution, which I’ll cover in my next blog post.
To be continued.