As I mentioned in my last blog, archiving for big data is important. If you haven’t already, please read it before going on. If you have already read it, read it again. It’s important.
Are you back? OK.
Now, as I mentioned, archiving has pitfalls and challenges, and people typically build custom scripted solutions for it. Here’s a recap of what a good archiving solution needs to handle:
- Data loss
- Organizing data
- Pluggable backend support
- Search for what’s been archived
- Selective “thaw” of frozen buckets
- Flushing of thawed buckets
So, how do you meet these challenges? I’m so glad you asked.
Shuttl is an open source project that works with Splunk to reliably archive your Splunk data to another system and restore it on demand. Let’s go through how it addresses each of the challenges mentioned above:
1 – Reliability – As I mentioned, you can lose data if your archiving script fails to transfer the data to the storage destination, since Splunk deletes the bucket when it rolls to frozen. Shuttl avoids this by first doing a local move to a temp location; if the transfer fails, you can restart it from that temp copy. The temp copy is deleted only when the transfer succeeds. And because the transfer is asynchronous to Splunk’s indexing, your indexing is not affected if the archiving back end becomes unavailable or slow.
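To make the pattern concrete, here’s a minimal sketch of that move-then-transfer-then-delete sequence. This is illustrative, not Shuttl’s actual code: the `Backend` interface and method names are hypothetical stand-ins for whatever storage system you transfer to.

```java
import java.io.IOException;
import java.nio.file.*;

public class SafeArchiveTransfer {

    // Hypothetical backend interface -- stands in for HDFS, S3, etc.
    interface Backend {
        void put(Path file) throws IOException;
    }

    public static void archive(Path bucket, Path tempDir, Backend backend)
            throws IOException {
        // 1. Move locally first: cheap, and safe even though Splunk
        //    deletes the original bucket when it rolls to frozen.
        Path staged = tempDir.resolve(bucket.getFileName());
        Files.move(bucket, staged, StandardCopyOption.REPLACE_EXISTING);

        // 2. Transfer to the archive backend. If this throws, the
        //    staged copy survives and the transfer can be restarted.
        backend.put(staged);

        // 3. Delete the staged copy only after a successful transfer.
        Files.delete(staged);
    }
}
```

The key property: at no point does the bucket exist only in a place it can be silently lost from, and a failed transfer leaves the staged copy intact for retry.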
2 – Organization – Shuttl organizes all your Splunk buckets into a human-navigable structure that mirrors your deployment hierarchy. The benefit here is that you can trace which index a bucket belongs to, which node in a distributed deployment (termed a “cluster”) generated it, and which cluster it came from. This organization makes it easy to reload by index, by server, or by cluster, and it also lets you set up mirror instances dedicated to analyzing historical data.
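As a rough illustration of what “organized by deployment hierarchy” means, a layout like the one sketched below files each bucket under cluster, then indexer, then index. The exact path scheme here is hypothetical, not Shuttl’s actual layout.

```java
public class ArchivePaths {
    // Illustrative path scheme (not Shuttl's exact layout): nesting
    // cluster -> indexer -> index -> bucket means the archive can be
    // browsed, or selectively restored, at any level of the hierarchy.
    public static String bucketPath(String root, String cluster,
                                    String indexer, String index,
                                    String bucket) {
        return String.join("/", root, cluster, indexer, index, bucket);
    }
}
```

With a scheme like this, restoring everything for one index is just a matter of walking one directory subtree.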
3 – Pluggability – Many people archive to attached storage, but emerging technologies like HDFS and S3 are increasingly used for archiving, and more solutions will exist in the future. Shuttl abstracts the storage back end and lets you add new implementations as needed.
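The shape of that abstraction might look something like the sketch below: archiving code talks to one small interface, and each storage system supplies its own implementation behind it. The interface and class names are hypothetical, not Shuttl’s real API.

```java
import java.util.*;

public class PluggableBackends {

    // Hypothetical backend contract: a real HDFS or S3 backend would
    // implement these same three calls against its own storage API.
    interface ArchiveBackend {
        void put(String localPath, String archivePath);
        List<String> list(String prefix);
        void get(String archivePath, String localPath);
    }

    // In-memory stand-in, handy for tests and for showing the idea.
    static class InMemoryBackend implements ArchiveBackend {
        private final Map<String, String> store = new TreeMap<>();

        public void put(String localPath, String archivePath) {
            store.put(archivePath, localPath);
        }

        public List<String> list(String prefix) {
            List<String> out = new ArrayList<>();
            for (String key : store.keySet())
                if (key.startsWith(prefix)) out.add(key);
            return out;
        }

        public void get(String archivePath, String localPath) {
            // A real backend would copy the archived data back locally.
        }
    }
}
```

Because the archiving logic only ever sees `ArchiveBackend`, swapping attached storage for HDFS or S3 is a matter of writing one new implementation, not rewriting the pipeline.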
4 – Retrieval – A big archiving challenge is restoring what’s been archived: you have to find what you want and selectively restore it, and once you’re done using it, you want to discard the local copy (it’s already in the archive, so there’s no need to transfer it back). Shuttl does all of these things. It lets you find archived buckets, select what you want, use it, and discard it when it’s no longer needed. You can do this on the same live production instance, or on a mirror instance used just for analysis.
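The find/thaw/flush lifecycle described above can be sketched as follows. The class and method names are illustrative, not Shuttl’s API; the point is that flushing removes only the local thawed copy, while the archived copy is untouched.

```java
import java.util.*;

public class ThawFlushSketch {
    private final Set<String> archived = new TreeSet<>(); // what the archive holds
    private final Set<String> thawed = new TreeSet<>();   // local, searchable copies

    public void recordArchived(String bucket) { archived.add(bucket); }

    // Search step: find what has been archived.
    public List<String> listArchived() { return new ArrayList<>(archived); }

    // Selective restore: copy a chosen bucket back locally.
    public void thaw(String bucket) {
        if (archived.contains(bucket)) thawed.add(bucket);
    }

    // Discard step: drop the local copy only; the archive still has it,
    // so there is no need to transfer it back.
    public void flush(String bucket) { thawed.remove(bucket); }

    public boolean isThawed(String bucket) { return thawed.contains(bucket); }
}
```

The asymmetry is the whole point: thawing is a copy, flushing is a local delete, and the archive is the durable source of truth throughout.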
Bonus – Multiple File Formats
In addition, Shuttl supports two formats for archiving: CSV and native Splunk bucket.
Splunk bucket – This is the native binary format Splunk uses for persisted data. It can be moved back and forth seamlessly, and it contains both the compressed raw data and the index files that enable search. Since it is the native format, restores are very fast with little compute overhead. Though this is the optimal format for Splunk, it does not lend itself to data interchange or downstream use by other data systems.
CSV – This is the Splunk raw data in CSV format, without the index files that enable search. This lets other data systems reuse the raw data for subsequent analysis. When in this format, it can STILL be restored to a Splunk index (without incurring a charge against your daily indexing quota). The only drawback is that the Splunk index must be regenerated, which costs compute cycles before the restored data becomes searchable.
You can configure Shuttl to archive in either format. (We’re working on supporting both at once.)
But wait, there’s more!
And on top of this, Shuttl is open source under the Apache License, so you can modify it for your own purposes in accordance with the license or, better yet, contribute back to the project to make Shuttl even better.
One anticipated Shuttl use case is archiving data from Splunk in HDFS. For many customers, being able to move data to HDFS and still have search access to it is a critical requirement.
Shuttl is packaged as a Splunk “app,” which means Splunk starts the Shuttl process that does all the coordination between Splunk and HDFS (and shuts it down upon exit). The process handles both shuttling data from Splunk to HDFS and shuttling it back. (Yes, hence the name.)
To use Shuttl, you should be familiar with how Splunk archiving works, as well as how HDFS works. If you want to work on the code, you should have Java skillz. The code has extensive unit tests, but as with all open source code (or all code period!), test, test, test to make sure it works for you. The paint is still drying on this, so beware.
Acknowledgements and Availability
Shuttl is a project primarily developed by Petter Eriksson (@petterik_) and Emre Berge Ergenekon (@emreberge), with additional contributions by Anton Jonsson, André Eriksson, Kiru Pakkirisamy, Allan Yan, and others. (Thanks to Rachel Perkins for editing this blog.) The core archiving functionality is part of the master’s thesis project Petter and Emre are doing at The Royal Institute of Technology in Stockholm, Sweden.
The code is currently “community supported.” Please don’t call Splunk Support to ask about it. This is not an official product; it is simply an open source project that we want to share. You can pull the code from GitHub, or you can download the app on Splunkbase. If you try it out, email us at shuttl-dev at splunk dot com to ask questions, give comments, or contact any of the authors. (Or email me at boris at splunk dot com with “shuttl” in the subject line.)