One of the questions I am often asked is how storage compares between Splunk Enterprise and Hadoop when using Hunk archiving. Customers are trying to drive down TCO by storing historical data in Hadoop, since it can run on low-cost commodity hardware. Hunk provides a simple mechanism to archive data from Splunk Enterprise into HDFS. Any data in warm, cold or frozen buckets can be archived and offloaded from Splunk instead of being deleted. The best part of the archiving functionality is that as soon as the data is copied over to Hadoop, it is available for searching from Hunk straight away, using the same SPL you know and love. Here is a great blog post describing the technical details of how this works – Splunk archiving with Hunk
Step 1 – Stand up some infrastructure
To do this test I needed three things – a Splunk Enterprise cluster, a Hadoop cluster and data! AWS was my infrastructure of choice for this test. This certainly made it faster to get the clusters up and running.
Step 2 – Generate some data!!!!!
To generate the data I used SplunkIt (https://splunkbase.splunk.com/app/749/), which is a performance benchmark kit designed to provide a simplified set of performance measurements for Splunk. One of its convenient features is a generate-data function that produces a 50GB syslog file. I ran this twice to generate 100GB of data. Whilst the data is not perfect and has relatively low cardinality, it certainly meets the need for this test. What does this data look like?
Step 3 – Index it into Splunk Enterprise
I could have just added this into the previous section but I wanted to call out the amazing new indexer discovery feature we added to the forwarders in 6.3. I just needed to add a section to the server.conf in the cluster master to let it know that I want to use indexer discovery.
[indexer_discovery]
pass4SymmKey = my_secret
Then add the following part to outputs.conf on the forwarders
[indexer_discovery:master1]
pass4SymmKey = my_secret
master_uri = https://10.152.31.202:8089

[tcpout:group1]
autoLBFrequency = 30
forceTimebasedAutoLB = true
indexerDiscovery = master1

[tcpout]
defaultGroup = group1
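The outputs.conf settings only work if the names line up: `defaultGroup` must point at a `[tcpout:<group>]` stanza, and that stanza's `indexerDiscovery` value must match an `[indexer_discovery:<name>]` stanza. Here is a minimal Python sketch that checks this chain. It assumes the settings sit under the stanzas `[indexer_discovery:master1]`, `[tcpout:group1]` and `[tcpout]`, and the helper name `check_outputs_conf` is hypothetical. It is not a Splunk tool, just a quick sanity check you could adapt.

```python
import configparser

# Example outputs.conf content. The stanza layout is an assumption based on
# the setting names used in this post.
OUTPUTS_CONF = """
[indexer_discovery:master1]
pass4SymmKey = my_secret
master_uri = https://10.152.31.202:8089

[tcpout:group1]
autoLBFrequency = 30
forceTimebasedAutoLB = true
indexerDiscovery = master1

[tcpout]
defaultGroup = group1
"""


def check_outputs_conf(text):
    """Return a list of problems found in an outputs.conf-style string."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep setting names case-sensitive
    parser.read_string(text)

    # [tcpout] must name a default target group.
    group = parser.get("tcpout", "defaultGroup", fallback=None)
    if group is None:
        return ["no defaultGroup in [tcpout]"]

    # The target group must exist and reference an indexer discovery name.
    group_stanza = f"tcpout:{group}"
    if not parser.has_section(group_stanza):
        return [f"missing stanza [{group_stanza}]"]
    name = parser.get(group_stanza, "indexerDiscovery", fallback=None)
    if name is None:
        return [f"no indexerDiscovery in [{group_stanza}]"]

    # The discovery stanza must exist and carry the key and master URI.
    discovery_stanza = f"indexer_discovery:{name}"
    if not parser.has_section(discovery_stanza):
        return [f"missing stanza [{discovery_stanza}]"]
    return [
        f"no {key} in [{discovery_stanza}]"
        for key in ("pass4SymmKey", "master_uri")
        if not parser.has_option(discovery_stanza, key)
    ]


print(check_outputs_conf(OUTPUTS_CONF) or "outputs.conf looks consistent")
```

An empty list means the three stanzas reference each other correctly. It says nothing about whether the key or the URI are actually valid for your cluster master.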
Hey presto! My forwarder is made aware of the 5 Splunk peers in my cluster and is ready to load all of the data into Splunk. After a little while (and after remembering to set maxKBps to 0 in limits.conf) I ended up with 343,997,339 events in my index foo in my Splunk cluster.
What does this look like on disk space for my index foo?
- Peer 1 – 15GB
- Peer 2 – 16GB
- Peer 3 – 23GB
- Peer 4 – 33GB
- Peer 5 – 33GB
Total size of foo index on disk across all indexers = 120GB
How does this compare to the estimate for 100GB of data per day? Well, our Splunk sizing estimator came out at 110GB of storage for this scenario, so pretty close (https://splunk-sizing.appspot.com/#ar=0&c=1&cr=0&hwr=1&i=5&v=100)
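To make the comparison concrete, here is a small Python sketch of the arithmetic, using only the numbers quoted above (per-peer sizes, 100GB of raw data, and the 110GB estimate). The variable names are mine, and the figures are rounded to whole GB as reported, so treat the percentages as approximate.

```python
# Per-peer disk usage for index foo, in GB, as listed above.
peer_gb = {"Peer 1": 15, "Peer 2": 16, "Peer 3": 23, "Peer 4": 33, "Peer 5": 33}

RAW_GB = 100       # raw syslog data generated with SplunkIt
ESTIMATE_GB = 110  # figure from the sizing estimator

total_gb = sum(peer_gb.values())

# How much bigger the index is on disk than the raw data and the estimate.
ratio_to_raw = total_gb / RAW_GB
over_estimate_pct = (total_gb - ESTIMATE_GB) / ESTIMATE_GB * 100

print(f"Total on disk: {total_gb}GB")
print(f"Ratio to raw data: {ratio_to_raw:.2f}x")
print(f"Over the estimate by: {over_estimate_pct:.1f}%")

# Share of the index held by each peer, to show how evenly data was spread.
for peer, size in peer_gb.items():
    print(f"{peer}: {size / total_gb:.1%} of the index")
```

The per-peer shares also show that the data was not spread evenly across the five indexers in this run.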
Step 4 – Over to Hadoop
Next step – set up the archiving. Hunk brings a great feature that lets you select an index and archive it to Hadoop. I set this up to send all of the data over to Hadoop. Taking into account the replication factor of 3, the archived data from Splunk takes up 28GB in total disk space – quite a good saving in space: 120GB down to 28GB.
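A quick Python sketch of what that saving works out to. It assumes the 28GB figure already includes all three replicas of the archived data (that is how I read "taking into account the replication factor of 3"), so the size of a single copy is derived, not measured.

```python
SPLUNK_GB = 120        # index foo on disk across all indexers
ARCHIVE_TOTAL_GB = 28  # archived data in Hadoop, all replicas included (assumption)
REPLICATION = 3        # replication factor quoted for the archive

# Percentage of disk space saved by the archive compared with the index.
saving_pct = (1 - ARCHIVE_TOTAL_GB / SPLUNK_GB) * 100

# Approximate size of one copy of the archived data.
single_copy_gb = ARCHIVE_TOTAL_GB / REPLICATION

print(f"Space saving: {saving_pct:.1f}%")
print(f"Single archived copy: about {single_copy_gb:.1f}GB")
```

If your 28GB equivalent is measured before replication, skip the division and multiply by the replication factor instead to get the total footprint.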
Check that we have the same search results!
Great – we have exactly the same results in Splunk Enterprise and in the archived data in Hunk. Note that this is expected, as we have copied the data to HDFS and left the original in Enterprise under its original life cycle management. Now we have the best of both worlds: data in Splunk Enterprise for fast interactive searching, and Hunk for running dense searches over your archived data.
What are you waiting for? Don’t send that archive data to tape; use Hunk to keep that archived data alive.