Retain More Data at Lower Cost with New AWS Storage Volume Types

This is a guest post contributed by David Potes, Partner Solutions Architect at Amazon Web Services

Many of the customers I work with are being asked to retain more of their logging data for longer periods. Some are driven by increasing compliance requirements, while others want to mine historical data to analyze their systems. With the recent release of Splunk Enterprise 6.4, you can dramatically shrink the size of your indexes and retain data longer at a lower storage cost. And if you’re running Splunk Enterprise on Amazon Elastic Compute Cloud (Amazon EC2), you can tier your Splunk storage to deliver the right performance at the right price.

The Throughput Optimized HDD (st1) and Cold HDD (sc1) volumes introduced today are designed for customers with large, streaming datasets that require consistent throughput at a low price point, such as Amazon EMR, ETL, log processing, and data warehouse workloads. By using these new Amazon EBS volume types along with the storage reduction feature announced in Splunk Enterprise 6.4, you have even more opportunities to effectively tier your Splunk storage on AWS.

When I talk to customers, I generally see two common approaches to optimizing index storage when running Splunk on AWS:

  • Run Amazon EC2 D2 instances to take advantage of the dense ephemeral storage available on these instance types. The Amazon EC2 D2 instance type can store up to 24 TB on ephemeral instance storage in a RAID 10 configuration.
  • Run Amazon EC2 C4 instances to take advantage of compute-optimized instances and store data on Amazon EBS. C4 instances are EBS-only, so all of the data is stored there.

Depending on the rate of ingest, the data retention requirements, and the relative complexity of the ingested data, I recommend modeling the costs of each approach to find the best balance between performance and cost. With the new AWS storage types available, we now have an opportunity to improve on both approaches.

Splunk Enterprise 6.4 has a new storage reduction feature that lets you reduce the tsidx data in your buckets based on an aging policy set by the administrator. A tsidx file is a time-series index file: it associates each unique keyword in your data with location references to events, which are stored in a companion rawdata file. Together, the rawdata file and its related tsidx files make up the contents of an index bucket. The tsidx files are vital for efficient searching across large amounts of data, but they also occupy substantial amounts of storage. For data you regularly search, tsidx files are essential for proper performance; for colder data, you can cut your costs by reducing the tsidx files on a schedule the administrator determines. An important caveat, however: rare searches – those looking for very few events with a specific query – will perform significantly slower without tsidx files present.

If you are already reducing tsidx files, you’ve determined that your workload is highly sequential and unlikely to require rare searches. Because the new volume types are optimized for throughput and sequential access patterns, they’re a great match for this sort of workload. Putting these two features together gives you the opportunity to use tiered storage: fast, random access for your hot data, and steady, low-cost throughput for your cold data. For certain configurations, the combination delivers the best balance of cost and performance.

Let’s look now at exactly how we would do this. In this scenario, we’ll use a D2.8xl with ephemeral instance storage for hot/warm buckets and st1 volumes for cold buckets. We’ve determined that our coldest data is ideal for tsidx reduction because we only run scheduled, historical reports on it. We ingest 1 TB per day, and we’d like to keep 30 days of data on high-performance storage, 90 days on medium-performance storage, and 240 days of cold data at minimal cost.

To set this up in practice, first we’ll reconfigure our cold buckets. Once you’ve mounted your storage, set the coldPath in your indexes.conf (see “Use multiple partitions for index data” in the Splunk admin guide for more details). Size your hot/warm storage to hold 30 days at the ingest rate of 1 TB per day, so that recent data stays on the fast instance storage. When data rolls to a cold bucket, it will be rehomed to the cheaper storage. To maximize our savings, we’ll enable tsidx reduction in indexes.conf and set the time period to 120 days (expressed in seconds). Any buckets older than 120 days will have their tsidx files reduced. We now have a three-level logical layout for our data: hot/warm on ephemeral storage, cold with full tsidx files on st1 for reasonably good performance, and cold historical data with minimized tsidx files on st1. Because tsidx files can be regenerated, we can move the line between our warm data and cold data after the fact (see link to reconstituting buckets in the admin docs).
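As a concrete sketch, an indexes.conf stanza for this layout might look like the following. The index name, mount points, and sizing value are illustrative assumptions, not values from any particular deployment; confirm the setting names (enableTsidxReduction, timePeriodInSecBeforeTsidxReduction) against the indexes.conf reference for your Splunk version:

```ini
# indexes.conf -- illustrative stanza; the index name, paths, and size cap
# are examples only. Verify setting names against your Splunk version's
# indexes.conf reference.
[main_logs]
# Hot/warm buckets on the fast ephemeral RAID 10 volume
homePath   = /splunk_hot/main_logs/db
# Cold buckets rehomed to the st1 EBS volume
coldPath   = /splunk_cold/main_logs/colddb
thawedPath = /splunk_cold/main_logs/thaweddb

# Cap hot/warm so roughly 30 days of indexed data (at 1 TB/day raw ingest)
# stays on instance storage before rolling to cold
homePath.maxDataSizeMB = 20000000

# Reduce tsidx files on buckets older than 120 days
# 120 days x 86,400 s/day = 10,368,000 seconds
enableTsidxReduction = true
timePeriodInSecBeforeTsidxReduction = 10368000
```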

Now, let’s look at the cost model:

Scenario 1:

D2.8xl, RAID 10 of 20 X 2TB drives (two drives are hot spares). The rest of the year’s data is stored on gp2 EBS:

30 days on the D2 ephemeral storage +

330 days on gp2 storage = 220 TB (the on-disk figures throughout assume indexed data occupies roughly two-thirds of the raw ingest volume)

220 TB @ $0.10 per GB-month ≈ $22,000 monthly for storage.

Scenario 2:

D2.8xl, RAID 10 of 20 X 2TB drives (two drives are hot spares). To store the data, we would use:

30 days on the D2 ephemeral storage +

330 days on st1 storage = 220 TB

220 TB @ $0.045 per GB-month = $9,900 monthly, a 55% savings!

Scenario 3:

Now let’s assume a 50% storage reduction for cold data without tsidx files (30 days hot, 90 days warm, 240 days cold). Again with a D2.8xl, RAID 10 of 20 X 2TB drives (two drives are hot spares), our total storage would be:

30 days on the D2 ephemeral storage +

90 days on st1 storage = 60 TB +

240 days on st1 storage without tsidx = ~80 TB

140 TB @ $0.045 per GB-month = $6,300 monthly, a 71% savings!
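The arithmetic behind all three scenarios can be sketched in a few lines. The helper name `monthly_cost` is my own, and the per-GB-month rates are the launch-era prices consistent with the figures above ($0.10 for gp2, $0.045 for st1), with 1 TB counted as 1,000 GB:

```python
# Back-of-the-envelope EBS cost model for the three scenarios above.
# Hot data lives on ephemeral instance storage at no EBS cost, so only
# the cold tiers are priced here.

GP2_PER_GB = 0.10   # $/GB-month (gp2 rate at the time of writing)
ST1_PER_GB = 0.045  # $/GB-month (st1 rate at the time of writing)

def monthly_cost(tb_on_volume: float, price_per_gb: float) -> float:
    """Monthly EBS cost for a given number of terabytes (1 TB = 1,000 GB)."""
    return tb_on_volume * 1000 * price_per_gb

# Scenario 1: 330 days of cold data (~220 TB on disk) on gp2
scenario1 = monthly_cost(220, GP2_PER_GB)

# Scenario 2: the same 220 TB moved to st1
scenario2 = monthly_cost(220, ST1_PER_GB)

# Scenario 3: 90 days with full tsidx (~60 TB) plus 240 days with
# tsidx reduced by ~50% (~80 TB), all on st1
scenario3 = monthly_cost(60 + 80, ST1_PER_GB)

for name, cost in [("gp2", scenario1),
                   ("st1", scenario2),
                   ("st1 + tsidx reduction", scenario3)]:
    savings = 1 - cost / scenario1
    print(f"{name:>22}: ${cost:>8,.0f}/month ({savings:.0%} savings vs gp2)")
```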

If you’d like to try this out yourself, check out the Splunk Enterprise AMI for free.

David Potes
Partner Solutions Architect
Amazon Web Services
