One of the first questions customers ask when they start considering index replication is about storage requirements. Index replication keeps additional copies of data for redundancy, so the main questions are how it affects storage needs and which factors to consider when designing a scalable storage architecture. I’ll cover the important factors in this blog post.
There are two major dimensions to consider: the replication policies and the data retention period.
Replication Factor (RF) and Search Factor (SF) control the replication policies. RF determines the number of copies of the raw data files to keep, while SF determines the number of copies of the time-series index files. For syslog data, the raw data files take about 15% of the original data volume on disk, and the index files take about 35%.
The second dimension is the retention period. This determines how long you want to keep data in Splunk before aging out the old data. Typical retention policies range from 3 to 6 months, although we have seen cases where the retention period runs into years.
Let’s walk through an example to see these numbers in action. Assume that the daily indexing volume is 200 GB, RF and SF are both set to 2, and we have a 2-node cluster. Let’s use a retention period of 45 days.
Raw data storage needed = 15% * 200 GB * 2 * 45 = 2.7 TB
Index file storage needed = 35% * 200 GB * 2 * 45 = 6.3 TB
Total space required on the cluster to store 45 days of data = 2.7 + 6.3 = 9 TB
Space required on an individual peer = 9 / 2 = 4.5 TB.
So, using this little formula we have roughly identified that we need 9 TB of disk space on the entire cluster to store, replicate, and retain data for 45 days. You can adjust the retention period and replication policies to see how it would affect your storage needs.
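To experiment with your own numbers, the formula above can be sketched as a small Python helper. The 15% and 35% ratios are the rough syslog figures from this post; the function name and parameters are illustrative, not part of any Splunk tooling.

```python
# Rough Splunk cluster storage estimate (hypothetical helper, not a Splunk API).
# Ratios are approximate figures for syslog data from the discussion above.
RAW_RATIO = 0.15    # raw data files as a fraction of daily indexed volume
INDEX_RATIO = 0.35  # time-series index files as a fraction of daily indexed volume

def cluster_storage_gb(daily_gb, rf, sf, retention_days):
    """Return (raw_gb, index_gb, total_gb) needed across the whole cluster."""
    raw_gb = RAW_RATIO * daily_gb * rf * retention_days      # RF copies of raw data
    index_gb = INDEX_RATIO * daily_gb * sf * retention_days  # SF copies of index files
    return raw_gb, index_gb, raw_gb + index_gb

# The example from this post: 200 GB/day, RF = SF = 2, 45-day retention, 2 peers.
raw_gb, index_gb, total_gb = cluster_storage_gb(200, rf=2, sf=2, retention_days=45)
per_peer_gb = total_gb / 2

print(f"raw: {raw_gb/1000:.1f} TB, index: {index_gb/1000:.1f} TB, "
      f"total: {total_gb/1000:.1f} TB, per peer: {per_peer_gb/1000:.1f} TB")
# raw: 2.7 TB, index: 6.3 TB, total: 9.0 TB, per peer: 4.5 TB
```

Changing `retention_days`, `rf`, or `sf` shows immediately how each policy knob scales the storage bill; note that the raw-data term scales with RF while the index term scales with SF, so the two can be tuned independently.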