The following is copped from a support email by Stephen Sorkin, who is the man behind the Splunk server curtain … I thought it should go broader.
I’m the manager of the search and indexing team at Splunk. We’re still in the process of writing up our findings from storage benchmarks but here are the general details.
Higher IO/s typically means both faster indexing in general and faster searching of rare, temporally incoherent events. On average, we’ve seen indexing speeds increase by about 66% going from a 7200 RPM SATA RAID to a 15K RPM SCSI RAID. We’ve seen comparable performance from SCSI and SAS RAIDs, provided they’re 15K RPM.
The best benchmarking tool we’ve found for measuring how Splunk will behave on your disk hardware is bonnie++. If your disk subsystem can sustain 800 IO/s, you’re in good shape.
As far as searching goes, IO/s is the dominant factor for non-coherent, infrequently accessed search results. This means that if you’re just searching for the newest data, or even have to reach back through 1MM events to return 10k, the disk is NOT the bottleneck, since each individual read() will pull many events off disk. However, if you’re searching for a rare term, like a name, that occurs once an hour or once a day, each read() is going to require the drive arm to move, so each retrieved event costs roughly one IO. If you’re using a 7200 RPM SATA drive, that’s about 100 IO/s and hence on the order of 100 retrieved events per second. If you have a decent RAID, that could be 800 retrieved events per second.
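To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the simple model described above, where every matching rare event costs one seek. The function name and the 30-day workload are made up for illustration and are not anything from Splunk itself; only the 100 and 800 IO/s figures come from the email.

```python
def rare_search_seconds(matching_events, iops):
    """Estimate seconds to retrieve events when each event costs one disk seek.

    Assumption: one IO per matching event, which is the rare-term case only.
    """
    if iops <= 0:
        raise ValueError("iops must be positive")
    return matching_events / iops


# Figures quoted in the email.
SATA_IOPS = 100   # single 7200 RPM SATA drive
RAID_IOPS = 800   # a decent RAID

# Hypothetical workload: a term that occurs once an hour, searched over 30 days.
events = 24 * 30

for label, iops in (("7200 RPM SATA", SATA_IOPS), ("decent RAID", RAID_IOPS)):
    print(f"{label}: about {rare_search_seconds(events, iops):.1f} s")
```

This is only an estimate for the seek-bound case. For searches over recent or densely packed data, a single read() returns many events and this model would badly overstate the time.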