This documentation applies to the following versions of Splunk: 4.0, 4.0.1, 4.0.2, 4.0.3, 4.0.4, 4.0.5, 4.0.6, 4.0.7, 4.0.8, 4.0.9, 4.0.10
Indexing is how Splunk processes the data you send it. Splunk can index any time-series data, that is, data with a timestamp associated with it. If the data does not have a timestamp, Splunk applies the current time to the data as it indexes it. When data is indexed, Splunk breaks it into events based on its timestamps; you can also specify other event delimiters, such as a regex match or whitespace.
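The timestamp fallback described above can be illustrated with a minimal Python sketch. This is not Splunk's implementation: the `TIMESTAMP` pattern and the `event_time` function are hypothetical names, and the sketch assumes a single timestamp format (`YYYY-MM-DD HH:MM:SS` at the start of the line), whereas Splunk recognizes many more.

```python
import re
from datetime import datetime

# Hypothetical pattern: only "YYYY-MM-DD HH:MM:SS" at the start of a line.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def event_time(raw, now):
    """Return the timestamp found in `raw`, or `now` if there is none."""
    match = TIMESTAMP.match(raw)
    if match:
        try:
            return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            pass  # digits matched but do not form a valid date
    # No usable timestamp: fall back to the current (indexing) time.
    return now
```

Calling `event_time(line, datetime.now())` on a line without a recognizable timestamp returns the time passed in as `now`.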
All data that comes into Splunk is indexed through the universal pipeline. Data enters the universal pipeline in large chunks of 10,000 bytes. As part of pipeline processing, these chunks are broken into events. Initially, newline characters signal an event boundary. In the next stage of processing, Splunk applies the line merging rules specified in props.conf.
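The two stages described above (break on newlines, then merge lines) can be sketched in Python as follows. The merging rule used here is an assumption made for illustration: a line that starts with a date begins a new event, and any other line is merged into the previous event. The actual rules are whatever props.conf specifies; `STARTS_WITH_DATE` and `break_into_events` are hypothetical names.

```python
import re

# Hypothetical merging rule: a line starting with YYYY-MM-DD begins a new event.
STARTS_WITH_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}")

def break_into_events(chunk):
    """Split a chunk on newlines, then merge continuation lines."""
    events = []
    for line in chunk.splitlines():      # stage 1: newline = event boundary
        if not line.strip():
            continue                     # ignore blank lines
        if events and not STARTS_WITH_DATE.match(line):
            events[-1] += "\n" + line    # stage 2: merge into previous event
        else:
            events.append(line)
    return events
```

With this rule, a multi-line stack trace that follows a dated log line stays attached to that line as a single event.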
As part of indexing, events are broken into sections called segments. Splunk uses a list of breaking characters and other rules (such as the maximum number of characters per segment) that are configurable through segmenters.conf.
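As a rough illustration of segmentation, the Python sketch below splits an event on a set of breaking characters and caps the segment length. The `BREAKERS` set, the `max_len` default, and the choice to cut an overlong run into several pieces are all assumptions made for this example; the real breaking characters and limits are defined in segmenters.conf.

```python
# Hypothetical breaking characters; the real list lives in segmenters.conf.
BREAKERS = set(" \t\n,;:=[](){}\"'")

def segment(event, max_len=20):
    """Split `event` on breaking characters into segments of at most `max_len` characters."""
    segments = []
    current = ""
    for ch in event:
        if ch in BREAKERS:
            if current:
                segments.append(current)
            current = ""
        else:
            current += ch
            if len(current) == max_len:
                # Assumption: an overlong run is cut into max_len-sized pieces.
                segments.append(current)
                current = ""
    if current:
        segments.append(current)
    return segments
```

For example, `segment("user=admin action=login")` yields the four segments `user`, `admin`, `action`, and `login`.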
Indexing is an I/O-intensive process. If you're building a system to index a lot of data, Splunk recommends that you account for this I/O load when you plan the system.
While Splunk is indexing data, one or more instances of the splunk-optimize process run intermittently, merging index files together to optimize performance when searching the data. The splunk-optimize process can use a significant amount of CPU, but only for short periods of time; it should not consume CPU indefinitely. You can alter the number of concurrent instances of splunk-optimize by changing the value of maxConcurrentOptimizes in indexes.conf, but this is not typically necessary.
splunk-optimize should run automatically only on the hot database (db-hot).
You can run it manually on a warm database if you find one with a large number of .tsidx files (more than 25): ./splunk-optimize <directory>
If splunk-optimize does not run often enough, search efficiency will be affected.
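To find warm databases that might benefit from a manual splunk-optimize run, you could count the .tsidx files in each database directory. The Python sketch below is a hypothetical helper, not a Splunk tool. It assumes that warm databases are the directories whose names start with `db_` under the index path you pass in, and it uses the threshold of 25 files mentioned above.

```python
import os

def dirs_needing_optimize(index_path, threshold=25):
    """Return (directory, count) pairs for db_* directories holding more than `threshold` .tsidx files."""
    results = []
    for name in sorted(os.listdir(index_path)):
        path = os.path.join(index_path, name)
        # Assumption: warm databases are directories named db_...; db-hot is skipped.
        if not (name.startswith("db_") and os.path.isdir(path)):
            continue
        count = sum(1 for f in os.listdir(path) if f.endswith(".tsidx"))
        if count > threshold:
            results.append((path, count))
    return results
```

Each directory returned is a candidate for `./splunk-optimize <directory>`.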
Splunk stores all processed data in indexes. Indexes, in turn, are stored in databases, which are located in $SPLUNK_HOME/var/lib/splunk. A database is a directory named db_<starttime>_<endtime>_<seq_num>. An index is a collection of database directories.
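The directory naming scheme can be read programmatically. The Python sketch below parses a name of the form db_<starttime>_<endtime>_<seq_num> into its three fields. It assumes each field is a non-negative decimal integer, which the text above does not state; `Database` and `parse_db_name` are hypothetical names.

```python
import re
from collections import namedtuple

Database = namedtuple("Database", "start_time end_time seq_num")

# Assumption: all three fields are plain decimal integers.
DB_NAME = re.compile(r"^db_(\d+)_(\d+)_(\d+)$")

def parse_db_name(name):
    """Parse 'db_<starttime>_<endtime>_<seq_num>'; return None if the name does not match."""
    match = DB_NAME.match(name)
    if match is None:
        return None
    return Database(*(int(group) for group in match.groups()))
```

Names that do not follow the pattern, such as `db-hot`, return `None`.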
Splunk comes with several preconfigured indexes. Read About managing indexes in this manual for more information.