Splunk's core competency is indexing and searching any type of IT data with speed and efficiency. This versatility can present challenges to both new and seasoned users of Splunk when attempting to identify factors that can affect performance. This section reviews a variety of factors and offers suggestions on how to tune Splunk for a given deployment.
Segmentation is how Splunk identifies items to index in your IT data that aren't key/value pairs or fields. These indexed items, or segments, along with fields, are the building blocks inside IT data that search capabilities are built upon. Tuning segmentation can improve indexing performance by lowering the total processing required to index each line of IT data and by increasing the potential for effective compression.
Splunk maintains two concepts of segments, called major and minor segments.
For example, the IP address 192.168.1.254 would be indexed entirely as a major segment and then broken up into the following minor segments: 192, 192.168, and 192.168.1.
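The IP address example can be sketched in a few lines of Python. This is only an illustration of the major/minor idea as described above, not Splunk's actual segmenter; the function names and the choice of "." as the only minor breaker are assumptions made for the example.

```python
def minor_segments(major, breaker="."):
    """Return the cumulative prefixes of a major segment.

    Hypothetical helper: splits on a single minor breaker character and
    returns every proper prefix, mirroring the 192 / 192.168 / 192.168.1
    example in the text.
    """
    parts = major.split(breaker)
    # Each proper prefix of the dotted token becomes one minor segment.
    return [breaker.join(parts[:i]) for i in range(1, len(parts))]


def segment(token):
    """Return the major segment together with its minor segments."""
    return {"major": token, "minor": minor_segments(token)}
```

Calling segment("192.168.1.254") yields the full address as the major segment and the three prefixes listed above as minor segments. More segments mean more items to index, which is why the level of segmentation affects indexing cost.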
Segmentation directly affects indexing and data storage performance, and the size of the effect depends on the data set in use.
You can completely disable segmentation, which allows for maximum indexing performance and storage efficiency. Of course, this comes at the expense of search convenience and search speed. With segmentation disabled, you can search using the regex search directive (which provides full regular expression search capabilities), using information indexed in search fields, or using a combination of the two.
Note: Searches that involve regex take longer to execute due to the processing required to find regular expressions in IT data.
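To make the trade-off concrete, the sketch below contrasts a lookup against a small token index with a regular expression scan over raw events. It is a toy model under stated assumptions (the sample events, the tokenizing rule, and all function names are invented for illustration) and does not reflect how Splunk stores or searches data internally.

```python
import re
from collections import defaultdict

# Invented sample events for illustration only.
EVENTS = [
    "src=192.168.1.254 action=allowed",
    "src=10.0.0.7 action=denied",
    "src=192.168.1.10 action=denied",
]


def build_index(events):
    """Map each token (split on whitespace or '=') to the events containing it."""
    index = defaultdict(set)
    for event_id, event in enumerate(events):
        for token in re.split(r"[\s=]+", event):
            if token:
                index[token].add(event_id)
    return index


def indexed_search(index, term):
    # A single dictionary lookup; no event text is read at search time.
    return sorted(index.get(term, set()))


def regex_search(events, pattern):
    # Every event is scanned, so the cost grows with the volume of data.
    compiled = re.compile(pattern)
    return [i for i, event in enumerate(events) if compiled.search(event)]
```

The indexed search pays its cost once, at indexing time, while the regex search pays it on every query. Disabling segmentation shifts the work in the same direction: less effort when indexing, more when searching.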
Splunk can automatically extract the source hosts from a given piece of IT data, which is useful in situations where data is being aggregated before arriving at Splunk to be indexed.
Splunk can also identify timestamps in a variety of formats in any given piece of IT data. This helps not only with pre-aggregated data but also with data sources that embed their timestamps in non-standard formats.
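As a rough idea of what recognizing timestamps in several formats involves, the following sketch tries a short list of patterns against a line and parses the first match. The two formats, the sample log lines, and the function name are assumptions chosen for the example; Splunk's own timestamp recognition is not shown here and supports far more formats.

```python
import re
from datetime import datetime

# (pattern, strptime format) pairs; both formats are illustrative assumptions.
# Note that %b matches month abbreviations in the current locale.
PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}"), "%d/%b/%Y:%H:%M:%S"),
]


def extract_timestamp(line):
    """Return the first timestamp found in the line, or None if none is recognized."""
    for pattern, fmt in PATTERNS:
        match = pattern.search(line)
        if match:
            try:
                return datetime.strptime(match.group(0), fmt)
            except ValueError:
                # Looked like a timestamp but was not a valid date; try the next format.
                continue
    return None
```

A line from a web proxy log and a line from a system log would both resolve to a datetime value even though they write the time differently, while a line with no recognizable timestamp returns None.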
The combination of indexing options you select ultimately defines how convenient it is to search your IT data. Any combination of the above options is supported and can be implemented on a per source or source type basis. This lets you minimize the index overhead associated with data that is not searched frequently, while making commonly searched data more convenient for users.
A great example of how this can be used to optimize a Splunk deployment is IT policy compliance. Splunk can be used to search proxy server and transaction logs for user access monitoring and user activity search, while also serving as a central repository for other types of IT data, such as system logs, that must be retained but may be of less interest to a compliance administrator.
To maintain maximum convenience and allow saved searches to run quickly and efficiently, the maximum amount of segmentation should be applied to the proxy server and transaction logs, which would be configured as discrete sourcetypes. Additional search fields may also be desirable to quickly identify certain key/value pairs of interest. System logs, also a discrete sourcetype, could have segmentation disabled, given that they are simply being aggregated and stored to adhere to the IT control or mandate.
If you're having problems with odd data being presented from your otherwise "normal" sources, such as incorrect times being reported from a firewall log, ensure the sourcetype is correctly set. Edit %homedir%/splunk/opt/etc/system/local/inputs.conf to view and edit your inputs. See the Wiki page on input types: Splunk recognizes a ton of them by default, and the page includes samples of what your data may look like to help you match up the type. If your type isn't listed there, you can train Splunk to recognize it (another topic).