How source types work

This documentation does not apply to the most recent version of Splunk.

This documentation applies to the following versions of Splunk: 3.2 , 3.2.1 , 3.2.2 , 3.2.3 , 3.2.4 , 3.2.5 , 3.2.6


A source type is any common format of data. sourcetype= is one of Splunk's default fields (it's indexed and stored with every event). It provides an easy way to find similar types of data from any input. For example, you might search sourcetype=weblogic_stdout even though weblogic might be logging from two different domains.


Source vs source type

Source is also one of Splunk's default fields, indexed and stored with every event as source=. It refers to any file, stream, or other input sending data to Splunk. For data coming from files and directories, the value of source is the full path, such as /archive/server1/var/log/messages.0 or /var/log/. The value of source for network-based data sources is the protocol and port, such as UDP:514.


Different sources can have the same source type. For example, you may tail source=/var/log/messages and also receive direct syslog input from UDP:514. You can find events from both sources by searching for sourcetype=linux_syslog.
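The relationship between the two fields can be sketched in a few lines of Python. This is only an illustration of the idea, not Splunk's search implementation: the Event class, the search helper, the weblogic file path, and the raw event text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A toy event carrying the two default fields discussed above."""
    source: str       # file path, or protocol and port for network input
    sourcetype: str   # common format of the data
    raw: str          # hypothetical event text


# Two different sources share one source type; a third has its own.
events = [
    Event("/var/log/messages", "linux_syslog", "sshd[411]: session opened"),
    Event("UDP:514", "linux_syslog", "kernel: eth0 link up"),
    Event("/opt/bea/logs/stdout.log", "weblogic_stdout", "Server started"),  # assumed path
]


def search(events, **fields):
    """Return the events whose fields equal every given key=value pair."""
    return [e for e in events
            if all(getattr(e, key) == value for key, value in fields.items())]


syslog_events = search(events, sourcetype="linux_syslog")
syslog_sources = sorted(e.source for e in syslog_events)
print(syslog_sources)
```

Filtering on sourcetype returns events from both the tailed file and the network input, while filtering on source would return only one of them.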


How sourcetype= values are set

Automatic source type classification

During indexing, Splunk classifies source types automatically by calculating signatures for patterns in the first few thousand lines of any file or stream of network input. These signatures pick up things like repeating patterns of words, punctuation patterns, and line length. Once Splunk has calculated a signature, it compares the signature to previously seen signatures. If the pattern is radically new, Splunk creates a new source type. Learned pattern information is stored in sourcetypes.conf.
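To make the idea of a signature concrete, here is a minimal sketch that uses only punctuation patterns. It is not Splunk's actual algorithm, which this page does not specify. The function names, the choice of a punctuation skeleton, and the Jaccard similarity measure are all assumptions made for illustration.

```python
import re
from collections import Counter


def punct_signature(line):
    """Reduce a line to its punctuation skeleton (assumed technique).

    Letters and digits are dropped and each run of whitespace becomes '_'.
    """
    no_alnum = re.sub(r"[A-Za-z0-9]+", "", line)
    return re.sub(r"\s+", "_", no_alnum.strip())


def source_signature(lines, top=3):
    """Summarize a source by its most common per-line skeletons."""
    counts = Counter(punct_signature(line) for line in lines)
    return frozenset(sig for sig, _ in counts.most_common(top))


def similarity(a, b):
    """Jaccard overlap between two source signatures (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sample lines.
syslog_a = source_signature([
    "Jan  1 00:00:01 host1 sshd[411]: session opened",
    "Jan  1 00:00:05 host1 sshd[412]: session closed",
])
syslog_b = source_signature(["Feb  3 10:11:12 host2 cron[99]: job started"])
csv_like = source_signature(["10.0.0.1,GET,/index.html,200"])

print(similarity(syslog_a, syslog_b), similarity(syslog_a, csv_like))
```

In this sketch the two syslog samples produce the same skeleton and so look like one source type, while the comma-separated sample shares nothing with them and would be treated as a new pattern.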


To configure your own automatic source type recognition, use Splunk's rule-based source type feature. Rule-based source types are automatically assigned based on regular expressions you specify in props.conf. Learn more about how to configure rule-based source types.


Rename source types

To assign new source type names, edit sourcetypes.conf. However, this only changes the name for future data inputs. To change the source type for events that have already been indexed, create an alias for a source type. Aliasing source types is a cosmetic change that allows users to search for source type values that make sense.


NOTE: If you set indexing properties for a source type in props.conf, you must use the actual stored source type value from sourcetypes.conf.


Train the sourcetype auto-classifier

To customize source type names, use Splunk's auto-classifier with a set of representative example files. If you train it with a wide enough range of files that you'd like to share the same source type, it learns better rules, and Splunk's recognition improves for newly indexed files of that source type. Pre-training is how Splunk ships with the ability to assign sourcetype=syslog to most syslog files.


To bypass Splunk's auto-classification, you can skip the training step and simply hardcode a source type for each data input. However, training may still be more effective if you plan to have Splunk index entire directories of mixed source types (such as /var/log). Learn how to train Splunk to recognize source types.


If Splunk fails to recognize a common format, or misclassifies it, we encourage you to report the problem and send us a sample file so we can improve the product. You can anonymize your file using Splunk's built-in anonymizer tool.


Hard-coded source type assignment

You can bypass automatic source type classification entirely and set a source type yourself when you configure a data input. See setting source type for an input. However, this method is not very granular: all data from the same host or source is assigned the same source type name.


If you need to give different sources within a single directory input different source type names, you can try setting source type for a source.


Source type precedence

Splunk sets source types in this order:


1. Explicit specification of source type per input stanza in inputs.conf:


[tail://$PATH]
sourcetype=$SOURCETYPE

2. Explicit specification of source type per source by creating a stanza in props.conf:


[$SOURCE] 
sourcetype=$SOURCETYPE

3. Rule-based association of source types:


Allows you to match sources to source types using classification rules specified in rule:: stanzas in props.conf.


4. Intelligent matching:


Matches similar-looking files and creates a source type.


5. Delayed rules:


Works like rule-based associations, except you create a [delayedrule:: ] stanza in props.conf. This is a useful "catch-all" for source types, in case Splunk missed any.


6. Automatic source type learning:


Splunk creates new source types based on sources that don't already have source types associated with them.
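The precedence list above amounts to "the first method that yields a source type wins". The following sketch models that order in Python. It is an illustration only: every table, pattern, and source type name in it is hypothetical, the intelligent matching step is left as a placeholder because this page does not describe how it works, and the name produced by the learning step is made up.

```python
import re

# Hypothetical configuration standing in for the .conf files.
INPUT_STANZAS = {"/var/log/messages": "linux_syslog"}            # 1. inputs.conf
SOURCE_STANZAS = {"/opt/app/app.log": "my_app"}                  # 2. props.conf
RULES = [(re.compile(r"^\d+\.\d+\.\d+\.\d+ "), "access_log")]    # 3. rule::
DELAYED_RULES = [(re.compile(r"."), "catch_all")]                # 5. delayedrule::


def match_rules(rules, sample):
    """Return the source type of the first rule whose pattern matches."""
    for pattern, sourcetype in rules:
        if pattern.search(sample):
            return sourcetype
    return None


def resolve_sourcetype(source, sample):
    """Walk the six steps in order and stop at the first one that answers."""
    steps = [
        ("input stanza", lambda: INPUT_STANZAS.get(source)),
        ("source stanza", lambda: SOURCE_STANZAS.get(source)),
        ("rule", lambda: match_rules(RULES, sample)),
        ("intelligent matching", lambda: None),   # placeholder, not modelled
        ("delayed rule", lambda: match_rules(DELAYED_RULES, sample)),
        ("learning", lambda: "learned_type"),     # made-up name, always answers
    ]
    for name, step in steps:
        sourcetype = step()
        if sourcetype is not None:
            return sourcetype, name


print(resolve_sourcetype("/var/log/messages", "10.0.0.1 GET /index.html"))
print(resolve_sourcetype("/tmp/unknown.log", "10.0.0.1 GET /index.html"))
```

The first call is settled by the input stanza even though the sample would also match a rule, which is the point of the ordering: explicit settings override rules, and rules override learned types.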


Custom indexing linked to source types

Tie custom indexing properties to any source type via props.conf. To do so, use the source type as the <spec> that heads a stanza in props.conf. Here are a few things you can do:


Tweak default processing

When Splunk indexes a data source, it automatically breaks the input into distinct events and extracts a host and timestamp for the event. The event boundaries, host, and timestamps are important for analysis. If Splunk does not set the event boundaries or extract timestamps and hosts correctly, you can easily modify these settings. See timestamp recognition, how host is assigned, and event boundaries for more information.


Mask sensitive data

Your logs may contain sensitive personal data, such as social security numbers or passwords, that you want to hide. You can create an event configuration that masks sensitive data as it is processed on input.
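Masking of this kind is usually a regular-expression substitution applied to each event before it is stored. The sketch below shows the general technique in Python. It does not reproduce Splunk's configuration syntax, and the patterns, the mask_event function, and the sample event are assumptions chosen for illustration.

```python
import re

# Assumed patterns: a US social security number and a password=... pair.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PASSWORD = re.compile(r"(password=)\S+", re.IGNORECASE)


def mask_event(raw):
    """Replace sensitive values with fixed placeholders."""
    raw = SSN.sub("XXX-XX-XXXX", raw)
    # Keep the key (group 1) so the event still shows a password was present.
    return PASSWORD.sub(r"\1********", raw)


masked = mask_event("user=bob ssn=123-45-6789 password=hunter2 status=ok")
print(masked)
```

Because the substitution happens on input, the original values never reach the index, so they cannot be recovered later by searching.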


Change indexing density

When Splunk indexes data, it segments events via major and minor breakers. To save storage space on the indexer, you can edit Splunk's default segmentation settings. For example, web proxy logs may contain lengthy URLs that Splunk breaks into many different minor segments. You may wish to change this setting to eliminate unnecessary overhead.
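The storage effect of minor segmentation can be estimated with a rough sketch. The breaker sets below are assumptions for illustration, not Splunk's default segmentation settings: whitespace stands in for major breakers, and a handful of URL punctuation characters stand in for minor breakers.

```python
import re

MAJOR_BREAKERS = r"\s+"          # assumed: whitespace only
MINOR_BREAKERS = r"[/.:=?&-]+"   # assumed: common URL punctuation


def segments(event, minor=True):
    """List the terms indexed for an event.

    Each major segment is kept whole. When minor segmentation is on, the
    pieces between minor breakers are indexed as well.
    """
    terms = []
    for major in re.split(MAJOR_BREAKERS, event.strip()):
        if not major:
            continue
        terms.append(major)
        if minor:
            parts = [p for p in re.split(MINOR_BREAKERS, major) if p]
            if len(parts) > 1:   # only add pieces if the term actually split
                terms.extend(parts)
    return terms


event = "GET http://example.com/a/b?id=7 200"   # hypothetical proxy log line
print(len(segments(event)), len(segments(event, minor=False)))
```

In this example a single URL accounts for most of the indexed terms, which is why turning off or reducing minor segmentation for URL-heavy source types can save space, at the cost of no longer being able to search for the individual pieces.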


Eliminate processing steps

Certain processing steps can be eliminated to provide faster indexing and better throughput. For example, if you don't need Splunk to search for timestamps within events, you can turn off timestamp extraction.


Field extraction linked to source types

Associate field extraction rules with source types. Like custom indexing properties, field extraction rules are based on the stored sourcetype= value set at index time, so aliasing a source type won't cause it to pick up new rules. You'll need to either hardcode the correct sourcetype= value for the source or input, or train Splunk.


Configuration files for source types

Set the source type for an input in inputs.conf. Set the source type for a source, configure custom indexing properties, and define rule-based associations of source types in props.conf. Before manually modifying any configuration file, read about configuration files.
