Sourcetypes Gone Wild

Tips & Tricks February 11, 2010 Splunk

HELP, I have 515 sourcetypes!

Splunk can help bring order to the chaos of IT systems. But when Splunk itself is in disarray it can hinder your powers of search and put a serious damper on your Splunk experience. Giving Splunk a few pointers can make a big difference.

Because it can index anything, it is tempting to let Splunk loose on your entire data repository and expect it to sort everything out automatically. We are working on a better setup and getting-started experience, but until then Splunk can’t read your mind. For now, it can definitely use a bit of coaching.

Without coaching, you could end up with non-descriptive sourcetypes called ‘breakable_text’ or ‘too_small’ or, better yet, hundreds of sourcetypes: sourcetypes gone wild. To whip things into submission, simply add rules for assigning sourcetypes.

Whoa, back up the trolley. What is a sourcetype?

A sourcetype is Splunk’s term for data of a specific format. For example, HTTP access logs are known as access_common or access_combined. Splunk ships with a set of sourcetypes, which means there are pre-configured rules for recognizing timestamps, extracting fields, and breaking lines. You can also define your own sourcetypes. Rules can be applied to the default sourcetypes or to any new ones you create. For more on sourcetypes: http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter
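To make "define your own sourcetype" a little more concrete, here is a minimal props.conf sketch for a custom sourcetype. The stanza name mycustomsourcetype is borrowed from the scenarios in this post; the log format (one event per line, starting with a timestamp like 2010-02-11 12:34:56 followed by a log level) and the field name level are assumptions made purely for illustration, so adjust them to your own data.

```
# props.conf (illustrative sketch; the log format below is assumed)
[mycustomsourcetype]
# Line breaking: treat every line as its own event
SHOULD_LINEMERGE = false
# Timestamp recognition: the timestamp sits at the start of the event
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
# Field extraction: hypothetical "level" field after the date and time
EXTRACT-level = ^\S+ \S+ (?<level>[A-Z]+)
```

The three groups of settings map onto the three kinds of rules mentioned above: line breaking, timestamp recognition, and field extraction.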

So what? Why should I care?

If you want to find the broken links on your website, then having sensible sourcetypes will allow you to focus on just the HTTP access logs and not your database logs or syslog or your mama’s logs.

If you are auditing access to your database cluster, then it’s really not important to search anything but your database logs. Having well-organized sourcetypes will filter out the noise.

With sourcetypes under control, searches on HTTP access data can start out with “sourcetype=access_common”. It’s just one more way to pivot on all the events in your Splunk datastore.

Scenario 1: The logs are scattered in different directories.

Your directory structure looks something like this:

/lotsoflogs/http/access.log
           /wls/weblogic.log
           /db2/db2diag.log
           /myapp/customapp.log

Easy peasy. Add the inputs one at a time, and specify a sourcetype during setup. This can be done in the SplunkWeb Manager, or via inputs.conf:

[monitor:///lotsoflogs/http/access.log]
sourcetype = access_common

[monitor:///lotsoflogs/wls/weblogic.log]
sourcetype = weblogic_stdout

[monitor:///lotsoflogs/db2/db2diag.log]
sourcetype = db2_diag

[monitor:///lotsoflogs/myapp/customapp.log]
sourcetype = mycustomsourcetype

Scenario 2: All the log files are in a single directory.

Your directory structure looks more like this:

/lotsoflogs/access.log
            websphere.log
            db2diag.log
            customapp.log

First, define a single input for the parent directory, then add rules that apply a sourcetype based on the file name. This currently cannot be done entirely in the SplunkWeb Manager; the sourcetyping rules are added by editing props.conf.

In inputs.conf:

[monitor:///lotsoflogs]

In props.conf:

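[source::.../access.log]
sourcetype = access_common

[source::.../websphere.log]
sourcetype = websphere_out

[source::.../db2diag.log]
sourcetype = db2_diag

[source::.../customapp.log]
sourcetype = mycustomsourcetype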

Scenario 3: The logs are stored in arbitrarily nested subdirectories, say by host. It’s really a combination of Scenarios 1 and 2.

Your directory structure looks something like this:

/lotsoflogs/host1/access.log
                  weblogic.log
                  db2diag.log
                  customapp.log
/lotsoflogs/cluster1/host1/access.log
                           weblogic.log
                           db2diag.log
                           customapp.log

The answer is the same as the one for Scenario 2. The point is that no matter how your directories and files are nested, you can set sourcetype rules using the name of the file.
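As a rough sketch of why nesting does not matter, the fragment below assumes the rules are written as source:: stanzas in props.conf, where "..." in the pattern matches any number of directory levels. The weblogic_stdout name is the one used in Scenario 1; the remaining files would get their own stanzas in the same way.

```
# inputs.conf: one input for the parent directory (subdirectories are
# monitored recursively by default)
[monitor:///lotsoflogs]

# props.conf: "..." spans any number of directories, so each rule covers
# both /lotsoflogs/host1/<file> and /lotsoflogs/cluster1/host1/<file>
[source::.../access.log]
sourcetype = access_common

[source::.../weblogic.log]
sourcetype = weblogic_stdout
```

Because the rules key on the file name alone, adding a new host or cluster directory later should not require any configuration change.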

Scenario 4: The same log or data stream contains different types of events, and thus differing event formats.

This is ugly logging practice and should be avoided. We do understand, however, that as Splunk admins you may not always have influence over how data is collected and how it’s formatted. So here goes…

Consider a sample snippet from a single source containing at least 2 differing event formats:

Example file /path/to/sample.log:

Jun 24 12:21:36 12.34.56.78 %PIX-6-302016: Teardown UDP connection 12345 for INET:1.2.3.4/29585 to inside:5.6.7.8/53 duration 0:00:01 bytes 265
Jun 24 15:57:12 12.34.56.78 %PIX-6-302016: Teardown UDP connection 678910 for INET:1.2.3.4/29585 to inside:5.6.7.8/53 duration 0:00:01 bytes 265
Jun 24 18:34:45 10.2.1.44 sshd(pam_unix)[17188]: session closed for user footer
Jun 24 19:36:19 10.2.1.44 su(pam_unix)[9795]: session opened for user foobar by (uid=0)

The first 2 events are Cisco PIX events, while the last 2 are plain syslog events. To assign the 2 types of events to 2 separate sourcetypes, the configuration will look something like the following.

In props.conf:

[source::/path/to/sample.log]
TRANSFORMS-yummy = setCPSourcetype, setSyslogSourcetype

In transforms.conf:

[setCPSourcetype]
DEST_KEY = MetaData:Sourcetype
REGEX = %PIX-
FORMAT = sourcetype::cisco-pix

[setSyslogSourcetype]
DEST_KEY = MetaData:Sourcetype
# The optional (...) group lets process names such as sshd(pam_unix) match
REGEX = \w+ \d+ \d+:\d+:\d+ \S+ \w+(?:\([^)]*\))?\[\d+\]:
FORMAT = sourcetype::syslog

Setting DEST_KEY to MetaData:Sourcetype instructs Splunk to overwrite the sourcetype it would otherwise assign with the value in the FORMAT parameter. The REGEX parameter needs to be flexible enough to match every event of a particular format but restrictive enough to match only events of that format. Scenario 4 is fairly uncommon, but it is included here for completeness.

Happy Sourcetyping

Don’t forget to restart Splunk after applying these changes to inputs.conf, props.conf, or transforms.conf. I hope this adventure in sourcetyping helps you on your way to a less unruly and more orderly Splunk.

----------------------------------------------------
Thanks!
Vi Ly
