Dropping Useless Headers in Splunk 6

This is Part 2 of our series on new ways to handle file headers in Splunk 6. The IIS logs we looked at last time actually had useful information in the file header. We picked up the field names and associated them with the data at index time, saving you an extra step after consuming the data with Splunk. Plus, index time means speed. But if you spend enough time ingesting log files with Splunk, you’ve probably come across formats whose headers don’t tell you very much. They may carry several lines of startup information before the log data, where the interesting stuff begins. Not to trash talk WebSphere, but its log format is notorious for big file headers that contain little or no useful information about the structure of the events. In these cases, we might want to just throw away the file header and start indexing where the events begin. For example:

************ Start Display Current Environment ************
WebSphere Platform 6.1 [ND cf210844.13]  running with process name splunk_cell_A\fsgwws189Node_A\splunk_A_c01_s189_m06 and process id 17904
Detailed IFix information: ID: 6.1.0-WS-WASSDK-AixPPC32-FP0000021  BuildVrsn: null  Desc: Software Developer Kit
ID: 6.1.0-WS-WAS-AixPPC32-FP0000021  BuildVrsn: null  Desc: WebSphere Application Server

And it goes on and on for another 30 lines or so until we see the end of the header and the actual beginning of the events.

************* End Display Current Environment *************
[6/8/10 6:01:30:718 EDT] 0000000a ManagerAdmin  I   TRAS0017I: The startup trace state is *=info.
[6/8/10 6:01:31:348 EDT] 0000000a ManagerAdmin  I   TRAS0111I: The message IDs that are in use are deprecated
[6/8/10 6:01:31:666 EDT] 0000000a AdminInitiali A   ADMN0015I: The administration service is initialized.

The way we tell Splunk to skip the header and start indexing at the actual events comes from the controls we introduced in Splunk 6 for props.conf, specifically the FIELD_HEADER_REGEX setting. In my inputs.conf I’ve set up a monitor stanza and told it to assign the sourcetype websphere to the input.
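
A monitor stanza along these lines would do it. This is a minimal sketch: the log path is a hypothetical placeholder, and only the websphere sourcetype name is taken from the setup described here.

```
# inputs.conf (sketch; the monitored path is hypothetical)
[monitor:///opt/IBM/WebSphere/AppServer/profiles/AppSrv01/logs/server1/SystemOut.log]
sourcetype = websphere
```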


Then in my props.conf I’ve specified what I need to make Splunk behave with the data format.

[websphere]
TIME_FORMAT = [%m/%d/%y %H:%M:%S %z]
BREAK_ONLY_BEFORE = \[.+:.{2}:.{2}:.{3}\s
MAX_EVENTS = 13000

I stole some of this from the WebSphere App but added the FIELD_HEADER_REGEX setting, which tells Splunk to look for the last line of the header from above:

************* End Display Current Environment *************

Splunk then starts indexing events after that line. You could also use HEADER_FIELD_LINE_NUMBER if your data always writes the same number of header lines.
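
Put together, the header-skipping part of the stanza might look something like the sketch below. Both values are assumptions: the regex is a guess at a pattern that matches the End Display Current Environment line shown above, and the line number in the commented-out alternative is a placeholder, since the real header length depends on your files.

```
# props.conf (sketch; the regex and the line number are assumptions)
[websphere]
# Assumed pattern: a run of asterisks, the closing banner text, more asterisks.
FIELD_HEADER_REGEX = \*+ End Display Current Environment \*+

# Alternative for files whose header always has the same length.
# 35 is a placeholder; use the line number where the header ends in your data.
# HEADER_FIELD_LINE_NUMBER = 35
```

Either way, it is worth trying the setting against a sample file in a test index before pointing it at production data.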

Dropping these useless lines has a couple of benefits. For one, when I go to search and report in Splunk, I don’t have to craft my search to exclude the header. For another, timestamps get cleaner. In most cases Splunk is pretty good at locating the timestamp within the data source, but the header lines carry no timestamp, so Splunk would have to fall back on the mod time of the file or the surrounding events to guess which timestamp to apply. Splunk is a time-series index, so getting timestamps right when consuming data is paramount, and dropping the header removes any question about where to look for the event time.

Finally, just to show you that it actually works: when I run a search in Splunk on my sourcetype and look at the first event, I should see the first line of the event data from above, with no header.

Yep, works.

Posted by Patrick Ogdin