Key-value pair extraction: definition, examples and solutions

Most of the time, logs contain data that humans can easily recognize as fully or semi-structured information. Being able to extract structure from log data is a necessary first step toward further, more interesting analysis. While it would be great to automatically extract the structure from all log data, Splunk cannot rival the brain’s performance at this time; however, it is able to tap into your brain for help :) Read on…

Problem definition:

Extract structured information (in key/field=value form) from unstructured or semi-structured log data. Note: for the purpose of this post, “key” and “field” are used interchangeably to denote a variable name.

Problem examples:

Splunk debug message (humans: easy, machine: easy)

12-03-2007 13:51:55.114 DEBUG SearchPipelinePerformance - processor=save queryid=_1196718714_619358 executetime=0.014secs
ideal structured information to extract:

Splunk tries to make it easy for itself to parse its own log files (in most cases).
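To make the "machine: easy" rating concrete, here is a minimal Python sketch of how a line like the one above could be parsed. It is not how Splunk itself does it; the pattern and the helper name `extract_kv` are assumptions made for illustration. The sketch assumes keys consist of word characters and values contain no whitespace.

```python
import re

# Assumption: a key is a run of word characters, and a value is
# everything after "=" up to the next whitespace character.
KV_PATTERN = re.compile(r"(\w+)=(\S+)")

def extract_kv(line):
    """Return a dict of the key=value pairs found in a log line."""
    return dict(KV_PATTERN.findall(line))

line = ("12-03-2007 13:51:55.114 DEBUG SearchPipelinePerformance - "
        "processor=save queryid=_1196718714_619358 executetime=0.014secs")

fields = extract_kv(line)
for key, value in fields.items():
    print(f"{key}={value}")
```

A single pattern is enough here only because every pair uses the same delimiter and no value contains a space.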

Output of the ping command (humans: easy, machine: medium)

64 bytes from icmp_seq=0 ttl=64 time=2.522 ms
ideal structured information to extract:
time=2.522 ms

An interesting pattern to note here is that there is neither a consistent field-value delimiter nor a consistent field-value order. For the “from” field the authors have chosen a space as the delimiter, while for “icmp_seq”, “ttl” and “time” they’ve chosen the equal sign. For the “bytes” field they’ve placed the name after the value (yes, they might also have intended it to mean bytes, the data unit), while for the rest they’ve chosen field name followed by field value. Admittedly, some might think the current format is prettier than the following consistent log line, which could easily be parsed by machines. (Who thought log files were optimized for prettiness!?)

bytes=64, from=, icmp_seq=0, ttl=64, time=2.522 ms
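As a rough illustration of why the consistent form is easy for machines, the following Python sketch splits the line on its two delimiters. The function name `parse_consistent` and its parameters are made up for this example. Note that the “from” value comes out empty simply because the sample line above carries no address.

```python
def parse_consistent(line, pair_sep=", ", kv_sep="="):
    """Parse a line in which every pair is 'key=value' and pairs
    are separated by the same delimiter."""
    fields = {}
    for pair in line.split(pair_sep):
        # partition splits on the first "=" only, so a value may
        # itself contain further "=" characters.
        key, _, value = pair.partition(kv_sep)
        fields[key.strip()] = value.strip()
    return fields

line = "bytes=64, from=, icmp_seq=0, ttl=64, time=2.522 ms"
fields = parse_consistent(line)
print(fields)
```

Because the pair delimiter is a comma followed by a space, the value “2.522 ms” survives intact even though it contains a space. The original ping output offers no such guarantee.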

NetScreen log (humans: medium, machine: hard)

%MD% %DD% 13:41:25 NOC-FWa: NetScreen device_id=NOC-FWa [Root]system-notification-00257(traffic): start_time="2006-05-11 13:40:23" duration=62 policy_id=41 service=Network Time proto=17 src zone=noc-mgt dst zone=noc-svcs ......
ideal structured information to extract:
start_time=2006-05-11 13:40:23
service=Network Time
src zone=noc-mgt
dst zone=noc-svcs

This part of the NetScreen log line, …service=Network Time proto=17 src zone=noc-mgt dst zone=noc…, is a salient example of the ambiguity that sometimes exists in log data. What is the correct value of service? “Network” or “Network Time”? What about the name of the next field? Is it “Time proto” or just “proto”? Well, we can come up with an easy rule for this case; let’s call it Rule-1: “Field names should NOT contain spaces”. Fair/good enough!
Let’s move on to the next field. What is its correct name? “src zone” or just “zone”? A human can recognize that “src zone” is the correct field name, so we have just violated our Rule-1. We can continue this cycle of adding, violating, modifying and removing rules in our rule set, only to recognize that the cycle never ends. This simply translates into “there is no one solution/rule-set that is able to extract structure from ALL unstructured data”: there will always be a degenerate case that violates the rules.
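The effect of Rule-1 can be sketched in a few lines of Python. This is only one possible encoding of the rule, not a Splunk feature. The regular expression below assumes that a field name is a run of word characters (so no spaces) and that a value is either a quoted string or everything up to the next name followed by an equal sign.

```python
import re

# Rule-1 (assumed encoding): field names contain no spaces.
# A value is a quoted string, or the shortest run of text that is
# followed by the next "name=" or by the end of the line.
RULE_1 = re.compile(r'(\w+)=("[^"]*"|.*?)(?=\s+\w+=|$)')

fragment = ('start_time="2006-05-11 13:40:23" duration=62 policy_id=41 '
            'service=Network Time proto=17 src zone=noc-mgt dst zone=noc-svcs')

# Keep the pairs in a list: under Rule-1 the name "zone" appears twice.
pairs = [(key, value.strip('"')) for key, value in RULE_1.findall(fragment)]
for key, value in pairs:
    print(f"{key!r} -> {value!r}")
```

Under this rule “service” correctly gets the value “Network Time”, but “proto” picks up “17 src” and both zone fields collapse into the single name “zone”, which is exactly the violation described above.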

More degenerate log lines:

Stay tuned! Links in this section are coming soon….


Delimiter based key-value pair extraction
Delimiter based KV extraction – advanced
Stay tuned! More links coming soon….

Ledion Bitincka
