TIPS & TRICKS

Delimiter based key-value pair extraction

As described in my previous post, key-value pair extraction (or more generally structure extraction) is a crucial first step to further data analysis. While automatic extraction is highly desirable, we believe that giving our users tools to apply their domain knowledge is equally important. To this end, this post introduces one of the simplest forms of key-value pair extraction (KV-extraction): delimiter-based extraction.

Observation

Logged events usually contain a list of key-value pairs (e.g. an attribute list, method call values, etc.) in a context-dependent but well-defined format. An example of a well-defined format: “key-value pairs are separated from each other using ‘;’ while the key is separated from the value using ‘=’”. More generally, well-defined attribute listing formats are not confined to logging. They are part of every event-driven application with a flexible attribute order: e.g. URL GET parameter lists, HTTP request/response headers, email headers, etc. In most applications the delimiters are single characters that are least likely to be part of the key or value. Whenever a key or value does contain one of the delimiters, it is normally enclosed in literal-defining characters, usually double quotes (").

Definition: delimiter based KV extraction
Let’s first define three character classes:
1. [pairdelim] – non-empty list of characters used to separate key value pairs from each other. (chars after value, before next key)
2. [kvdelim] – non-empty list of characters used to separate the key from the value. (chars after key, before next value)
3. [quoter] – list of characters used to enclose a literal – currently *only* quotes are supported and this variable is not configurable

Thus we can formally define a key-value pair list as follows:

kvlist = <key>[kvdelim]<value>([pairdelim]<key>[kvdelim]<value>)*
key = <string>|<quoter><string><quoter>
value = <string>|<quoter><string><quoter>
quoter = "

Thus, delimiter KV-extraction can be achieved by a two layer tokenization/splitting process:
1. Split on the pair delimiter to extract candidate KV pairs
2. Split on key-value delimiter to separate key from value
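
To make the two layers concrete, here is a minimal Python sketch of the process. It is only an illustration of the definition above, not Splunk's actual implementation. The function names (split_outside_quotes, unquote, extract_kv) and the sample string are hypothetical, and the sketch assumes that a double quote always opens or closes a literal (no escaped quotes).

```python
def split_outside_quotes(text, delims, maxsplit=-1, quoter='"'):
    """Split text on any character in delims, ignoring delimiters inside quotes."""
    tokens, current, in_quote = [], [], False
    for ch in text:
        if ch == quoter:
            in_quote = not in_quote
        may_split = maxsplit < 0 or len(tokens) < maxsplit
        if ch in delims and not in_quote and may_split:
            tokens.append("".join(current))
            current = []
        else:
            current.append(ch)
    tokens.append("".join(current))
    return tokens


def unquote(token, quoter='"'):
    """Drop one pair of enclosing quotes, if present."""
    if len(token) >= 2 and token[0] == quoter and token[-1] == quoter:
        return token[1:-1]
    return token


def extract_kv(text, pairdelim, kvdelim):
    """Return a list of (key, value) tuples found in text."""
    pairs = []
    # Layer 1: split on the pair delimiters to get candidate KV pairs.
    for candidate in split_outside_quotes(text, pairdelim):
        # Layer 2: split each candidate once on the key-value delimiter.
        parts = split_outside_quotes(candidate, kvdelim, maxsplit=1)
        if len(parts) == 2 and parts[0]:
            pairs.append((unquote(parts[0]), unquote(parts[1])))
    return pairs


# Hypothetical sample: the quoted value contains both delimiters.
pairs = extract_kv('user=bob;msg="a=b;c";id=7', pairdelim=";", kvdelim="=")
print(pairs)
```

Candidates without a key-value delimiter are simply dropped in this sketch, and the second split stops after the first delimiter so that a value may itself contain kvdelim characters.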

Examples:
1. URL: the following example shows how to find the delimiters for the parameters listed in the query part of a URL. Note that because the query starts after the ‘?’ character, ‘?’ also qualifies as a key-value pair delimiter: it comes right before the first key.

data = "http://usasearch.gov/search?input-form=firstgov&v%3Aproject=firstgov&query=splunk+it&affiliate=uspto&x=0&y=0"
-------------
pairdelim is "?&" - # parameters in the query are separated by '&', the query starts after '?'
kvdelim is "=" - # variable names are separated from their values using '='
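
As a rough check of these delimiters, the following Python sketch applies them to the URL above. It mimics the idea with regular expressions and is not how Splunk does it internally. The helper name extract_kv is hypothetical, and the sketch ignores quoting and URL decoding.

```python
import re


def extract_kv(data, pairdelim, kvdelim):
    """Split on any pairdelim character, then once on any kvdelim character."""
    pair_pattern = "[" + re.escape(pairdelim) + "]+"
    kv_pattern = "[" + re.escape(kvdelim) + "]"
    fields = {}
    for candidate in re.split(pair_pattern, data):
        parts = re.split(kv_pattern, candidate, maxsplit=1)
        # Candidates without a key-value delimiter (here: the part
        # of the URL before '?') are not key-value pairs and are skipped.
        if len(parts) == 2 and parts[0]:
            fields[parts[0]] = parts[1]
    return fields


data = ("http://usasearch.gov/search?input-form=firstgov&v%3Aproject=firstgov"
        "&query=splunk+it&affiliate=uspto&x=0&y=0")
params = extract_kv(data, pairdelim="?&", kvdelim="=")
for key, value in params.items():
    print(key, "->", value)
```

Because ‘?’ is in pairdelim, the scheme, host and path end up in a candidate of their own. That candidate contains no ‘=’, so the sketch discards it.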

2. HTTP request header:

data = "GET / HTTP/1.1
Host: dev.splunk.com
Connection: close
User-Agent: Web-sniffer/1.0.25 (+http://web-sniffer.net/)
Accept-Encoding: gzip
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Referer: http://web-sniffer.net/"
-------------
pairdelim is "\r\n"
kvdelim is ":"
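
The same idea can be sketched in Python for the header example. Again, this is an illustration under stated assumptions rather than Splunk's implementation. The helper name extract_kv is hypothetical, the data is a shortened copy of the headers above, the sketch assumes a single kvdelim character, and trimming whitespace around keys and values is a choice made for this sketch.

```python
import re


def extract_kv(data, pairdelim, kvdelim):
    """Split on any pairdelim character, then on the first kvdelim."""
    pair_pattern = "[" + re.escape(pairdelim) + "]+"
    fields = {}
    for candidate in re.split(pair_pattern, data):
        # partition() splits on the first kvdelim only, so values that
        # contain ':' themselves (e.g. URLs) stay in one piece.
        key, found, value = candidate.partition(kvdelim)
        if found and key:
            fields[key.strip()] = value.strip()
    return fields


data = (
    "GET / HTTP/1.1\r\n"
    "Host: dev.splunk.com\r\n"
    "Connection: close\r\n"
    "Referer: http://web-sniffer.net/"
)
headers = extract_kv(data, pairdelim="\r\n", kvdelim=":")
for name, value in headers.items():
    print(name, "->", value)
```

The request line ("GET / HTTP/1.1") contains no ‘:’, so it yields no key-value pair in this sketch.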

That was easy, why don’t you try it on the following data?
Note: data_hard is all one line; our blogging software just does a poor job of displaying long lines.

data_easy = "May 4 14:47:28 gwrk1 sshd(pam_unix)[4572]: 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=test.abc.net"
data_hard = "loc=544078|action=encrypt|i/f_dir=inbound|i/f_name=eth4c0|__policy_id_tag=product=VPN-1 & FireWall-1[db_tag={xxxx-6123-40CA-XXXX-9620355xxxxx};date=1190818929;policy_name=NYC-BellSouth NY]|src=10.100.0.50|dst=10.104.0.21|proto=icmp|rule=9|scheme:=IKE"

Delimiter based KV extraction as part of kv/extract command
OK, great! Now that you know what delimiter-based KV extraction is and how to find the characters used as pair delimiters (pairdelim) and key-value delimiters (kvdelim), let’s look at how to instruct Splunk to perform this type of KV extraction. All you need to do is add the delimiters as arguments to kv/extract, as follows:

..... | kv pairdelim="?&" kvdelim="=" | .....
or
..... | extract pairdelim="?&" kvdelim="=" | .....

Configuration for automated delimiter based KV-extraction
transforms.conf is the key-value extraction configuration file. Delimiter-based KV extraction adds another configuration variable to the transforms.conf vocabulary, called DELIMS. Yes, you guessed right: this is where we specify the pairdelim and the kvdelim. The format of DELIMS is as follows:


DELIMS = <pairdelim>, <kvdelim>

Example:

in ...bundles/local/transforms.conf
.....
# this is equivalent to ..|kv pairdelim="?&" kvdelim="=" |...
[my_extraction]
DELIMS = "?&", "="
.....

You can then use the newly created transform just like any other transform. As a reminder, you can either:
1. Run it explicitly: ..... | kv my_extraction | .....
2. Run "my_extraction" automatically based on source/sourcetype/host, using a REPORT-* configuration variable in props.conf

----------------------------------------------------
Thanks!
Ledion Bitincka
