UPDATE: In 4.3 and later, search-time fields extracted from indexed fields work without any further configuration.
In the past couple of days I had to help people from support and professional services troubleshoot the exact same problem twice, so chances are it might be useful for you too 😉
I have set up a regex-based field extraction; let’s say the field name is MyField. When I run a search, say “sourcetype=MyEvents”, I see that the field is extracted correctly. However, when I run a search based on a value of MyField, say “sourcetype=MyEvents MyField=ValidValue”, nothing gets returned. WTF?
For the impatient, here’s how to solve this.
$SPLUNK_HOME/etc/system/local/fields.conf
[MyField]
INDEXED_VALUE = false
In order to understand what’s going on here, I need to first explain how Splunk search works. From a very high level there are two phases to search:
- retrieve a superset of the final events from the index
- filter the events after doing some processing, such as field extractions, lookups, eventtyping, tags etc (not necessarily in that order)
During the first step we optimize event retrieval based on the field values that are present in your search. One of the heuristics we use is to assume that field values exist as indexed tokens when they are simple terms, or that their parts exist as indexed tokens when they are phrases (phrases are made up of terms). What is an indexed token, you may ask? When we index an event we tokenize it based on some rules and add those tokens to the inverted index. Without loss of generality, let’s assume for now that we tokenize on non-alphanumeric characters, i.e. every word is a token. So let’s take the above search example and see what the search asks the index for:
search: sourcetype=MyEvents MyField=ValidValue
ask index: sourcetype=MyEvents AND ValidValue
host/source/sourcetype/index and some other fields are special fields that our index understands – what’s important here is that the search is asking the index for “ValidValue”. Now, if ValidValue is not a token in the index then the search would return nothing and we’d hit the problem described earlier – a more general search returns results and the field is extracted correctly, but searching for an existing value of a field returns nothing.
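To make the translation concrete, here is a minimal Python sketch of the heuristic. It is a toy model, not Splunk's actual implementation: the function name `index_query` and the set `INDEX_AWARE_FIELDS` are made up for illustration, and the set only contains the special fields named above.

```python
# Fields the toy index understands directly (an assumption for this sketch,
# limited to the special fields named in the post).
INDEX_AWARE_FIELDS = {"host", "source", "sourcetype", "index"}


def index_query(search):
    """Return the terms a toy search would ask the index for."""
    terms = []
    for clause in search.split():
        field, sep, value = clause.partition("=")
        if sep and field in INDEX_AWARE_FIELDS:
            # Special fields are passed to the index as field=value.
            terms.append(clause)
        elif sep:
            # Heuristic: assume the value of any other field is an indexed token,
            # so only the value is looked up.
            terms.append(value)
        else:
            # A bare term is looked up as-is.
            terms.append(clause)
    return terms


print(" AND ".join(index_query("sourcetype=MyEvents MyField=ValidValue")))
# prints: sourcetype=MyEvents AND ValidValue
```

The field name MyField never reaches the index in this model; only its value does, which is why everything hinges on that value being a token.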
Now, let’s look at how we can extract a field whose value is not an indexed token. It’s actually quite simple to do: just extract a substring of a token. Remember, tokens are the smallest unit that the index knows about. Let’s assume that we have the following event and extraction rules:
Fri Oct 7 19:32:23 PDT 2011 – simple event to reproduce the ValidValueFieldIssue
index tokens: Fri, Oct, 7, 19, 32, 23, PDT, 2011, -, simple, event, to, reproduce, the, ValidValueFieldIssue
$SPLUNK_HOME/etc/system/local/props.conf
[MyEvents]
EXTRACT-test = (?<MyField>ValidValue)
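The mismatch can be reproduced outside Splunk with a few lines of Python. This is only a sketch under the simplifying assumption used in this post (split on non-alphanumeric characters), so punctuation never becomes a token here. The helper `tokenize` is hypothetical, and Python spells a named group `(?P<MyField>...)` where Splunk's regex flavor uses `(?<MyField>...)`.

```python
import re

EVENT = "Fri Oct 7 19:32:23 PDT 2011 – simple event to reproduce the ValidValueFieldIssue"


def tokenize(raw):
    """Split on runs of non-alphanumeric characters (the simplified rule)."""
    return {tok for tok in re.split(r"[^A-Za-z0-9]+", raw) if tok}


tokens = tokenize(EVENT)

# Search-time extraction: pull MyField out of the raw event text.
match = re.search(r"(?P<MyField>ValidValue)", EVENT)
extracted = match.group("MyField")

print(extracted)                          # ValidValue
print(extracted in tokens)                # False: the value is only part of a token
print("ValidValueFieldIssue" in tokens)   # True: the whole word is the token
```

The extraction succeeds because the regex runs against the raw text, while the index lookup fails because it only knows whole tokens.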
Another way to end up with a field whose value is not in the index is to base the extraction on an indexed field whose value is not in _raw. For example, the values of source/sourcetype/host/… are not part of _raw.
$SPLUNK_HOME/etc/system/local/transforms.conf
[MyExtractor]
# let's assume host is ValidValue.example.com
SOURCE_KEY = host
REGEX = (?<MyField>ValidValue)
To confirm that you’re hitting this problem and not something entirely unrelated, you can try running the following type of search, which should return the expected results:
search sourcetype=MyEvents MyField=* | search MyField=ValidValue
Thus, this problem exists because of an optimization heuristic that works very well in the vast majority of cases but can drive you nuts if you don’t know about it. Luckily, like many things in Splunk, there’s a conf file for that. Specifying INDEXED_VALUE = false in fields.conf for the field in question disables the heuristic for that particular field, at which point instead of asking the index for the field value we ask for * in its place (note that this does NOT mean that we’re running a * search but search … MyField=* … ).
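The effect of the setting can be illustrated with a toy two-phase search in Python. Everything here is a simplified assumption rather than Splunk internals: the `search` function, its `indexed_value` flag, and the second sample event are invented for the example, and the wildcard is modeled as "every event passes phase one".

```python
import re

EVENTS = [
    "Fri Oct 7 19:32:23 PDT 2011 – simple event to reproduce the ValidValueFieldIssue",
    "Fri Oct 7 19:32:24 PDT 2011 – another event without the interesting string",
]
# Python spelling of the named group; Splunk would use (?<MyField>...).
EXTRACTION = re.compile(r"(?P<MyField>ValidValue)")


def tokenize(raw):
    """Split on runs of non-alphanumeric characters (the simplified rule)."""
    return {tok for tok in re.split(r"[^A-Za-z0-9]+", raw) if tok}


def search(value, indexed_value=True):
    """Toy model of a MyField=<value> search."""
    # Phase 1: retrieve a superset of the final events from the "index".
    if indexed_value:
        # Heuristic on: ask the index for the value as a token.
        candidates = [e for e in EVENTS if value in tokenize(e)]
    else:
        # Heuristic off: ask for * instead, so every event is a candidate.
        candidates = list(EVENTS)
    # Phase 2: run the extraction and filter on the extracted field.
    results = []
    for event in candidates:
        m = EXTRACTION.search(event)
        if m and m.group("MyField") == value:
            results.append(event)
    return results


print(len(search("ValidValue")))                       # 0
print(len(search("ValidValue", indexed_value=False)))  # 1
```

With the heuristic disabled, phase one hands every event to phase two, which is also why turning it off without need can cost search performance.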
Words of wisdom
Using regexes to extract fields from events gives you great flexibility, but there are some pitfalls, like the one described here, that could waste a lot of your time. So, next time you’re creating a field extraction rule, consider whether the extracted field value will be in the index (i.e. would a search for ValidValue return anything?). If it will not, set INDEXED_VALUE = false. However, don’t set this value to false blindly or unnecessarily, otherwise your search performance could suffer dramatically.