Lost in Translation?

Maybe not. I’ve always wondered why we tend to think the heart of things gets lost in interpretation. I’d like to think there are some pretty amazing things found in translation, like a good chuckle or a new revelation. Poetry and idioms aside, lots of things come to mind which are enhanced by interpretation or transformation. Take Exhibits A and B.

And when your girl or guy says “I’m hungry,” your quality of life will be much improved if you can immediately decipher this to mean “sudo make me a sandwich.” Then there are URL-encoded logs and encrypted data (not to be confused with cryptic data). What they’re really telling you is… well, actually that’s not quite clear yet!

If it is your divine fortune in life to encounter data which is encoded, encrypted or otherwise obscured, there is actually a way to process it intelligently so it is both readable and searchable in Splunk.

A Solution

This solution comes to us from a savvy customer who has a hot mess of data to analyze in order to understand where problems exist in their web-based applications. Application logs are generated by millions of clients globally and aggregated by several dozen central servers. The logs are rotated daily and made available as raw files. Events are partially text, containing only a single field which is URL encoded. Many parties are keen on unlocking this data to better understand error density and user behavior.

The key requirements are:

  • Splunk must manage the entire processing sequence, starting with real-time monitoring.
  • Searchability must be transparent; end users should not need to change their search behavior.
  • Due to existing high load on the servers, no changes to the current collection infrastructure will be accommodated (this is a no-Splunk-forwarder zone).

We evaluated several options, but eventually settled on something which met all three requirements. First, let’s cover the solution, then I’ll review alternatives considered.

For illustration, let’s say the URL encoded events have this structure:
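For instance, a hypothetical encoded event might look like this (timestamp format per the linebreaking rule configured below; field values invented for illustration):

```
2011-03-01 10:15:00 content=user%3Dalice%26action%3Dlogin%26status%3D404
```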

After URL decoding, we want the events to look like this:
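For instance, a hypothetical decoded event, with the originating host appended as a field (values invented for illustration):

```
2011-03-01 10:15:00 content=user=alice&action=login&status=404 orig_host=webserver01
```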

In pseudo-configuration, this is the shape of the solution:

  1. Index the encoded data source into a separate index.
    For this exercise, we chose to separate the encoded data into its own index which is not searchable by default, index=encoded.

  2. Create a custom search command to perform the URL decoding.
    The search command is implemented as a Python script, which also preserves the originating host.

  3. Combine the custom search command with the collect command which writes to a new index.
    The collect command from the summary indexing tool chest allows us to take a search, process the contents, then write the results to an index of choice. In this case, we chose to write the final results to a separate index.

    index=encoded | decode | collect index=decoded file="$timestamp$_$random$_decoded.log"

  4. Schedule the decode + collect as a search which runs every n minutes.
    To accommodate reasonable potential lags in indexing, we scheduled the search every five minutes to process events starting (now – 10 minutes) to (now – 5 minutes).

  5. Update config for the decoded events, which are eventually sourcetyped ‘stash’ by the collect command.
    This involves overriding the default rules for linebreaking and automatic field extraction.

  6. Configure Splunk to honor the originating host.
    Since the events are being reprocessed by Splunk, the originating host is overridden with the host performing this reprocessing. We planned for this by augmenting the new raw event with a field for the original host, but we also needed to account for this in the host configuration.

  7. Smile and search.
    Now that the data has been decoded, simply preface searches with ‘index=decoded’ or configure the index as a default index.
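For example, a hypothetical search against the decoded data (the search term and aggregation are illustrative):

```
index=decoded "status=404" | stats count by orig_host
```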

The Guts and the Glory

Below are the gory details: scripts and configuration for all the steps above. The Search App was chosen as the configuration location. Some of this can be done via the Manager in SplunkWeb, but for simplicity the configuration is presented here in .conf form.

$SPLUNK_HOME/etc/apps/search/bin/ (full script here):

(Since this blog was posted, a faster, more scalable version of the script was written. It uses the new streaming method for retrieving events, which has a much reduced CPU load. It also removes the need to use the collect command and gets us around the max events limit. This will make backfill much easier: backfill can be performed in a single invocation of the script, rather than many non-overlapping invocations. The new script is posted here.)

# url decode the url encoded value of the content field
decodedContent = urllib.unquote(content)
# replace the original _raw field with the new decoded _raw and
# include the original host as a new field (slice through the end
# of " content=" so the "=" is preserved)
newRawEvent = rawEvent[0:string.find(rawEvent, " content=") + len(" content=")] + \
    decodedContent + " orig_host=" + r["host"]
r["_raw"] = newRawEvent
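The core of that transformation can also be sketched as a standalone Python 3 function, outside Splunk (a minimal sketch: urllib.parse.unquote stands in for the Python 2 urllib.unquote, and the " content=" marker follows the excerpt above):

```python
from urllib.parse import unquote

def decode_event(raw_event, host, marker=" content="):
    """URL-decode everything after the content= marker and append the
    originating host so it survives reindexing. Assumes the marker is
    present exactly once in the raw event."""
    i = raw_event.find(marker) + len(marker)
    return raw_event[:i] + unquote(raw_event[i:]) + " orig_host=" + host
```

Within Splunk, the custom command applies this same per-event rewrite to _raw before the results are handed to collect.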


$SPLUNK_HOME/etc/apps/search/local/commands.conf:

[decode]
filename =
retainsevents = true

$SPLUNK_HOME/etc/apps/search/local/indexes.conf (full indexes.conf here):

[encoded]
homePath = $SPLUNK_DB/encodeddb/db
coldPath = $SPLUNK_DB/encodeddb/colddb
thawedPath = $SPLUNK_DB/encodeddb/thaweddb

[decoded]
homePath = $SPLUNK_DB/decodeddb/db
coldPath = $SPLUNK_DB/decodeddb/colddb
thawedPath = $SPLUNK_DB/decodeddb/thaweddb


$SPLUNK_HOME/etc/apps/search/local/props.conf:

[stash]
BREAK_ONLY_BEFORE = ^\d{4}-\d{2}-\d{2}
TRANSFORMS-ep = setHost
# Re-enable default field extractions for convenient searching
KV_MODE = auto


$SPLUNK_HOME/etc/apps/search/local/transforms.conf:

[setHost]
DEST_KEY = MetaData:Host
REGEX = orig_host=(\S+)
FORMAT = host::$1


$SPLUNK_HOME/etc/apps/search/local/savedsearches.conf:

# To account for reasonable indexing lag, we schedule this with a 5 minute buffer.
[Scheduled URL Decoding]
cron_schedule = */5 * * * *
dispatch.earliest_time = -10m@m
dispatch.latest_time = -5m@m
enableSched = 1
search = index="encoded" | decode | collect index=decoded file="$timestamp$_$random$_decoded.log"

Hmmm (Things to Consider)

Along the way, we had to factor in some interesting behavior and challenges encountered.

  • Licensing
    In order for the decoding to be excluded from the daily license calculations the sourcetype ‘stash’ must be retained. If you choose to change the sourcetype, Splunk will consider the events indexed twice–first into index=encoded, second into index=decoded–and will deduct from the license capacity accordingly. This can be avoided by keeping the sourcetype as ‘stash’. Beware not to store other sourcetypes in index=decoded, otherwise everything in the index will be charged against the license twice. Eeps.
  • Storage
    While you may not be double-charged in the license usage, this solution essentially double-stores each event. To minimize the impact, consider reducing the retention policy on the index where the encoded events are first indexed (in our case, index=encoded).
  • Administration
    We chose to decode into a separate index. This will serve to logically and physically separate the decoded data from the rest of the indexed data for administration purposes. This simplifies the configuration of access controls and allows for a different retention policy to be applied. This can be easily changed to index into the main/default index, if preferred.
  • What About a Search Language-Only Solution?
    This solution can also be implemented without a custom search command, instead using only the available and more familiar search commands–eval, fields, collect.

    index=encoded | eval decodedContent=urldecode(content) | rex field=_raw "^(?<begin>.*)\scontent=" | eval decoded_event=begin+" content="+decodedContent | fields - _raw | fields + decoded_event | rename decoded_event as _raw | collect index=decoded file="$timestamp$_$random$_decoded.log"

    While this may be more convenient for our particular solution, the advantage of using a script or custom search command is it is capable of doing more complex processing. If the requirement is not merely to URL decode, but decrypt or translate, there is a much wider range of operations which can be performed in a script (e.g. query a database or key store, break output into multiple events).

  • Custom Search Script Limitations
    We seemed to hit a limit with the number of events the python script could process in a single invocation–somewhere around 10,000 events. It’s not clear if this is a limitation of the python script or Splunk’s ability to pass data off to the script or custom search commands in general. In any case, if this is a concern, schedule the decoding search on a more frequent interval (e.g. every minute, instead of every five minutes). Alternatively, use the new version of the decode script, which retrieves events via streaming and has no max event cap. Or if possible use Splunk’s native search language as described in the paragraph above. This should also remove any limitations on the number of events processed.

  • Manageability
    Finally, one of the stickier aspects of this approach is its manageability. Should the indexer need to go offline (let’s say for upgrades), one or more iterations of the scheduled decoding may not execute. This means a gap in the decoding will exist. When this occurs, a manual backfill will need to be performed. For visibility into the execution progress you can watch Splunk’s job scheduler via a scheduled alert.
  • Alternative Approaches

    Before settling on the solution above, we reviewed and implemented many other options. The first alternative is much preferred by Splunk, but was not employed in this instance because it did not meet the key requirements.

    • Option #1: Pre-process the data with an external tool/script before Splunk monitors it.
      The script can be scheduled through the OS as a service/job. Many versions ago, Splunk used to support the insertion of pre-processors in the indexing pipeline, but this is no longer the case.

      Advantage: Keyword searches will operate on the unencoded/decrypted field content, so searching can be performed as business as usual. Configuration dependencies in Splunk are greatly reduced, and the risk of having to manually backfill should the Splunk indexer go offline is removed.
      Disadvantage: Requires additional infrastructure external to Splunk to pre-process data.

      Discussion on Splunk Answers:

    • Option #2: Feed data at frequent intervals to a Splunk batch input, then use the unarchive_cmd parameter in props.conf to decode at index time.
      This would utilize the same framework used to index archived data (e.g. gzip, bzip2).

      Advantage: Performs in-stream decoding of the data which minimizes configuration.
      Disadvantage: Existing log rotation policy may need to be adjusted for frequent batch loading. Will not work if data cannot be collected by Splunk forwarders.

    • Option #3: Have Splunk index the data as is, then use an external lookup to decode/decrypt fields at search time.
      The external lookup can be implemented as a perl/bash/python script. Splunk will present the decoded field as a new separate field (screenshot below).

      Advantage: A lookup is easy to configure and is transparent to the end user.
      Disadvantage: Keyword searches do not operate on the decoded field value, so users must preface searches with the decoded field name (e.g. ‘decodedContent=*interesting things*’ vs. ‘interesting things’).

    • Option #4: Use the Splunk search language.
      Splunk can URL decode fields at search time by invoking the ‘eval’ command (e.g. sourcetype=”encodedstuff” | eval decodedContent=urldecode(content) ). Similarly, a custom search command can be created for the decoding or anything more sophisticated involving decryption.

      Advantage: Requires no configuration and produces the same result as an external lookup (option #3).
      Disadvantage: Same limitations as, and much less transparent than, the external lookup (option #3); requires users to know and use more advanced/custom search commands.


    If anything, this exercise is a testament to Splunk’s flexibility. There are multiple ways to accomplish this task while balancing several sticky restrictions. Hopefully that is clear. What is unclear is why the data is stored in URL encoded form in the first place. 😉 In any case, whether your data is encoded, encrypted or obscured, there are lots of gems to be gained in translation. I am optimistic you can adapt these strategies to your own environment and requirements. Let us know how it goes. And if you think of another way to do this, we’d love to hear it.

Posted by Vi Ly