When entropy meets Shannon

This is the third post on URL analysis, please have a look at the two other posts for more context about what can be done with Splunk to analyze URLs:

You will find in this article information on how one can detect DNS tunnels. While you can find lots of very useful apps on Splunkbase to help you analyze DNS data, it is always good for curious individuals to discover some techniques being used underneath.

A lot of captive portals are bypassed everyday by anyone able to run a DNS request, if someone can run on their machine the following command:

$ host has address

Without being authenticated on the captive portal, then they can use any service on the internet using a DNS tunnel. There are a lot of tools out there to create those tunnels. And for a great paper on the topic, I encourage you to read the Detecting DNS Tunneling from SANS Institute.

Claude Shannon to the rescue!

Claude Shannon
By Jacobs, Konrad -, CC BY-SA 2.0 de, Link

Long time ago, the venerable Claude E. Shannon wrote the paper “A Mathematical Theory of Communication“, which I strongly encourage to read for its clarity and amazing source of information.

He invented a great algorithm known as the Shannon Entropy which is useful to discover the statistical structure of a word or message.

If you consider a word, being a discrete source of the finite number of characters type which can be considered, for each possible character there will be a set of probabilities which would produce various outputs. There will be an entropy for each character. This entropy on the chosen word is defined as the average of the output weighted on the probability of occurrence of the characters.

The previous paragraph can easily be translated into the following Python code (taken from the excellent URL Toolbox on Splunkbase:

def shannon(word):
    entropy = 0.0
    length = len(word)
    occ = {}
    for c in word :
        if not c in occ:
            occ[ c ] = 0
        occ += 1

    for (k,v) in occ.iteritems():
        p = float( v ) / float(length)
        entropy -= p * math.log(p, 2) # Log base 2
    return entropy

Which can be run directly from any word you can have in Splunk:

Splunk search Shannon entropy

As you can see, the score is pretty high, which makes sense since there is a high variety of frequency over those data. If we click on the ut_shannon field to sort in reverse order, this is what you could get:


As one can see, words of low characters distribution get a low score.

Catching DNS tunnels from subdomains in URLs

If we run the following query, interesting results are shown:
sourcetype="isc:bind:query" | eval list="mozilla" | `ut_parse(query, list)` | `ut_shannon(ut_subdomain)` | table ut_shannon, query | sort ut_shannon desc


As you can see in the results here, the high score come from tunnels made to the domain as well as something which is unknown but could also be a tunnel: traffic towards


I hope this post helps you to see tools and methodologies one can use to find out unusual activity strictly based on the DNS traffic. More to come…


Sebastien Tricaud

Posted by