I recently spoke with a few Splunk customers who all face the same challenge:
“How do I make use of URI, source, and user agent data to understand what my users are doing or to find malicious behavior?”
URIs, sources, and user agents are common data points that can be collected from web access logs, proxies, web application firewalls (WAF), and in some cases network monitoring tools like Zeek. They tell us which sources are interacting with which endpoints and what type of client software those sources are using.
Despite being so common, user agents can be difficult or intimidating to work with, and they are hard to use as a starting point for finding similar behaviors. This is largely due to the nature of the data: user agent strings are free-form text that can combine many different elements in many different ways. Compared to something like a key-value pair, a data source like this lacks clear structure to work with.
Example user agent string:
Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; GT-P5113 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30
This lack of structure makes it frustrating or difficult to ask any questions other than “Have we seen this specific combination of elements before?” or “Where else have I seen this specific user agent?” Fuzzy matching and finding “similar” patterns are actually quite hard.
If we want to go further with our analysis, what questions might we need to answer?
One way to get started applying advanced data-science methodologies to user agent analysis in Splunk is the Splunk App for Data Science and Deep Learning (DSDL). Connected to a container environment, the app provides access to a Jupyter Python interface that makes it easy to incorporate custom code and open-source libraries. The use case below provides example code and notebooks for DSDL.
By using the broader capabilities of a Python data-science environment such as DSDL, we can take advantage of a wider range of advanced analysis techniques. We are no longer limited to parsing out sections of user agents or matching particular words or patterns; instead, we might take an approach like the one detailed in this blog from GreyNoise. This approach tokenizes and encodes a user agent string into a series of numbers (a vector), which is much more amenable to interesting analysis:
This methodology parses the user agent string (0) into a set of tokens or string elements (1). Each token is then hashed and converted to a number (2), producing a pseudo-random number specific to that token. Lastly, a numerical signature (3) is created from this list of numbers, a signature that can then be used to compare user agents.
This vector encoding preserves some information about the similarity of the original text, allowing the numerical signatures of similar strings to be compared. Put simply, it becomes much easier to ask useful questions about our data.
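To make the idea concrete, here is a minimal Python sketch of such a hashing encoder. It illustrates the technique rather than reproducing the exact code in the DSDL notebooks; the tokenizer, hash function, and signed-bucket scheme are all simplifying assumptions.

```python
import hashlib
import re

def encode_user_agent(ua: str, vector_length: int = 8) -> list[float]:
    """Encode a user agent string as a fixed-length numeric vector
    using the hashing trick."""
    # (1) Split the string into tokens on common delimiters.
    tokens = [t for t in re.split(r"[ ;/()]+", ua) if t]
    # (3) Fold the token hashes into a fixed-length signature.
    vector = [0.0] * vector_length
    for token in tokens:
        # (2) Hash each token to a stable pseudo-random integer.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "little")
        # Each token increments one bucket; a high-order hash bit picks
        # the sign so that different token sets produce distinct signatures.
        sign = 1.0 if (h >> 60) & 1 else -1.0
        vector[h % vector_length] += sign
    return vector

ua = ("Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; GT-P5113 Build/IMM76D) "
      "AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30")
print(encode_user_agent(ua))
```

With an encoder like this, two user agents that share most of their tokens end up with nearby vectors, which is exactly the property we exploit below.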
You can see this idea in the scatter diagram below. Each point represents a single user agent’s encoded vector: similar user agents sit in one corner and very different user agents in another. A real encoding would have many more than two dimensions, but this simple example demonstrates the key point: similar user agents have similar coordinates in our vector space, so we can easily find agents that are alike.
We have created several Jupyter notebooks for use with DSDL that implement this encoding and querying functionality for sets of user agents. These are available in the 5.1.1 release. With these notebooks in place, we can encode and compare vectors of encoded text directly from the Splunk Search Processing Language (SPL). Here’s what an encoding search might look like:
```
index="dsdl_testing"
| table user_agent
| fit MLTKContainer algo=hashing_encoder vector_length=8 user_agent
| rename predicted_* as vector_*
```
Results:
This SPL takes the input field “user_agent” and converts it to an 8-dimensional vector in the fields “vector_0” to “vector_7”.
Now that we’ve got this representation, what can we do with it? We can start to ask more useful questions like, “Can you show me any user agents that are similar to this one?”, which you can see implemented in the “Reference Search” dashboard below.
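Under the hood, “similar” can be as simple as a distance measure over the encoded vectors. The sketch below ranks candidates against a reference vector by cosine similarity using NumPy; the matrix stands in for the vector_0 to vector_7 fields produced by the SPL above, and the random data is purely a placeholder.

```python
import numpy as np

def top_k_similar(reference: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k vectors most similar to the
    reference vector, ranked by cosine similarity."""
    ref = reference / np.linalg.norm(reference)  # unit-normalise the reference
    norms = np.linalg.norm(vectors, axis=1)
    sims = (vectors @ ref) / np.where(norms == 0, 1.0, norms)
    return np.argsort(sims)[::-1][:k]

# Placeholder: one row per user agent, columns vector_0 .. vector_7.
vectors = np.random.rand(10_000, 8)
reference = vectors[0]  # the user agent we are pivoting on
print(top_k_similar(reference, vectors))
```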
At this point, the activity associated with these related user agents can be explored, and pivoting around the dataset becomes much easier. Imagine you’ve discovered a user agent that relates to malicious activity: it now becomes very easy to find minor variations that might also be malicious.
It’s possible to take this idea even further using a technique like UMAP. UMAP takes a set of high-dimensional numerical input vectors, exactly like our encoded user agents, and projects them down into a 2-dimensional image to explore, much like our simplified example above but on a much larger scale. Below you can see an example where we’ve used clustering to identify groups of similar user agents. In the face of changing behaviors, this can be a valuable tool for drilling down and finding related activity.
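As a rough illustration, the following sketch uses the umap-learn and scikit-learn libraries to project encoded vectors down to two dimensions and cluster them. DBSCAN is just one reasonable clustering choice, not necessarily the one used for the figure, and the random input stands in for real encoded user agents.

```python
import numpy as np
import umap                      # pip install umap-learn
from sklearn.cluster import DBSCAN

# Placeholder for a large set of encoded user agent vectors.
vectors = np.random.rand(50_000, 8)

# Project the 8-dimensional vectors down to 2 dimensions for plotting.
embedding = umap.UMAP(n_components=2, metric="cosine").fit_transform(vectors)

# Group nearby points into clusters of similar user agents.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(embedding)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Found {n_clusters} clusters")
```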
Finally, let’s think about scale. For problems that fit within a single container, the DSDL approach demonstrated above works well: I’ve tested tens of thousands of user agents on very modest hardware without any problems. For infrequent use cases, this might be sufficient. To go further, the code in our example could doubtless be optimized and moved to a GPU to scale comparisons with relative ease, but even the biggest problems have options.
Below you’ll see the architecture for integrating DSDL with the open-source vector database Milvus. This permits vectors and labels to be encoded and stored in a Milvus collection, and searches to then be enriched with results queried from Milvus.
Milvus is easy to set up in a container environment and scales to meet the needs of a use case, with the potential to handle many millions or billions of vectors if properly resourced. This prototype DSDL architecture can be found here and can be deployed with Docker Compose in seconds.
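As a sketch of what that store-and-query flow might look like with the pymilvus client, assuming a local Milvus instance on its default port; the collection name, schema, and index parameters below are illustrative choices, not the exact prototype configuration.

```python
from pymilvus import (connections, Collection, CollectionSchema,
                      FieldSchema, DataType)

# Connect to a local Milvus instance (default Docker Compose port).
connections.connect(host="localhost", port="19530")

# Illustrative schema: an auto-generated ID, the raw user agent
# string, and its 8-dimensional encoded vector.
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("user_agent", DataType.VARCHAR, max_length=1024),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=8),
])
collection = Collection("user_agents", schema)

# Store encoded user agents (vectors come from the encoder above).
collection.insert([
    ["Mozilla/5.0 (Linux; U; Android 4.0.4; en-us) Safari/534.30"],
    [[0.12, 0.83, 0.05, 0.44, 0.91, 0.27, 0.66, 0.38]],
])
collection.create_index("vector", {"index_type": "IVF_FLAT",
                                   "metric_type": "L2",
                                   "params": {"nlist": 128}})
collection.load()

# Query: find the 5 nearest neighbours of a reference vector.
hits = collection.search(
    data=[[0.12, 0.83, 0.05, 0.44, 0.91, 0.27, 0.66, 0.38]],
    anns_field="vector",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    output_fields=["user_agent"],
)
for hit in hits[0]:
    print(hit.entity.get("user_agent"), hit.distance)
```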
This article demonstrates that encoding a complex data structure such as a user agent in a way that allows more sophisticated analysis is possible with Splunk and DSDL. We’ve created useful workflows for a threat hunter conducting Model-Assisted Threat Hunting using the DSDL app and Splunk dashboards, making the process of finding related activity much simpler and more efficient. This needn’t stop at user agents, though. Think about your day-to-day work: how often do you find a field or an event where you want to ask, “Show me all the things that are like this, but not exactly the same”? In my experience, this sort of question comes up for many different types of data: usernames, TLS signatures, process names, URLs, and URIs, to name just a few. Implementing this kind of logic could drastically improve the flexibility with which you can dive in and explore to answer such questions.
Lastly, we’ve shown how an open-source vector database such as Milvus could be used to scale a use case to many millions or billions of events.
Code, information, and examples can be found on the DSDL GitHub page. For any questions or assistance, please contact me at jcowling(at)splunk.com or reach out on LinkedIn.
I’d like to acknowledge the work and time of a number of amazing colleagues in researching, developing and refining these ideas: