One common misconception about machine learning methodologies is that they can completely remove the need for humans to understand the data they are working with. In reality, they often place a greater burden on an analyst or engineer to ensure that the data meets the requirements for cleanliness and standardization assumed by the methods used. However, when the complexity of the data becomes significant, how is a human supposed to keep up? One approach is to use ML to find ways to keep a human in the loop!
Dimensionality reduction methods such as PCA, tSNE and UMAP allow us to take complex, encoded datasets and reduce them down to diagrams that allow us to bring human intuition and understanding back into our processes.
In January at SANS CyberThreat2022(3), I will explain how these techniques can be applied to JA3 TLS Signatures. Collecting TLS signatures can help you to keep track of known, unknown and malicious software. In addition to this presentation, I'm working with the SURGe team at Splunk to build on our work of investigating the use of JA3 signatures to mitigate Supply Chain attacks.
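For readers new to the format, here is a minimal sketch of what sits behind a JA3 hash. A JA3 fingerprint is the MD5 hash of a comma-separated string of five TLS Client Hello fields (version, ciphers, extensions, elliptic curves, point formats), with the values inside each field joined by dashes. The function names and the example string below are made up for illustration.

```python
import hashlib

# The five comma-separated JA3 fields, in order.
JA3_FIELDS = ("version", "ciphers", "extensions", "curves", "point_formats")


def parse_ja3(ja3_string):
    """Split a raw JA3 string into its five fields as lists of ints."""
    parts = ja3_string.split(",")
    if len(parts) != len(JA3_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return {
        name: [int(v) for v in part.split("-") if v]
        for name, part in zip(JA3_FIELDS, parts)
    }


def ja3_hash(ja3_string):
    """The JA3 fingerprint is the MD5 hex digest of the raw string."""
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Illustrative, made-up JA3 string (not taken from any real dataset).
example = "771,4865-4866-49195,0-11-10-35,29-23-24,0"
fields = parse_ja3(example)
fingerprint = ja3_hash(example)
```

The hash is handy for lookups, but it is the parsed fields that carry the structure a dimensionality reduction method can work with.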
In short, these dimensionality reduction techniques allow us to take a set of JA3 hashes, along with some of the information comprising these signatures, and turn them into a map that shows the space of software communications in a dataset:
In applying tSNE to generate this Petri dish-like representation of JA3 signatures from the dataset available at ja3er.com, we see a number of structures that emerge when we plot these signatures in a 2D space. Every blue point in this diagram is a unique signature. Many signatures together form the clouds and clusters seen in this diagram. Signatures that are similar are close together and those that are different are forced apart, creating a simple and intuitive 2D representation of a very complicated dataset!
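As a rough sketch of how a map like this could be produced, the snippet below runs scikit-learn's `TSNE` over multi-hot encoded signatures. It is not the exact pipeline behind the figure: the data is synthetic (two made-up "software families" drawing IDs from overlapping pools), and the encoding and the perplexity value are assumptions chosen for illustration.

```python
import random

import numpy as np
from sklearn.manifold import TSNE

random.seed(0)

# Synthetic stand-in data: two made-up "software families", each drawing
# cipher/extension IDs from its own pool. This is NOT real JA3 data.
POOLS = {
    "family_a": list(range(0, 30)),
    "family_b": list(range(20, 50)),
}


def fake_signature(pool, k=12):
    """Return a random set of integer IDs standing in for a parsed JA3."""
    return set(random.sample(pool, k))


signatures = [fake_signature(POOLS[name]) for name in POOLS for _ in range(40)]

# Multi-hot encoding: one column per ID, 1.0 if the signature contains it.
vocab = sorted(set().union(*signatures))
X = np.array([[1.0 if v in sig else 0.0 for v in vocab] for sig in signatures])

# perplexity must be smaller than the number of samples (80 here).
embedding = TSNE(
    n_components=2, perplexity=15, init="random", random_state=0
).fit_transform(X)
```

Each row of `embedding` is an (x, y) position for one signature, ready to hand to a scatter plot.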
By pulling in some labels for this space, we can start to identify regions of this map where malicious software congregates and use this as a visual aid when threat-hunting or observing new and recurring traffic in our environment. This diagram shows some labeled malicious JA3 signatures (red) against the ja3er.com dataset.
So, if we see lots of activity near these malicious points in the future, that might be worth examining, since those communications will share a lot of the same structure and features as these malicious communications.
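One simple way to put a number on "near" is sketched below. It is an assumption on my part rather than the method used for the map: instead of measuring distance in the 2D plot, it compares the raw contents of the signatures using Jaccard similarity. The helper names and all JA3 strings are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ja3_tokens(ja3_string):
    """Turn a raw JA3 string into a set of (field_index, value) tokens."""
    return {
        (i, value)
        for i, part in enumerate(ja3_string.split(","))
        for value in part.split("-")
        if value
    }


def nearest_malicious(new_ja3, malicious_ja3s):
    """Return (best_similarity, best_match) against labeled signatures."""
    new = ja3_tokens(new_ja3)
    return max((jaccard(new, ja3_tokens(m)), m) for m in malicious_ja3s)


# Made-up strings for illustration only; not real malware signatures.
labeled_malicious = [
    "769,47-53-5-10,0-10-11,23-24,0",
    "771,49195-49199,0-5-10,29,0",
]
observed = "769,47-53-5,0-10-11,23-24,0"
score, match = nearest_malicious(observed, labeled_malicious)
```

A high score means the observed Client Hello shares most of its ciphers, extensions and curves with a labeled signature. What counts as "high enough to investigate" is a threshold you would need to tune on your own data.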
It’s also possible to generate maps of smaller spaces where we compare and contrast the behaviors of multiple hosts. The following example uses UMAP to visualize the clusters of behaviors seen across five different hosts on a single day. Points in clusters or close to others represent either identical or very similar JA3 signatures, and we can clearly see anomalous behavior on the green host as it sits in its own separate cluster. Could it be that this host is using different, unpatched, out-of-date or malicious software? Time to investigate!
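The same question ("which host behaves unlike the others?") can also be asked without a plot. The sketch below is a much cruder set-overlap check, not UMAP: it compares the sets of JA3 hashes seen on each host and flags the host that shares the least with its peers. Host names and hash placeholders are made up.

```python
from itertools import combinations

# Made-up observations: host name -> set of JA3 hashes seen that day.
host_signatures = {
    "host_a": {"h1", "h2", "h3", "h4"},
    "host_b": {"h1", "h2", "h3", "h5"},
    "host_c": {"h2", "h3", "h4", "h5"},
    "host_d": {"h1", "h3", "h4", "h5"},
    "host_e": {"h8", "h9", "h3"},
}


def jaccard(a, b):
    """Jaccard similarity between two sets of hashes."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


# Accumulate each host's similarity to every other host.
totals = {host: 0.0 for host in host_signatures}
for left, right in combinations(host_signatures, 2):
    sim = jaccard(host_signatures[left], host_signatures[right])
    totals[left] += sim
    totals[right] += sim

# The host with the lowest mean overlap is the candidate outlier.
mean_overlap = {h: t / (len(host_signatures) - 1) for h, t in totals.items()}
outlier = min(mean_overlap, key=mean_overlap.get)
```

Here `outlier` comes out as `host_e`, the only host whose signatures barely overlap with the rest. Unlike the map, this treats signatures as either identical or different, so it misses "very similar" behavior.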
OK, cool. But what can I do with this in Splunk?
I’ve implemented an example that uses JA3 signatures to classify host TLS behaviors in the latest version of Splunk’s App for Data Science and Deep Learning (DSDL), so feel free to grab it and take a look.
However, I believe that these sorts of advanced dimensionality reduction techniques are likely to be useful well beyond this simple example. We can hopefully take some of the more general but very complex datasets we see often in security and make them far more accessible. If you’d like to dig in further or just chat about what’s possible, please feel free to reach out to me on LinkedIn.