April 17, 2020

Understanding and Baselining Network Behaviour using Machine Learning - Part I

By Greg Ainslie-Malik

Customers have been asking us for many years how to manage their networks more effectively, and the topic has become increasingly important as working from home becomes the new normal across the globe.

In this blog series, I thought I’d present a few analytical techniques that we have seen our customers deploy on their network data to:

Better understand their network
Develop baselines for network behaviour and detect anomalies

We’re going to use the Coburg Intrusion Detection Data Sets (CIDDS) to perform the analytics. The researchers behind CIDDS generated a dataset that mimics a ‘production’ environment under attack, so expect some odd behaviour patterns! Specifically, we’re going to use the OpenStack traffic logs from CIDDS-001, which are flat CSV files containing NetFlow logs.

You’ll have to take my word for it, but I’ve never used this dataset before and I have no idea what any of the network devices do… We did reach some interesting conclusions from the analysis in this blog series, however, with many thanks to Markus Ring and Sarah Wunderlich, two of the researchers who helped produce the dataset, for their help. We’ll outline some of the conclusions as we walk through the analysis (many of them are described in more detail in the CIDDS technical report).

Understanding Your Network

The exam question here is: Do you understand your network? Many of our customers struggle to identify the key nodes on their network, which are often the most critical for maintaining uptime.

A great way to discover your network topology is by using graph analytics. My colleague Philipp Drieger recently published an article describing how to use Splunk for graph analytics using the 3D Graph Network Topology Visualization app that he developed with Erica Pescio and some awesome folks at Siemens. Here I thought I’d present some examples of how you can use this app to perform network discovery.

Visualising Your Network

For demonstration purposes, we are going to use the CIDDS traffic data in this blog. In practice, you would run the following analytics on your own network data. Using the app, run the following search to visualise the network topology from the CIDDS data, where we are filtering the data to only show results that relate to internal IP addresses (those with a 192.168 prefix in this dataset):

| tstats count WHERE (index=cidds) BY "Src IP Addr" "Dst IP Addr"
| rename "Src IP Addr" as src_ip "Dst IP Addr" as dest_ip
| search src_ip=192.168.* dest_ip=192.168.*

CIDDS Data

Using the 3D Graph Network Topology viz you should get something like the image here. You can see that the IP address 192.168.220.15 is shown right in the middle, but other than that this is quite a difficult diagram to interpret.
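If you are not familiar with SPL, the logic of the search above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample flows are made up for this example and are not taken from CIDDS or from the app.

```python
from collections import Counter


def internal_edge_counts(flows, prefix="192.168."):
    """Count flows per (src_ip, dest_ip) pair, keeping internal pairs only."""
    counts = Counter()
    for src_ip, dest_ip in flows:
        # Mirrors the filter src_ip=192.168.* dest_ip=192.168.*
        if src_ip.startswith(prefix) and dest_ip.startswith(prefix):
            counts[(src_ip, dest_ip)] += 1
    return counts


# Hypothetical sample flows, for illustration only.
sample_flows = [
    ("192.168.220.15", "192.168.100.5"),
    ("192.168.220.15", "192.168.100.5"),
    ("192.168.220.16", "192.168.100.5"),
    ("10.0.0.1", "192.168.100.5"),  # external source, so this flow is dropped
]
edge_counts = internal_edge_counts(sample_flows)
```

Each key in `edge_counts` is one connection in the graph, and its value is the count that the later searches use when weighting connections.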

Identifying Key Nodes on the Network

Now that we have the topology, a bit more work is required to determine the key nodes on the network. To find them, we’re going to use the GraphCentrality algorithm, which ships with the app and offers a number of ways to measure centrality in a network. Here we are using two centrality measures, namely eigenvector centrality and betweenness centrality. Don’t worry about the mathematics: all you need to know is that the higher the coefficient calculated by these methods, the more important the IP address is to the network.

The following search calculates the eigenvector centrality coefficients. These measure how well connected a given source IP is, with an IP scoring highly when it is connected to many nodes that are themselves well connected:

| tstats count WHERE (index=cidds) BY "Src IP Addr" "Dst IP Addr"
| rename "Src IP Addr" as src_ip "Dst IP Addr" as dest_ip
| search src_ip=192.168.* dest_ip=192.168.*
| fit GraphCentrality src_ip dest_ip compute="eigenvector_centrality"
| table src_ip eigenvector_centrality
| dedup src_ip
| sort 10 - eigenvector_centrality

We can see the top IP addresses by the eigenvector coefficient in the diagram here, where the IP addresses 192.168.220.15 and 192.168.220.16 appear to be connected to more nodes than any other source IP.

IP Address diagram
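For the curious, here is a minimal Python sketch of how an eigenvector centrality score can be computed using power iteration. It assumes an undirected graph and uses made-up node names. It illustrates the measure itself rather than the GraphCentrality implementation in the app, which may differ in its details.

```python
import math


def eigenvector_centrality(edges, iterations=100, tol=1e-9):
    """Power iteration on an undirected graph given as (node, node) pairs."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    score = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Adding the node's own score (iterating with A + I instead of A)
        # avoids oscillation on bipartite graphs without changing the ranking.
        new = {n: score[n] + sum(score[m] for m in neighbours[n])
               for n in neighbours}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        delta = sum(abs(new[n] - score[n]) for n in neighbours)
        score = new
        if delta < tol:
            break
    return score


# Hypothetical graph: one hub talking to four hosts, two of which also talk
# to each other.
sample_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
scores = eigenvector_centrality(sample_edges)
```

In this toy graph the hub gets the highest score, and hosts `a` and `b` score higher than `c` and `d` because they have an extra well-connected neighbour.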

Applying the same search using the betweenness centrality measure (compute="betweenness_centrality") highlighted that 192.168.220.15 and 192.168.220.16 are critical nodes, as seen in the diagram below. In this case, the betweenness centrality is a measure of how many of the shortest routes through the network flow through a given IP, so essentially it is measuring how important a node is to the overall traffic flow across the network.

IP Address diagram
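To make the idea of betweenness more concrete, the sketch below counts shortest paths on a small made-up graph using Brandes' algorithm. It assumes an undirected, unweighted graph and returns unnormalised scores. The app's own calculation may normalise or treat direction differently, so take this as an illustration of the measure only.

```python
from collections import deque


def betweenness_centrality(edges):
    """Unnormalised betweenness for an undirected, unweighted graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    score = {node: 0.0 for node in neighbours}
    for source in neighbours:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {n: [] for n in neighbours}
        paths = {n: 0 for n in neighbours}
        dist = {n: -1 for n in neighbours}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbours[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating each node's share
        # of the shortest paths that pass through it.
        dependency = {n: 0.0 for n in neighbours}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Every undirected pair has been counted once from each end.
    return {n: s / 2 for n, s in score.items()}


# Hypothetical graph: a and b reach c and d only through the hub.
sample_edges = [("a", "hub"), ("b", "hub"), ("hub", "c"), ("c", "d")]
betweenness = betweenness_centrality(sample_edges)
```

Here the hub sits on the shortest path between five pairs of nodes and `c` on three, while the leaf nodes sit on none.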

Reducing the Noise

We’ve now gained a better understanding of the most connected nodes on the network, but let's see if we can better visualise the structure of our network using one of the macros that come with the 3D graph viz app.

To see how these coefficients relate to a graph visualisation we’re going to apply some colouring and change some weightings to help the visualisation using the search below:

| tstats count WHERE (index=cidds) BY "Src IP Addr" "Dst IP Addr"
| rename "Src IP Addr" as src_ip "Dst IP Addr" as dest_ip
| search src_ip=192.168.* dest_ip=192.168.*
| fit GraphCentrality src_ip dest_ip compute="eigenvector_centrality,betweenness_centrality"
| eval col00 = "#00AA00"
| eval colX0 = "#FF0000"
| eval col0Y = "#FFFF00"
| eval colXY = "#FFA500"
`bilinearInterpolateColorGradient(betweenness_centrality, eigenvector_centrality, col00, colX0, col0Y, colXY, "color_src")`
| eval weight_dest=0.1, edge_weight=0.5
| sort - src_ip

Connected nodes

We can now see a more meaningful network structure compared to our original diagram. Importantly, you can also see that the two IPs we identified using the centrality measures are highlighted in dark orange and appear to be connected to most nodes on the network.
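The colouring in the search above blends four corner colours according to two coefficients. The Python sketch below shows one way such a bilinear blend can work. It is an assumption about the general technique, not the source of the macro that ships with the app, and it assumes both coefficients are already scaled between 0 and 1, with the first playing the role of betweenness and the second of eigenvector centrality.

```python
def hex_to_rgb(colour):
    """Convert '#RRGGBB' into a tuple of three integers."""
    return tuple(int(colour[i:i + 2], 16) for i in (1, 3, 5))


def bilinear_colour(x, y, col00, colx0, col0y, colxy):
    """Blend four corner colours for x and y in the range 0 to 1."""
    corners = [hex_to_rgb(c) for c in (col00, colx0, col0y, colxy)]
    # Each corner's weight is highest when (x, y) is closest to that corner.
    weights = [(1 - x) * (1 - y), x * (1 - y), (1 - x) * y, x * y]
    channels = [
        round(sum(w * corner[i] for w, corner in zip(weights, corners)))
        for i in range(3)
    ]
    return "#{:02X}{:02X}{:02X}".format(*channels)


# Corner colours copied from the search: green, red, yellow and orange.
palette = ("#00AA00", "#FF0000", "#FFFF00", "#FFA500")
low_low = bilinear_colour(0, 0, *palette)
high_high = bilinear_colour(1, 1, *palette)
```

Under these assumptions a node with both coefficients near zero comes out green, and a node that scores highly on both comes out orange, which matches how the two key IPs stand out in the diagram.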

Finding the ‘Spine’ of the Network

In our next search we are going to try to identify the minimum set of connections required to reach every node on the network. You can see in some of the diagrams above that there is a high number of cycles in the network, which are groups of nodes connected in a loop. This is especially visible in the number of shared connections between 192.168.220.15 and 192.168.220.16. Using the metrics we have calculated, we can assign weights to each connection and use these weights to determine the most important paths in the network.

The bulk of the search below is broadly the same as the previous search, but with a few additional commands. Starting with the eventstats, we calculate the highest number of connections between two IP addresses across all of our data. We then use this count along with the centrality coefficients to calculate a weight: critically, the weight should be lowest for the most important connections. Note that if you do not provide a weight to the minimum spanning tree algorithm, it will assume all connections have the same weight. Finally, we use this weighting to determine the minimum spanning tree across the network, in other words, to find the spine of the network.

| tstats count WHERE (index=cidds) BY "Src IP Addr" "Dst IP Addr"
| rename "Src IP Addr" as src_ip "Dst IP Addr" as dest_ip
| search src_ip=192.168.* dest_ip=192.168.*
| fit GraphCentrality src_ip dest_ip compute="eigenvector_centrality,betweenness_centrality"
| eval col00 = "#00AA00"
| eval colX0 = "#FF0000"
| eval col0Y = "#FFFF00"
| eval colXY = "#FFA500"
`bilinearInterpolateColorGradient(betweenness_centrality, eigenvector_centrality, col00, colX0, col0Y, colXY, "color_src")`
| eval weight_dest=0.5, edge_weight=1
| fields - col00 colX0 col0Y colXY
| eventstats max(count) as max_count
| eval weight=(1-(count/max_count))*(1-eigenvector_centrality)*(1-betweenness_centrality)
| fit MinimumSpanningTree src_ip dest_ip weight=weight
| sort - src_ip

You can see the results of this search in the chart below, where a much cleaner network graph can be seen. Note that there aren’t any cycles in our network now, and the two key IP addresses are still coloured in orange, with 192.168.220.15 sitting in the middle of the main cluster of nodes. The difference between this chart and the last shows just how highly connected the network in our dataset is.

Cluster of nodes
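As a rough illustration of the two extra steps in the search above, the Python sketch below reproduces the weight formula from the eval and then builds a minimum spanning tree with Kruskal's algorithm. We have not confirmed which algorithm MinimumSpanningTree uses in the app, so Kruskal is an assumption here, as is the assumption that both centrality coefficients lie between 0 and 1. The node names and weights are made up.

```python
def edge_weight(count, max_count, eigenvector, betweenness):
    """Same formula as the eval in the search: low weight means important."""
    return (1 - count / max_count) * (1 - eigenvector) * (1 - betweenness)


def minimum_spanning_tree(weighted_edges):
    """Kruskal's algorithm over a list of (weight, node_a, node_b) tuples."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for weight, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # skip any edge that would close a cycle
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Hypothetical triangle: the heaviest edge is dropped to break the cycle.
sample_edges = [(0.1, "a", "b"), (0.2, "b", "c"), (0.9, "a", "c")]
spine = minimum_spanning_tree(sample_edges)
```

The connection with the highest count always gets a weight of zero, so the busiest connections between central nodes are the first to be kept in the tree.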

Grouping Network Devices

Another option, if you want to see whether there are any similarities between your IP address endpoints, is to cluster your data based on some of the centrality metrics we just calculated. Disclaimer: this is a complex technique, so don’t worry if you want to skip on to the next section! To apply the technique we will calculate some additional statistics about our source IP addresses, scale the metrics we are interested in (the calculated statistics and centrality coefficients), reduce them to three principal components using PCA and ultimately fit a clustering algorithm, KMeans, to see if there are any patterns in the data. The search for this is below:

| tstats count WHERE (index=cidds) BY "Src IP Addr" "Dst IP Addr"
| rename "Src IP Addr" as src_ip "Dst IP Addr" as dest_ip
| search src_ip=192.168.* dest_ip=192.168.*
| fit GraphCentrality src_ip dest_ip compute="eigenvector_centrality,betweenness_centrality"
| eventstats avg(count) as avg_count dc(dest_ip) as distinct_connections by src_ip 
| fit StandardScaler avg_count distinct_connections eigenvector_centrality betweenness_centrality
| fit PCA SS_* k=3
| fit KMeans PC_* k=4

Cluster

We can see from the diagram that we have identified a few clusters in the data. The blue cluster (cluster number 0) contains two IP addresses, 192.168.220.15 and 192.168.220.16. This reinforces what we have discovered already: these two IPs behave similarly. Further investigation of the IP addresses within each cluster, using straightforward Splunk searches, may also identify their specific functions.
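If you would like to see the scaling and clustering steps spelled out, here is a simplified Python sketch using only the standard library. It skips the PCA step to stay short, seeds the clusters with the first k rows so the result is repeatable, and uses made-up feature values, so it is an illustration of the approach rather than a reproduction of what the MLTK algorithms do.

```python
import math


def standardise(rows):
    """Scale every column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical rows: avg_count, distinct_connections, eigenvector, betweenness.
features = [
    [900, 40, 0.95, 0.90],  # busy server
    [20, 3, 0.10, 0.01],    # ordinary host
    [850, 38, 0.90, 0.85],  # busy server
    [25, 4, 0.12, 0.02],    # ordinary host
    [15, 2, 0.08, 0.00],    # ordinary host
]
labels = kmeans(standardise(features), k=2)
```

With these made-up values the two busy servers land in one cluster and the ordinary hosts in the other, which is the same kind of grouping we see for 192.168.220.15 and 192.168.220.16.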

Summary

To conclude this half of the blog: the IP addresses that we have highlighted as being critical, the ones that appear to be the most connected, are actually the IPs that were used as the internal attack servers in the CIDDS data! These two IPs are developer servers that were attacked by a botnet and, after their initial infection, ran a series of port scan, ping scan, DoS and brute force attacks on other internal servers. So although we have been taking the perspective of network operations, this type of analysis can be extremely valuable for security as well. Thanks to Markus and Sarah of Coburg University for confirming this information.

We have now walked through a few techniques using graph analytics and clustering to understand the importance, connectivity and grouping of nodes on the network. Now it’s over to you to apply these searches to your own network data to see if you can find the most important nodes in your infrastructure (or perhaps those that have been compromised)!

Keep reading in the second half of the blog to see a few techniques for generating baseline behaviour for your network. Also stay tuned for updates in the near future: we will be exposing many of these techniques in an experiment framework in the next release of the 3D Graph Network Topology Visualization app, so that you can build your own analytics on your data.

Happy Splunking.

P.S. Read Part II here.

Special thanks to Philipp Drieger and Bryan Sadowski for their help collating the material for this blog series, and also to Markus Ring and Sarah Wunderlich from Coburg University for their valuable insight into the CIDDS data and their ongoing research.

Greg Ainslie-Malik

Greg is a recovering mathematician and part of the technical advisory team at Splunk, specialising in how to get value from machine learning and advanced analytics. Previously the product manager for Splunk’s Machine Learning Toolkit (MLTK), he helped set the strategy for machine learning in the core Splunk platform. A particular career highlight was partnering with the World Economic Forum to provide subject matter expertise on the AI Procurement in a Box project.

Before working at Splunk he spent a number of years with Deloitte and prior to that BAE Systems Detica working as a data scientist. Ahead of getting a proper job he spent way too long at university collecting degrees in maths including a PhD on “Mathematical Analysis of PWM Processes”.

When he is not at work he is usually herding his three young lads around while thinking that work is significantly more relaxing than being at home…
