Analyzing BotNets with Suricata & Machine Learning

Since the official rollout of the Machine Learning Toolkit (MLTK) at this year’s .conf, Splunkers have been pursuing some interesting use cases spanning IT operations, planning, security, and business analytics. Those use cases barely scratch the surface of what is possible with machine learning and Splunk. As an example, I will use the MLTK and data collected from Suricata to analyze botnet populations. This population analysis will be used to create a model for predicting Mirai botnet activity based on network features.

Suricata

Suricata is an open source threat detection engine, which can be run in passive mode for intrusion detection or inline for intrusion prevention. My lab environment is configured for intrusion detection, meaning Suricata will not make any attempt to prevent an intruder from accessing my system. This is a “good” thing because the behavioral signature of Mirai (and its variants) relies on specific usernames for IoT devices, taken from the scanner.c module, and those usernames show up in the sshd logs and telnet sessions of the server it attempts to infect.

scanner-c

Analysis

The analysis largely builds upon the previous blog post (Analyzing the Mirai Botnet with Splunk), which correlated failed logins for specific usernames with their source IP addresses. The resulting threatlist of suspected Mirai IP addresses can be analyzed for various features such as geography, IANA registration, frequency, etc.

mirai_botnet_access_attempts_map
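
As a rough illustration of how such a threatlist could be assembled outside of Splunk, here is a minimal Python sketch. It assumes the standard OpenSSH "Failed password" log message, and the usernames are only an illustrative subset of the scanner.c credential list. The function name, the sample log lines, and the documentation-range IP addresses are all made up for the example.

```python
import re
from collections import Counter

# Illustrative subset of usernames from Mirai's scanner.c (assumption: not exhaustive)
MIRAI_USERNAMES = {"ubnt", "support", "supervisor", "666666", "888888"}

# Matches the usual OpenSSH failed-login message, with or without "invalid user"
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+) port \d+"
)

def build_threatlist(log_lines, min_attempts=1):
    """Count failed logins per source IP that used a Mirai-associated username."""
    hits = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match and match.group("user") in MIRAI_USERNAMES:
            hits[match.group("ip")] += 1
    return {ip: count for ip, count in hits.items() if count >= min_attempts}

# Hypothetical sample lines, not taken from the lab environment
sample = [
    "sshd[1021]: Failed password for invalid user ubnt from 203.0.113.7 port 40122 ssh2",
    "sshd[1022]: Failed password for invalid user support from 203.0.113.7 port 40130 ssh2",
    "sshd[1023]: Failed password for alice from 198.51.100.4 port 51514 ssh2",
    "sshd[1024]: Accepted password for bob from 192.0.2.10 port 22022 ssh2",
]
threatlist = build_threatlist(sample)
print(threatlist)
```

Raising min_attempts is one simple way to reduce false positives from a single mistyped login.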

Combining this threatlist with our passive intrusion detection netflow data creates an enriched dataset for building a model. Adding contextual and detailed information about each access attempt at the packet level provides insight into the activity attempted by that IP or block of IPs during a 24-hour window. As an example, we can determine which TCP flags were present in the packets, both client side and server side, in each flow transaction and begin grouping similar events together. We can also create a ratio of packets_in vs. packets_out and classify these flows into various producer consumer ratio (PCR) categories.

Mirai analysis pcr categories
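
To make the PCR idea concrete, here is a small Python sketch. It assumes the commonly used definition PCR = (out - in) / (out + in), applied to packet counts rather than bytes. The bucket boundaries and every label except "3:1 Import" are my own assumptions for illustration and may not match the categories used in the dashboard.

```python
def packet_pcr(packets_out, packets_in):
    """Producer consumer ratio in [-1, 1]; None when the flow carried no packets."""
    total = packets_out + packets_in
    if total == 0:
        return None
    return (packets_out - packets_in) / total

def pcr_range(pcr):
    """Bucket a PCR value into a named category (boundaries are assumptions)."""
    if pcr is None:
        return "No Traffic"
    if pcr >= 0.7:
        return "Pure Push"
    if pcr >= 0.2:
        return "70:30 Export"
    if pcr > -0.2:
        return "Balanced"
    if pcr > -0.7:
        return "3:1 Import"
    return "Pure Pull"

# A flow that receives three packets for every one it sends
pcr = packet_pcr(packets_out=10, packets_in=30)
print(pcr, pcr_range(pcr))
```

A flow that receives three packets for each one it sends works out to (10 - 30) / 40 = -0.5, which is where the 3:1 Import name comes from.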

MLTK

The MLTK is handy because of the many assistants that ship with the tool, so you don’t need to know the exact SPL syntax to begin making use of it. Using the clustering assistant, I can attempt to discern different botnet populations based on the features present in my dataset. In the example below, I have selected 50k random Suricata flow events where the dest_port is 22. I have picked features which have some relation to each other, but are enriched by the PCR metric. I have created a label of isMirai with possible values of 0 | 1, depending on whether the IP address associated with that flow event appears in the threatlist. I have opted for Kmeans clustering with a value of k=5.

kmeans_clusters_botnet_activity
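
The data preparation described above (filter on dest_port 22, take a random sample, attach the isMirai label) could be sketched in Python roughly as follows. Only dest_port and isMirai come from the post. The src_ip field name, the function name, and the sample flows are hypothetical.

```python
import random

def label_and_sample(flows, threatlist, sample_size, seed=0):
    """Keep SSH flows, draw a random sample, and label each flow with isMirai."""
    ssh_flows = [flow for flow in flows if flow["dest_port"] == 22]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = rng.sample(ssh_flows, min(sample_size, len(ssh_flows)))
    return [
        {**flow, "isMirai": 1 if flow["src_ip"] in threatlist else 0}
        for flow in picked
    ]

# Hypothetical flow records; src_ip is an assumed field name
flows = [
    {"src_ip": "203.0.113.7", "dest_port": 22, "packets_in": 12, "packets_out": 4},
    {"src_ip": "198.51.100.4", "dest_port": 22, "packets_in": 20, "packets_out": 18},
    {"src_ip": "203.0.113.7", "dest_port": 23, "packets_in": 6, "packets_out": 2},
]
labeled = label_and_sample(flows, threatlist={"203.0.113.7"}, sample_size=50_000)
print(labeled)
```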

Interestingly, a clear visual pattern emerges with cluster_4. It is clearly an outlier compared to the rest of the population, but is there anything special about it? From an isMirai 0 | 1 perspective, there is a mixture of both 1’s and 0’s. The packet_pcr_range is 3:1 Import, with varying ratios, and this seems to be the only feature common to cluster_4.

kmeans_apply_model_filter_cluster4

Using a model for prediction

The MLTK isn’t intended to create models for the sake of creating models; it also allows you to operationalize those models for prediction based on the features found in them. One such feature we get from Kmeans is cluster_distance, which describes how far an event is from the centroid of its cluster.
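
For readers who want to see where a number like cluster_distance comes from, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) with a naive deterministic initialization. It is not the MLTK implementation. The plain Euclidean metric is an assumption, since the toolkit may report the distance differently (for example, squared), and the toy points are invented.

```python
import math

def nearest(point, centroids):
    """Return (cluster index, Euclidean distance) for the closest centroid."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]

def kmeans(points, k, iterations=20):
    """Naive k-means: the first k points serve as the starting centroids."""
    centroids = [tuple(point) for point in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centroids)[0]].append(point)
        # Move each centroid to the mean of its members; leave empty clusters in place
        centroids = [
            tuple(sum(dim) / len(group) for dim in zip(*group)) if group else centroid
            for group, centroid in zip(groups, centroids)
        ]
    return centroids

# Toy two-dimensional feature vectors (hypothetical)
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 3.0), (9.0, 11.0)]
centroids = kmeans(points, k=2)
cluster, cluster_distance = nearest((1.0, 1.0), centroids)
print(centroids, cluster, cluster_distance)
```

Events with a large cluster_distance sit far from the typical member of their cluster, which is why the value can be useful as an input feature for a downstream classifier.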

Using the prediction assistant, the Kmeans model can be loaded in search, after which we select from the dropdown the features to use for prediction: cluster_distance, packet_pcr_range, packet_ratio, and packet_total. The prediction assistant also lets you choose the specific algorithm to use for prediction; I have opted for Random Forest.

using_kmeans_apply_use_random_forest

Next Steps

The model appears to be very good at predicting 0 (not Mirai), while it is reasonably good at predicting 1 (89.4%). This is an improvement over Suricata, which did not detect Mirai with the Emerging Threats ruleset. This may imply that there is an indicator of compromise for the Mirai botnet at the packet level. Proving this requires further investigation and independent validation, both to understand why the model can predict Mirai so effectively and to rule out bias or mistakes. Collaborating with others who have also gathered botnet traffic is a great way to validate the model against a dataset it has not seen before. If an indicator of compromise can be discerned from this analysis, it could be converted into an IDS signature for future detection of Mirai infection attempts.

using_kmeans_apply_use_random_forest_accuracy
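
A per-class figure like the one quoted above is typically derived by comparing actual and predicted labels class by class. The following Python sketch shows that calculation on a handful of invented labels. The function name and the numbers are purely illustrative and are not the results from my model.

```python
from collections import Counter

def per_class_recall(actual, predicted):
    """Fraction of each actual class that the model labeled correctly."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}

# Invented labels: 0 = not Mirai, 1 = Mirai
actual = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 1, 1, 1, 0]
recall = per_class_recall(actual, predicted)
print(recall)
```

Running the same calculation on traffic the model has never seen is the simplest form of the independent validation described above.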

----------------------------------------------------
Thanks!
Anthony Tellez
