Add To Chrome? - Part 4: Threat Hunting in 3-Dimensions: M-ATH in the Chrome Web Store

Welcome to the final installment in our “Add to Chrome?” research! In this post, we'll experiment with a method to find masquerading or suspicious clusters of Chrome extensions using Model-Assisted Threat Hunting (M-ATH) with Splunk and the Data Science & Deep Learning (DSDL) App. M-ATH is a SURGe-developed method from the PEAK framework, which uses models or algorithms to help find threat-hunting leads or to make complex problems more approachable.

M-ATH is well-suited for identifying suspicious Chrome extensions because of the complexity of extensions, and the scale of the Web Store. Chrome extensions are not actually singular components, but compilations of multiple files and file types, written in different programming languages and designed for diverse functions, e.g., language translation, ad-blocking, or password management. The Chrome Web Store hosts more than 140,000 browser extensions for the Chrome Web Browser. The extension contents are packaged in compressed “.crx” files, which can contain the file types, scripting languages, and metadata required to make the extension work.

Confirming that an extension is malicious typically requires reverse-engineering effort, and often depends on the context in which the extension is purportedly operating. We can narrow down this haystack using M-ATH methods to identify Chrome extensions worth further analysis. To see everything in action, or to follow along, we’ve made the full data corpus and notebook available on our GitHub and Google Colab!

Background

Masquerading is a critical challenge in any curated app store, open-source marketplace, or package repository. In masquerading efforts, attackers subvert the user’s trust assumptions by creating stand-in apps that look or sound like a trusted or well-understood service, in many cases backdooring the files to steal personal or private information or to implant malware on the user’s machine.

Our M-ATH approach operates under the hypothesis that if we start with a popular ‘target extension’, we can use similarity measures to find suspicious extensions that try to imitate or masquerade as the target extension. As an example, we’ll start with the popular ‘Google Translate’ extension, which integrates a ‘right-click’ option to quickly translate text in your browser from different languages.

Data

To start, we need good data that represents the contents and attributes of the extensions hosted on the Web Store. The SURGe Team (with the approval of the Google security team) scraped the contents of the Chrome Web Store to construct a baseline dataset of the following:

| Field | Description |
| --- | --- |
| name | Name of the Chrome extension. |
| description | Description of the functionality of the browser extension. |
| crx | The ‘.crx’ file is a packaged Chrome Browser extension. Each CRX is assigned a unique hash value by Google, for tracking. |
| Extension File Contents | The CRX package contains multiple files to support the functionality of the extension. These can include, e.g., JavaScript files, HTML files, images, a manifest file (manifest.json), and more. |

Next, this data was enriched with some additional metrics:

Color Moment Hash: A method used to create a hash representation of an image based on its color moments. It aims to capture the statistical information of an image. In this case, the hash is applied to the extension's logo file (pulled from the full extension file contents) and stored in a field called image_cmhash.

Approach

Our hunting approach relies on aspects we’d expect to see from a masquerading extension – it would attempt to look like, sound like, and/or be described like our target extension. How do we turn these ideas into measurements? Using our baseline data and the hashes we’ve added, we have several methods at our disposal:

Levenshtein similarity can measure the likeness between Extension Names
Color Moment similarity can measure the likeness between Extension Icons
Cosine Similarity can measure the likeness between Extension descriptions
Unsupervised Learning can cluster and visualize the attributes of our extensions in a 3-D Scatterplot
Euclidean Distance can measure the distance between vector representations of our extensions to serve as a composite similarity score.

With these ideas in hand and a target browser extension (Google Translate) to start our hunting, we can quickly identify the next most similar in appearance, name, and description from the sea of more than 140,000 extensions. First, let’s break these down a bit more with some examples.

Naming Similarity

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) required to change one word into the other. In this case, we invert the metric (i.e., convert distance to similarity) to match the orientation of our other similarity metrics. If we sort by our Levenshtein similarity, we can quickly find the most similarly named extensions, along with their unique crx identifiers for pivoting to the full file contents.
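To make the inversion concrete, here is a minimal Python sketch of the idea. The normalization used here (one minus the distance divided by the longer name's length) is our assumption for illustration, since other ways of converting distance to similarity exist, and the candidate names are hypothetical rather than taken from the dataset.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


# Hypothetical extension names, sorted from most to least similar to the target.
target = "Google Translate"
candidates = ["Google Translate", "Gooogle Translate", "Dark Mode"]
ranked = sorted(candidates, key=lambda name: levenshtein_similarity(target, name), reverse=True)
```

Sorting in descending order puts exact and near-exact name matches at the top, which is where a masquerading extension would be expected to appear.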

Icon Similarity

Icon similarity is assessed using Hamming Similarity between the Color Moment Hashes of two extensions. A Color Moment Hash is a compact representation (a hash) of an image, based on the statistical moments of its color components. This hash is valuable for image comparison because it encapsulates significant information about the image's color distribution while being relatively insensitive to small changes or distortions in the image. Measuring the similarity between color moment hashes quickly locates exact matches to our target icon, as well as similar but slightly modified versions.
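As a rough illustration of the comparison step, the sketch below computes a Hamming similarity between two hashes. It assumes the hashes are stored as equal-length hexadecimal strings, which is our assumption for this example and may not match how image_cmhash is actually encoded in the dataset.

```python
def hamming_similarity(hash_a: str, hash_b: str) -> float:
    """Fraction of matching bits between two equal-length hex-encoded hashes."""
    if not hash_a or len(hash_a) != len(hash_b):
        raise ValueError("hashes must be non-empty and the same length")
    total_bits = len(hash_a) * 4  # each hex character encodes 4 bits
    # XOR leaves a 1 wherever the two hashes disagree.
    differing_bits = bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
    return 1.0 - differing_bits / total_bits
```

A score of 1.0 indicates an identical hash, while scores just below 1.0 point to icons that differ in only a few bits, such as a lightly edited copy of the target logo.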

Description Similarity

Cosine Similarity is the metric we use to compare the text of two description fields. Each text is represented as a vector (a numerical encoding), where each dimension corresponds to a word from the combined set of words in both texts, and the value in each dimension corresponds to the weight of that word in the text. Cosine similarity is the cosine of the angle between these two vectors, which reflects how similar the two text fields are.
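The following Python sketch shows one simple way to compute this. It uses raw word counts as the weights and a basic lowercase tokenizer; both are assumptions made for illustration, and other weighting schemes such as TF-IDF would fit the same description.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    vec_a = Counter(tokenize(text_a))
    vec_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description shares nothing with any other text
    return dot / (norm_a * norm_b)
```

Descriptions that reuse the target's wording score close to 1, while descriptions with no words in common score 0.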

Aggregate Similarity Analysis

We used the previous three measures to hunt for similarity between names, descriptions and icon appearance individually, but a true masquerading extension could combine more than one of these attributes. To better assess our similar extensions as a whole, we will combine the top results from each category, cluster them with K-means, and then measure the distance from each point to our target vector (Google Translate), using Euclidean distance.
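The distance step can be sketched in a few lines of Python. This example assumes each extension is represented as a (name, icon, description) similarity vector with every score scaled to the range 0 to 1, so that the target compared with itself sits at (1.0, 1.0, 1.0). The extension identifiers and scores are hypothetical.

```python
import math


def euclidean_distance(point_a, point_b) -> float:
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))


# Assumed target vector: the target extension is fully similar to itself.
target = (1.0, 1.0, 1.0)

# Hypothetical (name, icon, description) similarity scores per extension.
candidates = {
    "ext_a": (0.9, 1.0, 0.8),
    "ext_b": (0.4, 0.2, 0.5),
    "ext_c": (0.7, 0.1, 0.9),
}

# Smaller distance means closer to the target on all three measures at once.
ranked = sorted(candidates.items(), key=lambda item: euclidean_distance(item[1], target))
```

Because the distance combines all three measures, an extension that is only moderately similar on each axis can still rank above one that matches strongly on a single axis.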

K-means is an unsupervised learning algorithm that partitions a dataset into k distinct, non-overlapping clusters based on the attributes of the data points – in this case, the Levenshtein, Cosine and Color Moment Hamming similarity measures. The algorithm aims to minimize the within-cluster variances and maximize the between-cluster variances, meaning that it seeks to create clusters where members of the same cluster are as similar as possible while also being as different as possible from members of other clusters. By assigning colors to different cluster values, we can visualize our data in a three-dimensional scatter plot, across each of their similarity measures.
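For readers who want to see the mechanics, below is a minimal pure-Python sketch of the K-means loop (Lloyd's algorithm). It is not the implementation used in our notebook. To keep the result deterministic it seeds the centroids with the first k points, a simplification that production libraries replace with smarter initialization, and the sample points are hypothetical similarity vectors.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Partition points into k clusters; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # simplified, deterministic seeding
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical (name, icon, description) similarity vectors.
points = [
    (0.90, 0.90, 0.90), (0.10, 0.10, 0.20),
    (1.00, 0.95, 0.90), (0.00, 0.20, 0.10),
    (0.95, 1.00, 1.00), (0.20, 0.10, 0.00),
]
labels, centroids = kmeans(points, k=2)
```

The returned labels are what we would map to colors in the scatter plot, so that extensions scoring high on all three measures stand apart from the rest.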

Figure: 3-D Scatter Plot of Levenshtein, Color Moment, & Cosine Similarity Measures

Conclusion

In summary, we demonstrated how we can apply hashing and vectorization to encode and enrich our dataset, use algorithms to measure the similarity between vectors, and use modeling to combine and cluster multiple elements of similarity. This approach allows us to start with any single extension and quickly reduce our dataset of more than 140,000 extensions down to a handful of candidates for deeper reverse-engineering analysis.

The notebook and data we’ve released create an independent way to explore and analyze Chrome Web Store extension data yourself. In addition, the process and metrics can be transferred to a variety of different Model-Assisted Threat Hunting or data exploration challenges using Splunk AI and DSDL. Can this approach be applied to find new threats in your environment?

Stay tuned for more updates from the SURGe team, and Happy Hunting!
