Detecting Credit Card Fraud Using SMLE

Organizations lose billions of dollars to fraud each year. For instance, the financial services sector projects losses to reach $40 billion per year in the next 5-7 years unless financial institutions, merchants, and consumers become more diligent about fraud detection and prevention. Splunk delivers integrated enterprise fraud management software that quickly defines behavior patterns and protects enterprise information from malicious actors. In this blog post, we’ll explore an ML-powered solution that uses the Splunk Machine Learning Environment to detect fraudulent credit card transactions in real time. Using out-of-the-box Splunk capabilities, we’ll walk you through how to ingest and transform log data, train a predictive model using open source algorithms, and predict fraud in real time against transaction events.

ML-Powered Fraud Detection

Traditionally, organizations have deployed rule-based fraud detection systems. However, as fraudulent methods have advanced, these rule-based systems, which require manual set-up and longer processing times, have led to poor customer experience and weakened security. An ML-based approach significantly simplifies or even automates the process and addresses the challenges of evolving, more nuanced fraud patterns.

Build A Solution With SMLE

Let’s explore a supervised machine learning approach to building a fraud classification solution using the Splunk Machine Learning Environment (SMLE). We’ll break down the solution into four steps:

  1. Ingest data from data sources of choice with Splunk’s SPL2 query language
  2. Extract features and transform data using convenient SPL2 operators
  3. Use popular and easy-to-integrate open source ML algorithms to build a classification model
  4. Extract insights from the results and build a dashboard for monitoring and detection

Step 1: Ingest Historical Transaction Logs Using SPL2

First, we start by streaming data into our pipeline. In our example, we pull data from an AWS S3 bucket where we have already uploaded a few months’ worth of transactions. These data have been labelled as fraudulent where appropriate. Our SPL2 data pipeline brings data in from the S3 bucket into our Jupyter notebook environment. The output is a series of raw credit card logs. Each row represents a transaction, and each transaction consists of signals represented by variables V0 through V28.
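To make the shape of the ingested records concrete, here is a minimal local sketch in Python. It is not the SPL2 pipeline from this post: it assumes the logs arrive as CSV text with a header row, and the sample values, the `RAW_LOG` constant, and the `load_transactions` helper are all hypothetical. Only three of the signals are included to keep the sample short.

```python
import io

import pandas as pd

# Hypothetical raw log: a header row, then one transaction per line.
# The real records carry signals V0 through V28; three are shown here.
RAW_LOG = """V0,V1,V2,Class
0.12,-1.30,0.45,0
2.91,0.77,-3.10,1
-0.40,0.05,0.88,0
"""


def load_transactions(raw_text: str) -> pd.DataFrame:
    """Parse CSV-formatted transaction logs into a DataFrame."""
    df = pd.read_csv(io.StringIO(raw_text))
    # Treat the label as an integer flag: 1 = fraudulent, 0 = legitimate.
    df["Class"] = df["Class"].astype(int)
    return df


transactions = load_transactions(RAW_LOG)
print(transactions.shape)
print(transactions.head())
```

Each row of the resulting frame is one transaction, with the signals as feature columns and `Class` as the fraud label.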

Step 2: Explore and Transform Data Using SPL2 Operators

Before we dive into the nitty-gritty of the ML pipeline, let’s first explore the data. In SMLE, we can use simple Python code that can be seamlessly integrated with SPL2. First, we use Python to generate summary statistics about the fraudulent transactions in the data we ingested, looking at the mean, standard deviation, and distribution of labelled fraudulent transactions.
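As a rough illustration of this kind of summary, the sketch below uses pandas on a synthetic dataset. The data, the 2% fraud rate, and the `summarize_labels` helper are assumptions made for the example; they are not taken from the dataset used in this post.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ingested transactions (assumed, not real data).
rng = np.random.default_rng(seed=0)
n = 1000
df = pd.DataFrame({
    "V0": rng.normal(size=n),
    "V1": rng.normal(size=n),
    "Class": (rng.random(n) < 0.02).astype(int),  # 1 = fraudulent
})


def summarize_labels(frame: pd.DataFrame, label: str = "Class") -> dict:
    """Return basic statistics about the fraud label column."""
    return {
        "n_transactions": int(len(frame)),
        "n_fraud": int(frame[label].sum()),
        "fraud_rate": float(frame[label].mean()),  # mean of a 0/1 label
        "label_std": float(frame[label].std()),
        "counts": frame[label].value_counts().sort_index().to_dict(),
    }


summary = summarize_labels(df)
print(summary)
# Per-feature statistics (count, mean, std, quartiles) for the signals.
print(df.drop(columns="Class").describe())
```

Because the label is a 0/1 flag, its mean is the fraction of transactions marked as fraud, which is a quick way to see how imbalanced the classes are.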

The summary statistics confirm that the dataset contains labelled examples of fraudulent transactions.

What we’ll do next is identify which of the signals in each transaction record are significant in determining the probability of fraud. To do this, we’ll derive a correlation between pairs of features and plot a heatmap to give us visual cues.
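The correlation step can be sketched with pandas as follows. The synthetic data, the 0.1 cutoff, and the choice to rank features by absolute correlation with `Class` are assumptions for illustration; the post does not state which columns were dropped or how the cutoff was chosen.

```python
import numpy as np
import pandas as pd

# Synthetic data: V0 and V2 are shifted for fraud rows, V1 is pure noise.
rng = np.random.default_rng(seed=1)
n = 2000
label = (rng.random(n) < 0.1).astype(int)
df = pd.DataFrame({
    "V0": rng.normal(size=n) + 3.0 * label,
    "V1": rng.normal(size=n),
    "V2": rng.normal(size=n) - 2.0 * label,
    "Class": label,
})

# Pairwise Pearson correlation between all columns.
corr = df.corr()

# Rank the signals by the strength of their correlation with the label.
class_corr = corr["Class"].drop("Class").abs().sort_values(ascending=False)
print(class_corr)

# Keep only signals above an assumed cutoff, plus the label itself.
MIN_ABS_CORR = 0.1
kept = list(class_corr[class_corr >= MIN_ABS_CORR].index)
reduced = df[kept + ["Class"]]
print(reduced.columns.tolist())
```

The `corr` matrix is what a heatmap would visualize, for example by passing it to a plotting function such as `seaborn.heatmap`.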

We see in the heatmap that the Class feature shows a high degree of correlation with some of the signals (lighter colors indicate higher correlation). Based on this, we can drop a few columns to simplify our dataset and set it up for our training step.

Step 3: Build a Classification Model Using the Historical Data

We’ll use Scikit-learn’s RandomForestClassifier algorithm to build a model that classifies a transaction as fraudulent or not, learning from the labelled dataset we’ve analyzed so far. A short Python snippet is enough to implement this step. The model we build can be published and exported to the broader Splunk ecosystem, as we’ll see in step 4.
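A minimal training sketch with scikit-learn might look like the following. The synthetic features, the 5% fraud rate, and the hyperparameters (including `class_weight="balanced"`) are assumptions chosen for the example, not the settings used in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced dataset: two informative signals, three noise signals.
rng = np.random.default_rng(seed=42)
n = 3000
y = (rng.random(n) < 0.05).astype(int)  # 1 = fraudulent
X = rng.normal(size=(n, 5))
X[:, 0] += 3.0 * y
X[:, 1] -= 2.0 * y

# Stratify so that both splits keep the same fraud rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" reweights the rare fraud class during training.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
```

With so few fraud cases, overall accuracy can look good even for a weak model, so the per-class precision and recall in the report are the more useful numbers to check.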


Step 4: Predict Fraud
Finally, we’ll deploy the trained model into an SPL2 pipeline that is capable of reading logs from a variety of data sources (S3, Splunk indexes, etc.) and applying the trained model to predict fraudulent transactions in real time. The outcome of this step is a table of transactions with their predictions.
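Outside of an SPL2 pipeline, the scoring step can be approximated in plain Python. In this sketch the `make_events` and `score_events` helpers, the feature names, and the 0.5 threshold are hypothetical, and the model is trained on synthetic data so that the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["V0", "V1", "V2"]


def make_events(n: int, seed: int) -> pd.DataFrame:
    """Generate synthetic labelled transactions (assumed data)."""
    rng = np.random.default_rng(seed)
    label = (rng.random(n) < 0.05).astype(int)
    data = rng.normal(size=(n, len(FEATURES)))
    data[:, 0] += 3.0 * label  # make V0 informative about fraud
    frame = pd.DataFrame(data, columns=FEATURES)
    frame["Class"] = label
    return frame


def score_events(model, events: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Append a fraud probability and a 0/1 prediction to each event."""
    scored = events.copy()
    # Column 1 of predict_proba is the probability of class 1 (fraud).
    scored["fraud_probability"] = model.predict_proba(events[FEATURES])[:, 1]
    scored["predicted_fraud"] = (scored["fraud_probability"] >= threshold).astype(int)
    return scored


# Train on historical data, then score a batch of new, unlabelled events.
history = make_events(2000, seed=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(history[FEATURES], history["Class"])

incoming = make_events(10, seed=1).drop(columns="Class")
results = score_events(model, incoming)
print(results)
```

Lowering the threshold flags more transactions for review at the cost of more false alarms, so in practice it would be tuned to the cost of a missed fraud versus a blocked legitimate purchase.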


Beyond the capabilities we’ve demonstrated so far, Splunk’s AI/ML platform provides capabilities to build dashboards to identify these fraudulent transactions and set up automated workflows to ensure models are kept up to date with recently ingested data. Using SMLE’s MLOps capabilities, the deployed models and their associated operations can be managed and monitored from a single pane of glass.

End-to-End Solution with SMLE

We’ve demonstrated a solution for fraud detection above using SMLE (Splunk ML Environment), a platform for building and deploying ML at scale from within the Splunk ecosystem. By extending the features of Splunk that customers love with a suite of data science and operations capabilities, SMLE allows Splunk users and data scientists to collaborate on building solutions that involve a combination of SPL and ML libraries.

We are excited to offer a beta version of SMLE — sign up now to get started! To learn more about offerings and announcements from Splunk Machine Learning, check out these Splunk Blogs posts.

Wrapping Up

You got to see how to build a simple, real-time fraud detection solution with SPL2 and Scikit-learn in the SMLE platform. By combining powerful, easy-to-use SPL2 operators with the flexibility of popular programming languages like Python, SMLE allows users to construct entire workflows as a sequence of SPL2 and ML operations. Stay tuned as we explore further use cases and highlight the range of capabilities that SMLE has to offer.

Interested in trying SMLE? Sign up for our beta program!

Curious about more SMLE use cases? Check out our last use case walkthrough on improving DevOps workflows and identifying anomalies on the stream.

This Splunk blog post was co-authored by Vinay Sridhar (main author), Senior Product Manager for Machine Learning, and Mohan Rajagopalan, Senior Director of Product Management for Machine Learning.

----------------------------------------------------
Thanks!
Mohan Rajagopalan
