Organizations lose billions of dollars to fraud each year. For instance, the financial services sector projects losses to reach $40 billion per year in the next 5-7 years unless financial institutions, merchants, and consumers become more diligent about fraud detection and prevention. Splunk delivers integrated enterprise fraud management software that quickly defines behavior patterns and protects enterprise information from malicious actors. In this blog post, we’ll explore an ML-powered solution using the Splunk Machine Learning Environment to detect fraudulent credit card transactions in real time. Using out-of-the-box Splunk capabilities, we’ll walk you through how to ingest and transform log data, train a predictive model using open source algorithms, and predict fraud in real-time against transaction events.
ML-Powered Fraud Detection
Traditionally, organizations have deployed rule-based fraud detection systems. However, with advances in fraudulent methods, these rule-based systems which require manual set-up and longer processing times have led to poor customer experience and weakened the security. An ML-based approach significantly simplifies or even automates the process and addresses the challenges of evolving, more nuanced fraud patterns.
Build A Solution With SMLE
Let’s explore a supervised machine learning approach to building a fraud classification solution using the Splunk Machine Learning Environment (SMLE). We’ll break down the solution into four steps:
- Ingest data from data sources of choice with Splunk’s SPL2 query language
- Extract features and transform data using convenient SPL2 operators
- Use popular and easy to integrate open source ML algorithms to build a classification model
Extract insights from the results and build a dashboard for monitoring and detection.
Step 1: Ingest Historical Transaction Logs Using SPL2
First, we start by streaming data into our pipeline. In our example, we pull data from an AWS S3 bucket where we have already uploaded a few months worth of transactions. These data have been labelled as fraudulent where appropriate. Here’s our SPL2 data pipeline that brings data in from the S3 bucket into our Jupyter notebook environment. The output is a series of raw credit card logs. Each row represents a transaction, and each transaction consists of signals represented by variables V0 through V28 as shown below.
Step 2: Explore and Transform Data Using SPL2 Operators
Before we dive into the nitty gritty of the ML pipeline, let’s first explore the data. In SMLE, we can use simple Python code that, as we show, can be seamlessly integrated with SPL2. First, we use Python to generate summary statistics about the number of fraudulent transactions we observe in the data we ingested — looking at the mean, standard deviation, and distribution of labelled fraudulent transactions.
The summary statistics confirm that the dataset contains labelled examples of fraudulent transactions.
What we’ll do next is identify which of the signals in each transaction record are significant in determining the probability of fraud. To do this, we’ll derive a correlation between pairs of features and plot a heatmap to give us visual cues.
Based on what we see in the heatmap, We see in the heatmap that the Class feature shows a high degree of correlation (lighter colors indicate higher correlation). This way, we can drop a few columns to simplify our dataset and set it up for our training step.
Step 3: Build a Classification Model Using the Historical Data
We’ll use Scikit-learn’s RandomForestClassifier algorithm to build a classification model that classifies a transaction as fraudulent or not learning from the supervised dataset we’ve analyzed so far. In the example below, we show a Python snippet that implements this step. The model we build can be published and exported to the broader Splunk ecosystem as we’ll see in step 4.
Step 4: Predict Fraud
Finally, we’ll deploy the trained model into an SPL2 pipeline that is capable of reading logs from a variety of data sources (S3, Splunk indexes, etc.) and applying the trained model to predict fraudulent transactions in real time. The snippet below shows the outcome of this step in the form of a table.
Beyond the capabilities we’ve demonstrated so far, Splunk’s AI/ML platform provides capabilities to build dashboards to identify these fraudulent transactions and set up automated workflows to ensure models are kept up to date with recently ingested data. Using SMLE’s MLOps capabilities, the deployed models and their associated operations can be managed and monitored from a single pane of glass.
End-to-End Solution with SMLE
We’ve demonstrated a solution for fraud detection above using SMLE (Splunk ML Environment), a platform for building and deploying ML at scale from within the Splunk ecosystem. By extending the features of Splunk that customers love with a suite of data science and operations capabilities, SMLE allows Splunk users and data scientists to collaborate on building solutions that involve a combination of SPL and ML libraries.
You got to see how to build a simple, real-time fraud detection solution with SPL2 and Scikit-learn in the SMLE platform. With a combination of powerful and easy-to-use SPL2 operators and flexibility of popular programming languages like Python, SMLE allows users to construct entire workflows with a sequence of SPL2 and ML operations. Stay tuned as we explore further use cases and highlight the range of capabilities that SMLE has to offer.
Interested in trying SMLE? Sign up for our beta program!
Curious about more SMLE use cases? Check out our last use case walkthrough on improving DevOps workflows and identifying anomalies on the stream.
This Splunk blog post was co-authored by Vinay Sridhar (main author), Senior Product Manager for Machine Learning, and Mohan Rajagopalan, Senior Director of Product Management for Machine Learning.