Creating a Fraud Risk Scoring Model Leveraging Data Pipelines and Machine Learning with Splunk

According to the Association of Certified Fraud Examiners (ACFE), businesses lose over $3.5 trillion to fraudsters each year. The ACFE's 2016 Report to the Nations on Occupational Fraud and Abuse states that proactive data monitoring and analysis is among the most effective anti-fraud controls. Organizations that apply proactive data analysis techniques experience frauds that are up to 54% less costly and 50% shorter than organizations that do not monitor and analyze their data for signs of fraud. As fraudsters continue to adapt and adopt new methods, it is important to leverage machine learning and data science algorithms to fight fraud. Detecting anomalies and outliers through machine learning, adaptive thresholds and other advanced techniques is the next wave in fraud detection and prevention. What about carrying out those advanced analytics with the help of Splunk’s Data-to-Everything platform and helping clients reduce the impact of fraud? Even if the ACFE figures were exaggerated, the market opportunity is huge.

Here at Splunk, we have helped many customers across a range of different industries in their fight against fraud: whether that is helping to detect financial fraud as described here by Haider, or supporting the fight against opioid abuse by monitoring for fraudulent diversion of controlled substances. More recently, we have also shown how you can detect fraudulent credit card transactions using our new Splunk Machine Learning Environment (SMLE). We also offer the free Splunk Security Essentials for Fraud Detection app, which covers many use cases such as healthcare insurance billing and payments.

In recent months, Splunk has been approached by various multinational sports betting companies to help them streamline their fraud prevention and detection processes and provide their Revenue Assurance teams with a 360º view of fraud.

One requirement we came across several times was that clients wanted a sports betting fraud risk scoring model to detect fraud quickly. For that purpose, I designed a data pipeline to create a sports betting fraud risk scoring model based on anomaly detection algorithms built on probability density functions and powered by Splunk’s Machine Learning Toolkit.

This article showcases a solution that can be built with Splunk in very little time using your clients’ existing data.

Credit to Greg Anslie and Raúl Marín for their valuable help in the design of the pipeline and their wise insights while I was creating this content. Thank you, guys!

The Solution

The plan to carry out the solution setup is as follows:

  1. Indexing Sports Betting Event Data in a Splunk Cloud Environment
  2. Correlation of data sources, enrichment and generation of KPIs
  3. Fraud risk scoring model training using ML
  4. Application of the scoring model on sports events
  5. Generation of control dashboards

To accelerate the time to value in the proposed solution, Splunk indexes data exported from the relational databases that contain data about the different sports events. This data would have been stored through a traditional batch ETL process which transformed the sources' raw data into traditional SQL-type tables. At the end of this article, a set of next steps will be suggested, including connecting directly to the data sources without an intermediate SQL database.

A common practice when developing ML models is to divide your data set into two pieces, one for ML training and another to test the model itself. In this case, let’s imagine that we have 12 months’ worth of data and that we will use the first 11 months for model training and the final month for model testing.

The data pipeline that performs data indexing, transformation, ML model training and ML model application, and finally provides dashboarding and investigation capabilities, is as follows:

Data pipeline

Note that once the data enters Splunk, the data pipeline depicts transformations of the data itself, not the underlying hardware/software architecture. To perform the various transformations on the data (indexing, enrichment, summarization, ML), only the indexer and search head components of Splunk’s highly scalable architecture are required, and only the SPL language is needed across the data pipeline. This makes the solution simpler to build and maintain than data pipelines assembled from many different pieces of software.

Step 1: Indexing Sports Betting Event Data in a Splunk Cloud Environment

In the following figure you can see the part of the pipeline to which this section is dedicated:

Data pipeline: data ingestion

The data ingestion will be carried out in Splunk from a set of sports betting database tables. There are many ways to perform this but the most common one is to use the Splunk DB Connect app, which helps you quickly integrate structured data sources with your Splunk real-time machine data collection. Database import functionality from the Splunk DB Connect app allows you to import tables, rows, and columns from a database directly into Splunk Enterprise, which indexes the data. You can then analyze and visualize that relational data from within Splunk Enterprise just as you would the rest of your Splunk Enterprise data.
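As a hedged sketch of what ad hoc exploration of such a database looks like once a DB Connect connection has been configured, the dbxquery command can run SQL directly from the Splunk search bar (the connection name sportsbook_db and the table and column names are hypothetical):

```spl
| dbxquery connection="sportsbook_db"
    query="SELECT bet_id, event_id, customer_id, amount, bet_ts FROM bets"
```

For continuous ingestion, you would instead define a scheduled database input in the DB Connect UI, typically with a rising column such as a timestamp or auto-incrementing ID so that only new rows are indexed on each run.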

Step 2: Correlation of Data Sources, Enrichment and Generation of KPIs

In the following figure you can see the part of the pipeline to which this section is dedicated:

Data pipeline: correlation, enrichment and KPIs calculation

Each data source will be an SQL-type table. After analyzing the relationships between the tables, the correlations will be performed. The figure below shows an example of the relationships between fields of different tables.

Sports betting data model

These correlations will be made entirely in Splunk through basic SPL commands. As several fields need to be correlated across several tables, the chosen approach uses the eventstats and stats commands, relating fields from one table to another with the eval command. The SPL language is well suited to correlating time series, and the number of lines of code needed will be far lower than if SQL were used to perform these correlations. If you are interested in the code structure for performing the correlation, have a look at this example taken from Splunk Answers, which correlates different fields from 3 tables:


Sample code:
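A minimal sketch of the pattern (index, sourcetype and field names are hypothetical): events from the three tables are searched together, eval normalizes the shared key, and stats groups by that key, with each aggregate conditioned on its source table:

```spl
(index=betting sourcetype=bets) OR (index=betting sourcetype=events) OR (index=betting sourcetype=payouts)
| eval event_key=coalesce(event_id, id)
| stats values(league) as league
        sum(eval(if(sourcetype=="bets", amount, null()))) as total_staked
        sum(eval(if(sourcetype=="payouts", amount, null()))) as total_paid
        by event_key
```

The stats-by-common-key pattern is what replaces the SQL JOINs: each row of the result is one sports event with fields drawn from all three tables.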

KPIs can be calculated using the eval command. An example of the calculation of a subset of the KPIs:
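A hedged sketch of such KPI calculations with eval, continuing from the correlated event-level results (all field names are hypothetical):

```spl
| eval take_pct=round((total_staked - total_paid) / total_staked * 100, 2)
| eval pct_winning_bets=round(winning_bets / total_bets * 100, 2)
| eval avg_bet_amount=round(total_staked / total_bets, 2)
```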

Data can be enriched with useful information, such as the betting room name and geolocation of the betting rooms provided by the client, using Splunk’s lookup functionality, which allows you to enrich your event data by adding field-value combinations from lookup tables:
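A sketch of such an enrichment, assuming a client-provided lookup file betting_rooms.csv with room_id, room_name, latitude and longitude columns (file and field names are hypothetical):

```spl
| lookup betting_rooms.csv room_id OUTPUT room_name latitude longitude
```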

From the correlated data, a set of KPIs like the following can be constructed per sports event (note that this is just an example of interesting KPIs):


  • TAKE (%)
  • # BETS > 100/200€

Sports events KPIs

The correlated and enriched data and their KPIs at the sports event level should be transferred to a new summary index, using the collect command, to accelerate the consumption of analytics by dashboards and machine learning algorithms. The index with the raw data will remain intact, since it may still be needed for investigative searches:
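A sketch of writing the event-level KPIs to a summary index with collect (the index name betting_summary is hypothetical and must already exist; the stats fields are illustrative):

```spl
index=betting sourcetype=bets earliest=-12mon@mon
| stats count as total_bets sum(amount) as total_staked by event_id, league
| collect index=betting_summary source="sports_event_kpis"
```

Subsequent searches then read from index=betting_summary instead of re-aggregating the raw events, which is what accelerates the dashboards and the ML steps.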

Step 3: Fraud Risk Scoring Model Training using ML

In the following figure you can see the part of the pipeline to which this section is dedicated:

Data pipeline: fraud scoring model training

Now we will create a fraud risk scoring model based on anomaly detection in the different KPIs calculated in the previous section. To do that we will take 11 months of data and train the anomaly detection model. The ML tool to be used will be Splunk's Machine Learning Toolkit. The anomaly detector will be created for each KPI and each league based on its probability density function.

The probability density function determines the probability of a value being in a certain range based on past information. Basically, it generates a baseline for your data. This makes it a great tool for finding anomalies as it allows you to quickly determine if data sits in an expected range or not and you can find out more about this algorithm at this blog about finding anomalies with Splunk.

A code sample that generates the baseline for each KPI, taking 11 months of data, using the fit command against the summary index created in the previous section, looks something like this (only a small subset of the KPIs is included for simplicity):
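As a hedged sketch using the Machine Learning Toolkit's DensityFunction algorithm, one baseline per KPI is trained, split by league, over the first 11 months (the index, field and model names are hypothetical):

```spl
index=betting_summary earliest=-12mon@mon latest=-1mon@mon
| fit DensityFunction take_pct by "league" dist=auto threshold=0.01 into app:take_pct_baseline
| fit DensityFunction total_bets by "league" dist=auto threshold=0.01 into app:total_bets_baseline
```

Here dist=auto lets MLTK pick the best-fitting distribution per group, and threshold controls what fraction of the density is treated as anomalous (0.01 flags roughly the rarest 1% of values).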

Note that each KPI modeled with the fit command will in fact produce one submodel per League. That makes sense because each League has different betting patterns, and therefore a different probability density function and a different baseline. More fields like "League" can be added to model each KPI's behaviour at finer granularity, but keep in mind the right balance between computing efficiency and granularity.

Once we have our baseline for each KPI and each league, we can detect anomalies. In the next step, we will create a score based on the number of anomalies in an event, which will represent its fraud risk. The idea behind this is simple: the more anomalies in a sports event’s KPIs, the greater the risk of fraud.

Step 4: Application of the Scoring Model on Sports Events

In the following figure you can see the part of the pipeline to which this section is dedicated:

Data pipeline: fraud scoring model application

At this point, the anomaly detector will be tested with 1 month of data not used in its training. For each event, a score will be generated that accounts for its fraud risk by adding the anomalies detected in the different KPIs of the event. Let’s see some examples to make it clearer:

  • Event 1: anomalies detected in # BETS, # BETS > 100/200€ and % € WINNING BETS gives a fraud risk score of 3.
  • Event 2: no anomalies detected gives a fraud risk score of 0.
  • Event 3: anomalies detected in # LOST BETS, % € WINNING BETS, # BETS > 100/200€, TAKE (%), # DIFFERENT FORECASTS and % IDENTIFIED BETS gives a fraud risk score of 6.

To fine-tune the scoring model, each anomalous KPI should have a different weight based on its relative impact on fraud risk. For example, an anomaly in # BETS could have a weight of 1.5 and an anomaly in # BETS > 100/200€ a weight of 2. As a first approximation, all KPIs have weight 1.

A code sample that uses the apply command on the remaining month of data to test the model (again, only a small subset of the KPIs is included for simplicity):
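A hedged sketch of applying trained DensityFunction baselines to the held-out month (the index and model names are hypothetical):

```spl
index=betting_summary earliest=-1mon@mon
| apply take_pct_baseline
| apply total_bets_baseline
```

Each apply adds a boolean output field, e.g. IsOutlier(take_pct), marking values that fall outside the learned density for that league.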

A sample of the code for creating the fraud risk score from the outputs of the apply command, using the eval command:
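As a hedged sketch (the IsOutlier fields correspond to hypothetical KPI models, and the weights are illustrative), the score is a weighted sum of the outlier flags; fields with parentheses in their names must be quoted with single quotes in eval:

```spl
| eval fraud_risk_score = 1.0 * 'IsOutlier(take_pct)'
                        + 1.5 * 'IsOutlier(total_bets)'
                        + 2.0 * 'IsOutlier(bets_over_100)'
| where fraud_risk_score > 0
| sort - fraud_risk_score
```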

Step 5: Generation of Control Dashboards

In the following figure you can see the part of the pipeline to which this section is dedicated:

Data pipeline: control dashboards generation

Through Splunk's dashboarding capabilities, two dashboards have been generated:

  1. Sports Betting Fraud Dashboard
  2. Dashboard detail of KPIs by sports event

Some example snapshots of the dashboards that I generated with simulation data:

Sports Betting Fraud Dashboard:

Dashboard detail of KPIs by sports event:

What should come next?

The benefits demonstrated during the exercise were the following:

  • Splunk capabilities for exploratory data analysis
  • Easy-to-use Machine Learning Toolkit Splunk capabilities
  • Unique platform to store raw data, transform and store processed data and visualize it
  • Raw data always available, facilitating the implementation of new use cases
  • The power of SPL to work easily with time series and correlate many different data sources

On the other hand, the proposed solution is not intended to be the final production solution but a first setup to accelerate time to value. As explained above, Splunk indexes data exported from the relational databases containing data about the different sports events, data that was stored through a traditional batch ETL process transforming the sources' raw data into SQL-type tables. As a consequence, we neither have the raw data from the sources available in Splunk for building new use cases nor benefit from Splunk’s real-time indexing.

To continue maturing this initial setup, I would recommend:

  • Integrating Splunk directly with the data sources to have all the raw data available and take advantage of real-time indexing
  • Implementing use cases with real-time alarms during the sporting event
  • Operationalising the daily update of the scoring model (daily incremental retraining based on the sporting events of the previous day) to keep it up to date

Happy Splunking, 


Lucas Alados
