Building a $60 Billion Data Model to Stop US Healthcare Fraud (Part 1)

At Splunk, we look at big problems as data challenges; solutions are always data-driven. The patterns of fraud are always hidden within the data—no matter how sophisticated fraudsters are in how they abuse the system and stay under the radar. Solutions to all problems big and small start with gathering data and seeing the big picture through the data analytics lens.

Given the complexity and sophistication of modern US healthcare programs, healthcare fraud costs US taxpayers a staggering amount, approaching one hundred billion dollars per year.

"Health care fraud costs the United States tens of billions of dollars each year. Some estimates put the figure close to $100 billion a year."

– United States Department of Justice

The Centers for Medicare & Medicaid Services (CMS) published datasets on providers' drug prescriptions billed to Medicare. The data used in this research and demo covers 2014 and contains approximately 24 million real records summarizing providers' activity billing Medicare for specific drugs on behalf of their patients. CMS also made available supplementary datasets, such as payments from drug manufacturers to providers (cash, stock shares and investments), as well as a very interesting dataset called "exclusions." The exclusions dataset lists providers who were excluded from participation in Medicare due to miscellaneous violations, misrepresentations or outright fraudulent actions related to drug prescription and billing practices.

We decided to use the data analytics capabilities of Splunk Enterprise to study these datasets. These are the problems we’re trying to solve:

  1. Discover anomalous and potentially fraudulent providers.
  2. Predict whether a given provider will be excluded from participation in Medicare.

Each task relies on the filtering abilities of the Splunk Search Processing Language (SPL). We also used the machine learning capabilities of the Splunk Machine Learning Toolkit: unsupervised learning (KMeans clustering) to discover clusters of anomalous providers, and supervised learning to predict exclusions.

To solve the above tasks, we built a custom app that contained two dashboards:

1. General Investigation dashboard to find anomalies: "Healthcare Provider Claims Anomaly: Prescription Drugs"

2. Detailed analysis of each healthcare provider: "Healthcare Provider: Detailed Profile Analysis"

The “General Investigation” dashboard lets customers do the initial pre-filtering, such as selecting a provider specialty, an optional geographic region (state or city) or a drug predisposition.

Once the initial filtering is applied, an SPL query extracts the relevant data from millions of healthcare claim records, selects the providers matching the criteria, and then applies a machine learning clustering algorithm to the data. We used the KMeans algorithm to cluster all providers by their Medicare drug billing and prescription behavior and to assign each provider to the relevant cluster.
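To make the filter-then-cluster flow concrete, here is a minimal Python sketch. It is not the SPL query or the Machine Learning Toolkit configuration used in the app: the provider records, the two features (total claims and narcotics share), the min-max scaling and the choice of k=2 are all assumptions made for illustration.

```python
import random
from math import dist

# Hypothetical records: (provider, specialty, total_claims_usd, narcotics_share)
PROVIDERS = [
    ("A", "Interventional Pain Management", 300_000, 0.10),
    ("B", "Interventional Pain Management", 320_000, 0.12),
    ("C", "Interventional Pain Management", 310_000, 0.11),
    ("D", "Interventional Pain Management", 6_000_000, 0.80),
    ("E", "Pediatric Medicine", 50_000, 0.01),
]

def prefilter(records, specialty):
    """Keep only the providers in the selected specialty."""
    return [r for r in records if r[1] == specialty]

def scale(rows):
    """Min-max scale every feature column to [0, 1] so dollars do not dominate."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

selected = prefilter(PROVIDERS, "Interventional Pain Management")
features = scale([(r[2], r[3]) for r in selected])
labels = kmeans(features, k=2)
for record, label in zip(selected, labels):
    print(record[0], "-> cluster", label)
```

In this toy data, providers A, B and C end up sharing a cluster while D sits alone in its own, which is the kind of isolation the dashboard surfaces.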

After the clustering step is completed, the analytical dashboard applies two final filtering steps:

  1. Filter by drug predisposition
  2. Filter by drug anomaly threshold to narrow the results to the most relevant and most anomalously behaving providers

Drug predisposition is a powerful filtering option with two selections: "opioids/narcotics" and "most expensive/branded." This filter uses an SPL analytical query to discover healthcare providers who tend to prescribe, and bill Medicare for, either narcotics or very expensive/branded drugs in an excessive manner.

"Drug anomaly threshold" selects only providers who exhibit excessive prescription behavior skewed toward certain drugs. These two filters prove very effective at discovering anomalous and, quite often, fraudulent providers.

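One way to picture the predisposition and threshold filters is as a share-of-claims calculation. The sketch below is an assumption about how such a filter could work, not the SPL query behind the dashboard: the claim rows, the `OPIOIDS` drug group and the 25% threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical drug group standing in for the "opioids/narcotics" selection.
OPIOIDS = {"OXYCONTIN", "FENTANYL"}

# Hypothetical claim rows: (provider, drug, amount_billed_usd)
CLAIMS = [
    ("P1", "OXYCONTIN", 400.0),
    ("P1", "LISINOPRIL", 600.0),
    ("P2", "LISINOPRIL", 950.0),
    ("P2", "FENTANYL", 50.0),
    ("P3", "ATORVASTATIN", 1000.0),
]

def group_share(claims, drug_group):
    """Fraction of each provider's billed dollars that falls in drug_group."""
    total = defaultdict(float)
    in_group = defaultdict(float)
    for provider, drug, amount in claims:
        total[provider] += amount
        if drug in drug_group:
            in_group[provider] += amount
    return {p: in_group[p] / total[p] for p in total if total[p] > 0}

def over_threshold(shares, threshold):
    """Providers whose share meets or exceeds the anomaly threshold."""
    return sorted(p for p, share in shares.items() if share >= threshold)

shares = group_share(CLAIMS, OPIOIDS)
flagged = over_threshold(shares, threshold=0.25)
print(flagged)
```

Here only P1, with 40% of billed dollars in the hypothetical opioid group, passes the threshold.
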
The results are visualized on a geographic map as well as in a table where data can be sorted by any column. The most productive way to investigate anomalies and discover potentially fraudulent activity is to sort either by total dollar amount of claims per year or by cluster size.

The cluster number is a value assigned to each provider when the machine learning clustering algorithm is applied.

Anomalies are usually represented by the smallest, most isolated, "remote" clusters.

The majority of the typical, "normally" behaving crowd is concentrated within the larger clusters.
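Sorting by cluster size can be sketched in a few lines of Python. The assignments below are hypothetical; in the app, the cluster numbers come from the KMeans step.

```python
from collections import Counter

def rank_by_cluster_size(assignments):
    """Return (provider, cluster, cluster_size) tuples, smallest clusters first.

    assignments maps each provider to its cluster number.
    """
    sizes = Counter(assignments.values())
    rows = ((p, c, sizes[c]) for p, c in assignments.items())
    # Sort by cluster size, then by provider name for a stable order.
    return sorted(rows, key=lambda row: (row[2], row[0]))

# Hypothetical cluster assignments for seven providers.
ASSIGNMENTS = {"P1": 0, "P2": 0, "P3": 0, "P4": 0, "P5": 1, "P6": 2, "P7": 2}

ranked = rank_by_cluster_size(ASSIGNMENTS)
for provider, cluster, size in ranked:
    print(provider, cluster, size)
```

The provider in the single-element cluster comes first, followed by the pair sharing a two-element cluster, with the large "normal" cluster at the bottom.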

The General Investigation dashboard has an (experimental) ability to visualize provider anomalies in 3D space, which helps in understanding an anomaly through its relationship to typical provider behavior.

The 3D representation shows providers grouped into clusters, with each healthcare provider represented by a colored dot. An anomalous provider stands outside of the large crowd.

Here's a visual representation of all nationwide providers in the "Interventional Pain Management" specialty, with the "drug predisposition" filter set to "narcotics," who bill Medicare more than $250,000 per year.

As you can see, alongside other anomalies there are two providers that stand outside of the baseline group and who were later convicted of fraud. The ability to pre-filter, cluster and sort to narrow down the providers for further investigation helps make an investigator's job more efficient and reduces false positives.

Going back to the General Investigation dashboard to investigate all nationwide providers of "Interventional Pain Management": if we filter providers by a “Total Claim” amount exceeding $1M, we see only 20 providers matching the filtering criteria.

The dashboard filters allowed us to select the 20 most anomalous providers nationwide out of more than 46,000. Upon further investigation of these results, we actually found that 4 out of these 20 providers were convicted of fraud by the Department of Justice.

The 2 most anomalous providers (listed at the very top of the table) belong to the smallest clusters, with only 1 and 2 elements each. In addition, the top anomaly, "John Couch," also ranks first by total dollars billed to Medicare: more than $6M per year.

Another interesting use case in fraud analytics is to select all nationwide providers in the Pediatric Medicine specialty with a "narcotics" drug predisposition who billed Medicare more than $100,000 per year. It is unusual for pediatricians to prescribe lots of narcotics to their patients, so the resulting table shows only 3 results selected out of almost 55,000 providers nationwide.

The top, most anomalous entry, Dr. Daniel Cham, belongs to the smallest (most anomalous) cluster, and he also billed Medicare the largest amount: $1.2M.

Upon rendering his profile in the "Detailed Profile Analysis" dashboard, we can clearly see that his prescription behavior deviates strongly from typical pediatrician behavior. Out of $1.2 million in total claims, $347,000 was for Oxycontin (a strong, addictive narcotic), which represents about 27.5% of his total yearly claims. This is about 50 times higher than the typical behavior of providers in the same specialty.
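As a back-of-the-envelope check, the figures quoted above imply a baseline that the text does not state directly. The short Python calculation below only rearranges the quoted numbers; the "implied" values are inferences, and they assume the 27.5% figure was computed from unrounded totals.

```python
# Figures quoted in the text.
oxycontin_claims_usd = 347_000
oxycontin_share = 0.275   # "about 27.5%" of total yearly claims
ratio_to_typical = 50     # "about 50 times higher" than typical

# Inferred, not stated in the text: the typical share for the specialty.
implied_typical_share = oxycontin_share / ratio_to_typical

# Inferred, not stated in the text: the unrounded total behind "$1.2 million".
implied_total_claims_usd = oxycontin_claims_usd / oxycontin_share

print(f"implied typical share: {implied_typical_share:.2%}")
print(f"implied total claims: ${implied_total_claims_usd:,.0f}")
```

In other words, a typical provider in the specialty would have roughly half a percent of claims in this drug, and the unrounded total would be a little above the quoted $1.2 million.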

Clicking on his name opens a Google page showing search results. One of the first results is a US Justice Department page showing that this doctor has been convicted of illegal distribution of an addictive painkiller and laundering the proceeds of his drug trafficking.

The "Claims Anomaly" dashboard includes the study of 9 real-world use cases and 11 providers who exhibited strong anomalies and were later (sometimes years later) convicted of fraud and other crimes. This demonstrates the power of Splunk analytics and Machine Learning to not only detect anomalies but also uncover real fraudulent activity hidden within data.

The full demonstration and examples of the functions described so far will be available within the Splunk Security Essentials for Fraud Detection app, announced earlier this week during .conf2017.

In part two of this series, we explore predicting provider exclusion via supervised learning.

Follow all the conversations coming out of #splunkconf17!

Special thanks to:
Manish Sainani, Director - Product Management, Machine Learning & Advanced Analytics, Splunk
Iman Makaremi, Sr. Data Scientist, Splunk
Alexander Johnson, Sr. Software Engineer, Machine Learning, Splunk

Posted by Gleb Esman

Gleb Esman is Sr. Product Manager for Fraud Detection at Splunk.

With a technical background in analytics, security research and development, Gleb helps to guide product development efforts in the areas of fraud detection, analytics and investigations.

With experience in security research and building fraud detection, analytics and investigation applications at a major financial institution, Gleb helps ensure that Splunk customers will get the best of breed, cutting edge solutions to tackle costly challenges with fraud across multiple industry verticals.

Gleb is an author of patent applications in the area of deep learning, security and behavior biometrics.