I sometimes get asked if Splunk can detect fraud. The answer is yes, but the question is broad: you first need to understand the situation you are trying to detect before making any generalization. Fraud here means using deceptive techniques for gain, which in most cases is illegal. The two textbook ways to detect fraud involve pattern matching or statistical anomalies (or a combination of the two).
Let me describe a real-life fraud detector. A few years ago, I worked for an enterprise software company that partnered with Mantas (which has since been acquired by Oracle) to detect money laundering activity. The software would load financial systems data into a database and run algorithms, called “scenarios,” to see if money laundering or other nefarious acts had been committed. Since money laundering happens over a period of time and the data involved was already in a database schema, this worked well.
However, in today’s Big Data world, the data often has no particular structure, and the urgency is to analyze it as events happen rather than wait for an extract, transform, and load (ETL) process. To generalize the Mantas example, I’d propose that Splunk can be used for fraud detection if it has access to the data that may suggest fraud and a subject matter expert has identified a pattern or statistical anomaly that can help detect it. To illustrate this point, I’ll first provide some examples from financial services. For each use case, I’ll provide a Splunk pseudo-search that matches the situation, which can be the starting point for designing an alert for the activity. I’ll spare the details of the data involved, as any arbitrary timestamped event could be used.
Fraud over Distance
Let’s start with what I call fraud detected because of distance. Fellow Splunker Ed Jividen described a use case where a large amount of money is transferred between two countries and the destination region is sparsely populated. This would be suspicious, as large wire transfers usually do not go to the middle of nowhere. A search that may detect this would be:
sourcetype=wire_transfer | lookup list_of_banned_regions dest_city_country OUTPUT is_suspicious | where amount>1000000 AND is_suspicious="yes"
This says: for each wire transfer transaction event, look up whether the destination city and country is suspicious by annotating the event with a new field, then check whether the transaction is greater than one million and going to a suspicious destination. You could use a Splunk macro to avoid hard-coding the million.
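To make the lookup-plus-threshold logic concrete outside of Splunk, here is a minimal Python sketch. The banned-regions set and sample transfers are hypothetical stand-ins for the `list_of_banned_regions` lookup table and the indexed events; the field names mirror the pseudo-search.

```python
# Hypothetical stand-in for the Splunk lookup table and the indexed events.
BANNED_REGIONS = {"Remoteville, Atlantis"}  # assumed lookup contents
THRESHOLD = 1_000_000  # the hard-coded million; a macro in real Splunk

def is_suspicious_transfer(event: dict) -> bool:
    """Flag a wire transfer above the threshold going to a banned region."""
    return (event["dest_city_country"] in BANNED_REGIONS
            and event["amount"] > THRESHOLD)

transfers = [
    {"dest_city_country": "Remoteville, Atlantis", "amount": 2_500_000},
    {"dest_city_country": "New York, USA", "amount": 2_500_000},
    {"dest_city_country": "Remoteville, Atlantis", "amount": 500},
]
flagged = [t for t in transfers if is_suspicious_transfer(t)]
```

Only the first transfer trips both conditions, which is exactly what the `where` clause in the search expresses.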
Another example of fraud at a distance is when the same user withdraws funds from an ATM more than once in the last 15 minutes from two different cities.
sourcetype=ATM action=withdrawal | transaction user maxspan=15m | eval location_count=mvcount(location) | where location_count>1 | stats values(location) by user
This uses the transaction command to group events by user in 15-minute spans; if a user withdrew funds from two different locations, it may be fraud. To further refine this and avoid false positives, you may want to incorporate the distance between the two locations by using a haversine add-on from the Splunk apps site.
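The distance refinement rests on the haversine formula, which the add-on computes for you. Here is a self-contained Python sketch of that formula; the coordinates and the 100 km plausibility cutoff are illustrative assumptions, not values from the search.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Two ATM withdrawals 15 minutes apart: plausible only if the cities are close.
dist = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)  # New York to Los Angeles
MAX_PLAUSIBLE_KM = 100  # hypothetical: how far a cardholder could travel in 15 minutes
likely_fraud = dist > MAX_PLAUSIBLE_KM
```

New York to Los Angeles is roughly 3,900 km, so two withdrawals 15 minutes apart in those cities would clear the cutoff by a wide margin.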
Another example of detecting fraud using distance is when a customer who does not travel uses a credit card outside their home country. The search for this is quite easy.
sourcetype=credit_transaction | lookup user_lookup user OUTPUT is_traveler home_country | where is_traveler=false AND home_country!=transaction_location
Credit Card Fraud
Since I just mentioned a credit card example, it is worth having a section devoted to this topic. Credit card companies have a vested interest in detecting fraud as quickly as possible, which means waiting for an ETL process to load data into a relational database before running long-running algorithms is not an option. They want to know as soon as events happen, even at the risk of false positives, as this is an acceptable trade-off for protecting consumers and themselves.
One common use case is when a credit card gets stolen and the thief first purchases a small item such as candy or gas to make sure the card works, then purchases a big-ticket item as quickly as possible before suspicions arise, all within a 15-minute span.
sourcetype=credit_transactions earliest=-15m | stats min(amount) as min max(amount) as max by user | where min<50 AND max>500 | table min, max, user
In this case, the user has purchased one item for less than 50 and another for more than 500 in less than 15 minutes. As before, you can use macros to avoid hard-coding the thresholds.
(Update: Splunker Ed Elisio described to me a way to use the streamstats command in conjunction with transaction to keep a running sequence number for each transaction grouping. Using this, our search for this example could be:
sourcetype=credit_transactions | transaction user maxspan=15m | search eventcount>1 | streamstats count as sequence | mvexpand amount | stats values(user) as user list(amount) as amount_values count(eval(amount<50)) as match1 count(eval(amount>500)) as match2 by sequence | where match1>0 AND match2>0 | fields - match1, match2, sequence
The transaction command groups events by user into 15-minute spans. Streamstats assigns a sequence number to each grouping. Mvexpand turns the multivalue field amount into individual events, and stats can then find the groupings that contain both an amount less than 50 and an amount greater than 500.)
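The small-then-large pattern can be sketched in plain Python as well. This is a rough, hypothetical stand-in for what transaction and stats do above: the event tuples, thresholds, and window logic are illustrative assumptions, not the actual Splunk implementation.

```python
from collections import defaultdict

SPAN = 15 * 60       # 15 minutes, in seconds
SMALL, LARGE = 50, 500  # thresholds; macros in real Splunk

def flag_test_purchases(events):
    """Return users with a sub-50 and an over-500 purchase within 15 minutes.

    events: iterable of (user, epoch_seconds, amount) tuples.
    """
    by_user = defaultdict(list)
    for user, ts, amount in events:
        by_user[user].append((ts, amount))
    flagged = set()
    for user, txns in by_user.items():
        for ts, _amt in txns:
            # All of this user's purchases in the 15 minutes starting at ts.
            window = [a for t, a in txns if ts <= t <= ts + SPAN]
            if min(window) < SMALL and max(window) > LARGE:
                flagged.add(user)
                break
    return flagged

events = [
    ("alice", 0, 3.50),      # candy bar to test the card
    ("alice", 300, 899.00),  # big-ticket item five minutes later
    ("bob", 0, 20.00),
    ("bob", 4000, 700.00),   # second purchase falls outside the 15-minute span
]
```

Only alice has both purchases inside one 15-minute window, so only she would be flagged.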
Another use case is when a user who usually purchases from a set of categories suddenly purchases from a category they never have before. For instance, the user usually purchases from groceries, electronics, and restaurants. This time, they purchased an expensive item from “fine_arts,” which is not among their usual categories. A lookup can be used to find their usual categories.
sourcetype=credit_transaction NOT [search sourcetype=credit_transaction | lookup user_lookup user OUTPUT categories | makemv categories | rename categories as category | fields category]
This search creates a multivalue field of usual categories, renames it to category (which is a field in the data), and passes it to the outer search, which also has a category field. If no results come back, great. If a result comes back, you have at least one purchase in a category that is not among the usual categories.
One of the simplest characteristics of suspicious behavior is a large number of transactions in a very short period. For instance, 50 purchases in a day may sound suspicious.
sourcetype=credit_transaction earliest=-1d | stats count by user | where count>50
A variation of that search would be to find all users whose purchase count in the last 7 days exceeds the average by more than three standard deviations.
sourcetype=credit_transaction earliest=-7d | stats count by user | eventstats avg(count) as avg_count stdev(count) as std_count | where count>avg_count+(3*std_count)
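The eventstats step computes the mean and standard deviation across the whole population of users and then compares each user against them. A minimal Python sketch of that three-sigma test, with hypothetical per-user counts, looks like this:

```python
from statistics import mean, stdev

def outlier_users(counts, sigmas=3):
    """Return users whose count exceeds the population average by more
    than `sigmas` sample standard deviations (an eventstats analogue)."""
    avg = mean(counts.values())
    std = stdev(counts.values())
    return {user for user, c in counts.items() if c > avg + sigmas * std}

# Hypothetical per-user transaction counts over the last 7 days:
# nineteen ordinary users plus one heavy outlier.
counts = {f"user{i}": 10 + (i % 3) for i in range(19)}
counts["mallory"] = 95
```

Note that a single extreme value inflates both the mean and the standard deviation, so with very small populations a three-sigma test can miss obvious outliers; the search works best over many users.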
Fraud in Other Industries
Financial services is usually the most obvious place to investigate fraud because that’s where the money is. However, fraud can pervade any industry. Splunker Mark Seward often gives examples of such behavior in his talks. One of his examples covers the fraud-over-a-distance use case, where a doctor prescribes a large amount of medication to a patient who happens to live on the same block.
Another example that also covers fraud at a distance is when a patient tries to use the same prescription twice (by claiming to have received a fax copy) at different pharmacies. Fortunately, this usually gets caught quickly.
Speaking of pharmacies, I’ve heard of one example where a staff member removes one narcotic pill per N prescriptions, thinking no one would notice. This is similar to the banking example where a staff member automates the deduction of a few pennies per random deposit. In the pharmacy example, security cameras will probably detect this slow, random fraud before it even gets to data analysis.
One final example is when a hypochondriac decides to visit many different doctors at the same time. This is very simple to detect and uses the same search as the large number of purchases over a short period of time, as described above.
Entire books can be, and indeed have been, devoted to this subject. What I wanted to show you is the power of the Splunk search language to detect fraud in arbitrary data, with patterns you can reuse across a number of industries. All it takes is access to the log events and someone who knows what constitutes fraud to automate the use case.