2016 was another year of steady growth in cyberattacks and a year of big losses to fraud across many industries: from e-commerce and healthcare to banking, insurance and government sector.
User account takeovers, credentials theft, and online payment method takeovers have been, and continue to be the primary ways fraudsters cause big dollar losses and reputation damages to businesses throughout the years. More than 50% of successful damaging attacks are initiated with valid user credentials.
Geo location? Threat intelligence feeds? Device recognition? IP address correlation? Attackers are well aware of these detection techniques.
Fraudsters are becoming increasingly more sophisticated. Within minutes they can leverage worldwide clouds of cheap virtual machines to conceal their true physical locations and utilize specialized device identity spoofing tools in attempts to deceive the most sophisticated fraud detection systems as well as human experts.
Identifying malicious actor or suspicious transaction by IP address, computer device identity or browser user agent anomalies is becoming more challenging.
To build more sophisticated defenses we need to consider enriching traditional machine data sources with information that reflects unique and complex behavior patterns of a person behind the screen. That’s when we’ll start finding something that fraudsters and cybercriminals cannot steal or fake.
Humans are using computer systems in a ways that are unique and consistent with their own biology and physiology. The way humans click and type, the way humans use mouse and other input devices are pretty consistent with that person’s own behavior, habits, education level, and familiarity with a service or system.
Habits and behaviors are very difficult to change and if we can identify legitimate users by their typical behavior patterns - we can detect anomalies on a totally new level. Same goes with fraudsters - ability to identify and quantify behavior patterns of cyber criminal will allow us to uncover and neutralize threats that may be undetectable by other means.
The question is: can we recognize the user - or a class of users by some unique ways they use their input devices, such as mouse or keyboard? Enter Behavioral Biometrics: the field of study related to measure of uniquely identifying and measurable patterns in human activities.
This post will show one of the ways you can implement advanced user behavior biometrics solution for security based on Splunk and one of the Deep Learning frameworks.
Attempts to discover and classify behavior biometrics patterns were attempted by a number of industry players. Let’s consider the task of matching user to mouse activity. Traditional detection system executes complicated actions of feature extraction, data measurements and normalization. For each mouse movement event, the system would apply artificial trajectory smoothing, measure multiple points of velocity, acceleration, curvature, relative distances, inflection points, etc...
After such heavy pre processing traditional machine learning techniques are applied to extracted features. Model is trained and subsequent predictions are made.
The limitation of such approach is that complexity of task is artificially reduced to a subset of calculated features and the rest of data is ignored.
This almost guarantees that more complex, subtle, and yet very personal behavior patterns that are naturally present in the raw data will be missed.
To detect, extract and recognize infinitely complex behavior patterns that might be present in a treasure trove of user behavior data we need to look at complete, unfiltered datasets through data science that must go beyond traditional algorithms based on a subset of engineered features.
But how do we get access to detailed datasets representing user activity?
Just like with any other data source - it’s easy to get precise, fine-grained end user input device activity directly into Splunk.
Thanks to our talented senior software engineer Oleg Izmerly - here is the complete source code that demonstrates how to do exactly that.
Knowing that each mouse movement generates X and Y coordinates of mouse pointer and also a timestamp - we can collect this data and send it to Splunk to enrich traditional data sources, like clickstream, web and application logs. This way user’s session activity data will also contain information about potentially unique behavior patterns.
Any mouse activity generates an event every 5-10 milliseconds - so some mouse interaction sessions may contain 5,000-10,000 events per user per page. That’s a lot of data to analyze!
To tackle this challenge we’ve decided to aggregate and convert mouse activity data points to color images. After all, a mouse always moves along a flat, 2D surface and it’s natural to convert these set of events into single image without losing much details. A single image can represent thousands of mouse activity data points.
We’ve devised a specialized high contrast color encoding algorithm to represent direction, speed, and acceleration of mouse movements in a consistent way.
This, for example, has helped us represent the difference in patterns between left and right-handed people. Mouse click actions were represented in larger color circles and mouse click+drag movements were drawn in thicker lines.
This approach enabled us to encode not only trajectory lines but also to visually represent multidimensional dynamics of human behavior in all possible details ready to be analyzed and classified.
Once high contrast images are generated, we decided to utilize Deep Learning approach where all images are processed by specialized multi-layer image recognition algorithms that are designed to detect every possible pattern that may be contained within an image.
Deep Learning is a class of machine learning algorithms that leverage sequences of many functional layers with multiple units (neurons) and a special, non-linear, differentiable activation functions.
The subset of Deep Learning algorithms that have proved to be very efficient for image recognition tasks is called Convolutional Neural Networks. Convolutional Neural Networks can automatically discover features, shapes and patterns that are important for the given classification task. When dealing with very complex, unknown fraud and attack patterns, such approach represents a huge advantage as compared to manual feature engineering.
To build a custom neural network architecture to recognize and classify mouse movement activity we’ve picked an open source Deep Learning framework TensorFlow and high level deep learning abstraction library: Keras.
Keras allows to implement complete deep learning image recognition architecture in 50-100 lines of Python code. Due to its excellent abstraction capabilities Keras also allows you to use other deep learning frameworks, such as Theano with very little code changes.
We’ve leveraged inexpensive (~$350 as of this writing) mid range graphics card with 6GB of GPU RAM as a hardware accelerated processor to fully train deep learning network on mouse movement images.
Utilizing GPU for matrix-intense computations helps to speed up the training process 10x-100x times as compared to CPU-only approach.
For our data source, we’ve gathered live data feed coming from an actual financial information services web portal generated by real real visitors.
After setting up data ingestion architecture as described - Splunk was ingesting detailed mouse activity events from each site visitor.
Simple python script ran on a scheduled basis to retrieve mouse activity data from Splunk via API and generated several thousand images with mouse activity encoded and rendered in colored lines according to algorithm.
The data was ready for analysis and visitor classification!
The first task was to prove that deep learning network can be trained to recognize mouse movements of two distinct groups of users: regular customers of financial information services business from non-customers while accessing similar pages.
Our guess is that patterns of behavior of people who are first time visitors are somewhat different from members who are generally more familiar with the portal.
It takes certain degree of learning for a “stranger” to understand the structure of a portal. And such learning curve comes with a mouse activity patterns that might be different from patterns exhibited by regular customers who are in general more efficient in finding and getting access to the information they need.
The architecture of neural network we’ve chosen roughly resembles successful VGG16 architecture for image recognition. Standard VGG16 architecture was further optimized for specifics of the dataset of non-natural images as well as for a limited size of our dataset.
With input of total 2,000 images for training and 800 images for validation (1000 + 400 for each class) - it took about 2 minutes for such neural network to be trained to achieve about 81% of validation accuracy:
Which means that pre-trained system based on this deep learning network architecture can recognize actual customer from non-customer with 80%+ accuracy given input device activity it never saw before.
Which is quite significant for fraud and threat detection scenarios sensitive to “foreigners” claiming to be legitimate users.
We ended up having a neural network architecture suitable for behavior biometrics classification task of mouse movements between different user groups!
Individual user classification
Second task was more challenging. We wanted to determine how suitable deep learning approach may be in recognizing the individual user by his/her mouse movements.
We also added extra difficulties to this task to make it more realistic:
- The data set size for the task was extremely small: we had only 360(180+180 per class) images for training + 180(90+90 per class) images for validation. In the world of deep learning - this is *extremely* small dataset.
- We had one set of 180 training images representing mouse activity of a specific user - a portal member. Another set of 180 images belong to other people.
- “Other” people were also members of a portal.
- “Other” people exhibited very similar activity and behavior across the portal.
- “Other” people were using devices with dimensions that are similar to the first member - which means physical activity with mouse input device was as close as possible to the first member.
With all these conditions our neural network was facing very challenging task of recognizing an individual human only by his/her mouse movements after training on a very limited data set to begin with.
We had to modify neural network architecture even further with specific adjustments to make it stable in training on a very small data sets and to prevent overfitting.
We were still using the concepts borrowed from VGG16 convolutional neural network architecture, however an updated network had more layers to capture more complex patterns with, plus we added a number of measures to greatly reduce overfitting.
Such modified neural network also required more training epochs to generalize results:
After training in that scenario we were able to achieve 78% validation accuracy in recognizing individual user by his mouse movements on images that network never saw before.
It took network around 3 minutes of training to achieve these results. With extremely small number of samples intermediate iteration results were somewhat “jumpy” however it was to be expected.
Collecting more samples likely would improve these results even further.
Here is a summary and potentials of behavior biometrics:
- We can increase our capabilities to detect attackers and match legitimate user or group of users to their unique fine grained behavior patterns. This evolves to less false positives and more true negatives. This approach can also be used to reduce friction for legitimate users which is important for large financial institutions and e-tailers.
- We can reduce our typical dependency on IP addresses, device ID’s and browsers as all of these can be spoofed by determined attackers.
- We can increase our capabilities to identify sophisticated malwares, presence of malicious remote access toolkits and outright bad actors who are trying to stay under the radar.
- Even without deep learning - the fact of absence of mouse activity data could alert about the presence of malicious actor.
- Behavior biometrics match score data can be sent back to Splunk and enrich existing functions to offer more precise risk scoring.
- It is very easy to collect behavior biometrics data with Splunk and combined with Deep Learning it offer great potentials to boost capabilities of security and anti-fraud applications built on top of Splunk.
Oleg Izmerly (Sr. Software Engineer, Splunk)
Adam Oliner (Director, Engineering, Splunk)