Ever wonder if you can use machine learning to predict the binary value of a field based on other fields in the same event? An example would be to predict if a cancer treatment has a chance to show progress (progress=yes vs. progress=no), or if a manufacturing assembly line will produce major defects (defect=true vs defect=false) based on certain conditions, or if it is likely to rain (precipitation=true vs. precipitation=false).
Last year, I wrote about how the free Splunk Machine Learning Toolkit could be used to predict the effectiveness of a drug based on patient demographics, using a fairly simple example set of data. In this blog, my goal is to do something similar, but the use case is predicting whether a trade settlement in the financial services industry will fail to complete on time. Recently, I’ve written about how the new T+1 compliance directive, which mandates that all USA trades starting in May 2024 be settled in at most one day, is the driving force behind wanting to provide resilience to the trade settlement process.
Before we get into the details, I want to generalize this machine learning approach to predicting categorical fields, rather than writing about the same pattern for every related use case, so that it can be reused across multiple industries. To keep the prediction simple and precise, we will predict the value of only one field (also called a feature or variable), and that field will take only a binary value. Let me start with the steps for the general purpose use case and then repeat them for predicting trade settlement failures.
The ability to get unstructured data ingested easily, extract fields at will, and use Splunk Search Processing Language (SPL) to analyze data is what makes the Splunk platform an easy choice for machine learning deployment.
General Use Case Steps
We will follow the steps outlined in this diagram.
1. For Splunk users, please download and install the free Splunk Machine Learning Toolkit (MLTK) app and its associated app, the OS platform dependent Python for Scientific Computing, from Splunkbase.
2. Ingest events for your use case into Splunk just like any other events, or use a CSV file with the relevant use case fields to create your training data. In fact, to make things easier, I suggest you prepare a CSV file of relevant fields from the ingested data (using the Splunk fields command along with the export button) to train your MLTK model. You will still ingest events into Splunk to apply your trained models against them.
3. Extract the fields relevant to your use case, especially from the real world data that will be used to predict the categorical value of a field. If the data is structured, Splunk can easily extract fields by itself. If the data is unstructured in an arbitrary format, you can use the Splunk Web interface to extract fields from the GUI; if you know how to use regular expressions, the extraction becomes even easier. Notice that we only extract the fields relevant to the use case, and we do it after the data is indexed. Although performance should always be considered, I will skip Splunk data models and data model acceleration for this topic so we can stay focused on the concepts.
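As a rough illustration of what regex-based field extraction does, here is a plain Python sketch against a made-up log line. The event format and field names are assumptions for illustration only, not a real Splunk extraction.

```python
import re

# Hypothetical unstructured trade event; the format is invented for this example.
raw_event = "2024-05-01 14:03:22 TRADE id=T-98211 cusip=037833100 shares=500 currency=USD"

# Named capture groups play the role of extracted Splunk fields.
pattern = re.compile(r"id=(?P<tradeID>\S+)\s+cusip=(?P<CUSIP>\S+)\s+shares=(?P<shares>\d+)")

match = pattern.search(raw_event)
fields = match.groupdict() if match else {}
print(fields)  # {'tradeID': 'T-98211', 'CUSIP': '037833100', 'shares': '500'}
```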
4. Consider using Splunk Lookups to annotate your events with relevant fields for your use case. The reason for this is to realize that not all your fields will be in the events and lookups will be used to enrich the search from external sources such as CSV files, databases, output of an API or Splunk KVStore. Examples of added fields may include a person’s address, a DNS name, details about an order, etc.
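Conceptually, a lookup is a left join from events to a reference table. Here is a minimal Python sketch of that enrichment, with hypothetical trade fields standing in for a real lookup file:

```python
# A lookup table keyed by tradeID, as a CSV lookup file would be.
# All field names and values here are made up for illustration.
lookup = {
    "T-98211": {"counterparty": "ACME Broker", "currencies": 2},
    "T-98212": {"counterparty": "Globex", "currencies": 1},
}

events = [
    {"tradeID": "T-98211", "shares": 500},
    {"tradeID": "T-99999", "shares": 100},  # no matching lookup row
]

# Enrich each event; events without a lookup row simply keep their original fields.
enriched = [{**e, **lookup.get(e["tradeID"], {})} for e in events]
```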
5. Train your model by using the guided workflow for experiments in the Splunk MLTK. For this general purpose binary category prediction, I’ll show an example from the MLTK showcase under the predicting fields section, specifically the predicting telecom customer churn example.
You need to provide a set of training data, which, as suggested above, could be a CSV file representing all the combinations that occur in the real world, and then pick an algorithm for training the model.
I tend not to use logistic regression, as it seems the least precise in many use cases, and I like to use the random forest classifier, as it tends to provide more precision. To do this accurately, consider trying out all the algorithms for your models and saving at least the two that are most precise. You can also add more algorithms from the Python for Scientific Computing library, if needed. Next, pick the field you want to predict and all the supporting fields from the event data that will influence the prediction. Press the fit button to run the test and save the result as a model. Below is an example of this process for predicting customer churn from the Splunk MLTK Showcase.
At the bottom of this same web page is the precision of your experiment. The closer you get to 1, the more accurate it is. Note that if all your predictions are 100% accurate, it may mean that your training data is biased toward a particular outcome (overfit) and may not reflect real world data. In addition, there is a “confusion matrix” on the bottom right of the page, which shows how many false positives and false negatives were found.
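For intuition, here is how precision and a binary confusion matrix are computed from predictions, sketched in plain Python. The labels are made up, using the trailing-period style of the MLTK churn showcase data ("True." / "False."):

```python
# Toy actual vs. predicted labels for a binary categorical field.
actual    = ["True.", "False.", "True.", "True.", "False.", "False."]
predicted = ["True.", "True.",  "True.", "False.", "False.", "False."]

# The four cells of the confusion matrix.
tp = sum(a == "True."  and p == "True."  for a, p in zip(actual, predicted))
fp = sum(a == "False." and p == "True."  for a, p in zip(actual, predicted))
fn = sum(a == "True."  and p == "False." for a, p in zip(actual, predicted))
tn = sum(a == "False." and p == "False." for a, p in zip(actual, predicted))

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall    = tp / (tp + fn)  # of everything actually positive, how much was caught
```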
6. Now that we’ve saved some models, let's show how they can be applied to real world data. For this step we’ll use SPL as we want to combine multiple models and assign a weight to each one. Here’s the general pattern.
index=some_index sourcetype=some_data_source <add more AND, OR, NOT to filter events>
| fields - predicted_field_name
| <add lookups to enrich with more fields>
| apply model1 as model_rf
| apply model2 as model_lr
| ...
| eval priorityscore_fieldname=if(model_rf="True.",100,0) + if(model_lr="True.",10,0) + … + <value of some other field>
| where priorityscore_fieldname > <some threshold>
| sort - priorityscore_fieldname
| table priorityscore_fieldname model_rf model_lr … <names of other fields>
What we are doing here is applying multiple models to the real world data to predict the categorical field. The reason we apply multiple models is to give each model a weight of importance based on its accuracy score. In this example, model1 is 10 times more important than model2, so it is assigned a weight of 100 while model2 is assigned 10. To round out the score, one of the fields from the data is added, as it may have a small influence on the accuracy of the predicted field. In our Splunk MLTK Showcase example, I added the number of customer service calls to the score, as that may also help explain why the customer is churning. Here’s the search using the Splunk MLTK Showcase data.
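The weighted scoring step can also be sketched outside of SPL. This hypothetical Python version mirrors the eval logic above; the model output names, weights, and the customer service calls field are illustrative assumptions:

```python
# Combine several model verdicts into one weighted priority score.
# Model names, weights, and fields are hypothetical stand-ins for `apply` output.
def priority_score(event, weights):
    score = sum(w for model, w in weights.items() if event.get(model) == "True.")
    # A raw data field can nudge the score, e.g. number of customer service calls.
    return score + event.get("CustServ_Calls", 0)

weights = {"model_rf": 100, "model_lr": 10}
event = {"model_rf": "True.", "model_lr": "False.", "CustServ_Calls": 4}
print(priority_score(event, weights))  # 104: random forest agreed (100) plus 4 calls
```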
There you have it. I have shown the general way to use the Splunk MLTK to predict the binary value of any categorical field. To solidify your understanding, I will list the steps one more time using Splunk SPL, but this time we will be predicting failed trade settlements based on the semantics of the business as opposed to technology issues.
Trade Settlement Failure Predictions
Using machine learning to predict whether a trade is going to fail in the settlement process, endangering its SLA, is not a new concept. An Internet search found articles from systems integrators and at least one top 20 USA bank that allude to the subject. Many things can delay a trade so that it does not complete before its settlement deadline; I will list some of them, as we can use them as features.
- Complex settlement instructions
- Multiple currencies involved
- Multiple Odd Lot Securities that historically have allocation issues
- Missing fields such as CUSIP
These are trades that have a higher probability of failing to meet their settlement deadline. Conversely, there are vanilla trades that have a simple lot (say, 100 or 1,000 shares), have no complex instructions, trade in a single currency, and have a history of trading without any semantic business issues. If we know this information, there is no need to force these “simple” trades through the machine learning gauntlet to see if they will miss their settlement deadline, as their probability of success is very high. With that in mind, before applying machine learning models to all real world trade events, it may be wise to collect only the trades that have a probability of having issues into a Splunk summary index, keeping the majority of trades away from the models entirely. Using Boolean logic, if a trade has any of the characteristics of a troublesome trade (which may not fail, but has a high probability of failing), it makes sense to schedule a search that collects these events into a trade issues summary index. This may reduce a dataset of one million trades to around 50,000 trades in the trade issues index, which can then be examined further by machine learning models and searches. I would encourage this step before applying machine learning models to all trades.
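The Boolean triage described above might look like the following in plain Python. The thresholds and field names are illustrative assumptions, not a prescription:

```python
# Route only "troublesome" trades toward the trade issues summary index.
# Thresholds and field names below are invented for illustration.
def needs_review(trade):
    return (
        len(trade.get("instructions", "")) > 200   # complex settlement instructions
        or trade.get("currencies", 1) > 1          # multiple currencies involved
        or trade.get("odd_lot", False)             # odd lots with allocation history
        or not trade.get("CUSIP")                  # missing identifier fields
    )

trades = [
    {"tradeID": "T-1", "CUSIP": "037833100", "currencies": 1, "instructions": "DVP"},
    {"tradeID": "T-2", "CUSIP": "", "currencies": 1, "instructions": "DVP"},
    {"tradeID": "T-3", "CUSIP": "594918104", "currencies": 3, "instructions": "DVP"},
]
flagged = [t["tradeID"] for t in trades if needs_review(t)]  # ['T-2', 'T-3']
```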
Clustering Trades by Characteristics
I am going to pause from our goal to introduce another way to partition out which trades should go to the trade issues index. Boolean logic may work very well, but one can also use machine learning to cluster trades by their characteristics. Without getting too deep into the details, what I’ve done below is use the Splunk MLTK to create a model that clusters trades by three features: number of shares, length of instructions, and a priori knowledge of allocation probability. Based on this, we can collect the trade IDs for one cluster, send them to the trade issues index, and apply our categorical prediction MLTK models to them. In real world situations, the number of fields that can influence the failure of a trade will be far more than three, but three are used here as an example. I realize that allocation probability is a feature that may not exist, as it may itself require machine learning to compute, but I use it here for illustrative purposes.
In case you are wondering, the algorithm used to create the clusters is KMeans, and the following SPL divides the data into two clusters.
| inputlookup trade_training.csv
| table instruction_length, number_of_shares, allocation_probability
| fit KMeans "allocation_probability" "instruction_length" "number_of_shares" k=2 into "trade_cluster"
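For intuition about what the fit KMeans command is doing, here is a minimal, hand-rolled k-means sketch in Python. It uses toy, pre-scaled data and a naive deterministic initialization; it is not how MLTK implements the algorithm:

```python
import math

# Minimal k-means: alternate between assigning points to their nearest
# centroid and recomputing each centroid as its cluster's mean.
def kmeans(points, k=2, iters=20):
    centroids = points[:k]  # naive deterministic initialization, fine for a sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Toy (instruction_length, number_of_shares, allocation_probability) rows,
# already normalized to 0..1 for illustration.
points = [(0.1, 0.1, 0.9), (0.2, 0.1, 0.8), (0.9, 0.8, 0.2), (0.8, 0.9, 0.1)]
centroids, clusters = kmeans(points)
```

With well-separated data like this, the two clusters recover the "short instructions, high allocation probability" trades and the "long instructions, low allocation probability" trades.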
Keep in mind that the use of clustering here is optional; if you can use Boolean logic to separate trades into a trade issues index to apply models, by all means do so. I have digressed from the main topic and will now return to it.
Fit the Model
I am not going to repeat all the steps above for gathering trade settlement events into Splunk, but keep in mind that an event logging the requirement to settle a trade with a trade ID will be missing many of the fields used to predict settlement failure, unless the developer logs all the required fields with the event. This is where lookups come into play: a Splunk lookup can enrich the event with the fields that will influence the probability of failure, and these fields can come from anywhere, such as a custom built CSV file, a referential database, a CMDB, other Splunk ingested events, or a third-party API.
However, as before, it is recommended to have a training set with these fields already populated in a CSV file. Fortunately, the use of lookups, the table command in SPL, and the export button in Splunk Web makes this an easy task. Let’s create a couple of models from training data, saving each one under a model name at the end. The field we will predict is called settlement, and its values will be yes and no: yes means the trade can settle in time, and no means it cannot, meaning the trade may fail to meet its settlement date. I will use the random forest classifier and Gaussian naive Bayes as the algorithms to predict the settlement field.
| inputlookup trade_training.csv
| fit RandomForestClassifier settlement from "allocation_probability" "currencies" "instructions" "CUSIP" "odd_lot" "inventory" "ETF" "shares" "watch_list" "settlement_date" … into "settlement_model_rf"

| inputlookup trade_training.csv
| fit GaussianNB settlement from "allocation_probability" "currencies" "instructions" "CUSIP" "odd_lot" "inventory" "ETF" "shares" "watch_list" "settlement_date" … into "settlement_model_gnb"
(Notice the … in the list of influencing fields; there will be more, based on your knowledge of what influences a trade settlement.) This experiment to create the model can also be run using the MLTK workflow GUI, just as before. Please note that since this is training data, the settlement field should be populated with actual yes and no values based on real world experience. Once you are satisfied with the precision results for the predicted settlement field from each model, it’s time to use both models in a weighted manner to predict whether a trade will fail.
Applying the Models to Trade Settlements
The MLTK apply command will be used to apply both models to real world datasets from an index that has already been populated with trade settlement request events suspected to have issues based on heuristics and Boolean logic. A simple example of a heuristic is to use Splunk to find any trade whose instruction length exceeds the average instruction length of the dataset plus three standard deviations; the thinking is that unusually long instructions will delay the trade settlement. As before, here’s the SPL to apply the models to trade settlement events in a weighted fashion, producing priority scores for trades that should be investigated for possible settlement failure because the models predict that result. I’ve added 10% of the instruction length (in characters) to the failure priority score in the SPL below.
index=trade_issues sourcetype=trades
| fields - settlement
| apply settlement_model_rf as settlement_rf
| apply settlement_model_gnb as settlement_gnb
| eval instruction_length=len('Instructions') * 0.1
| eval priorityscore_failure=if(settlement_rf="No",100,0) + if(settlement_gnb="No",10,0) + instruction_length
| where priorityscore_failure > 103
| sort - priorityscore_failure
| table priorityscore_failure settlement_rf settlement_gnb tradeID instruction_length allocation_probability … currency
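The instruction-length heuristic mentioned earlier (mean plus three standard deviations) can be sketched in plain Python; the lengths below are made-up values:

```python
import statistics

# Flag any trade whose instruction length exceeds the dataset's
# mean plus three standard deviations. Values are invented for illustration.
lengths = {
    "T-1": 40, "T-2": 55, "T-3": 48, "T-4": 52, "T-5": 61, "T-6": 45,
    "T-7": 50, "T-8": 58, "T-9": 47, "T-10": 53, "T-11": 49, "T-12": 400,
}

mean = statistics.mean(lengths.values())
stdev = statistics.pstdev(lengths.values())  # population standard deviation
threshold = mean + 3 * stdev

outliers = [tid for tid, n in lengths.items() if n > threshold]  # ['T-12']
```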
Finally, we have completed the use case: this search outputs a table of all trades that are predicted to fail based on two models and the length of the instructions. A trade settlement department can now start looking into where these trades are in the cycle to reconcile any issues before the trade settlement actually fails. The other thing to note is that once more is known about what causes failed settlement cycles for a trade, that information can be fed back into the training dataset to create more refined models. This approach would definitely help the T+1 regulation use case.
To plan for the future, once we have a dependable prediction of whether a trade will fail to meet its settlement date deadline, we can consider using the Splunk MLTK’s numerical forecasting algorithms to predict the volume of failed trades for future days. Seasonal variation has to be taken into account, as January volatility, end of quarter days, and other outside factors all affect this forecast. That topic goes beyond the scope of a citizen data scientist like myself, so I will leave the exercise to the experts rather than make this blog any longer than it already is. Of course, it is food for thought.
This blog explained the general use case of using the Splunk MLTK to predict any categorical field with a simple binary value. The real world cases for this are many, and the same pattern applies to each. In cybersecurity, categorically predicting a threat for a threat object comes to mind. In observability, categorically predicting scaling issues under heavy load is a use case. And as shown in this blog, the ability to predict whether a trade will settle by its settlement date based on the business semantics of the trade is a primary concern within the financial services industry. I hope this approach helps any Splunk user utilize machine learning to enhance their use cases for better business outcomes.