Simple Machine Learning to Predict Drug Effectiveness (Based on Patient Demographics)
Note: The inspiration for this paper is from the Splunk ML team, who wrote a great document on how to demo the Splunk MLTK. This approach is based on their findings.
A couple of years ago, I was tinkering with a Splunk Machine Learning Toolkit demo and it dawned on me that the simple but powerful ideas behind logistic regression and random forest algorithms can predict many categorical things without requiring a Ph.D. in data science. As healthcare was, and still is, on everyone’s minds, I chose to predict which fields (features) may contribute to the results of drug trials using a machine learning approach. Here’s the disclaimer: neither I nor Splunk claims expertise in running drug trials or interpreting their results. Mathematics itself is an exact discipline, but the inclusion or exclusion of data sets, the inclusion or exclusion of fields in the data, bias in the data itself, and malformed analysis methodologies are beyond the scope of this blog entry. Use this blog entry as an idea for testing your dataset for predictability, but only within the context of your own proven methodology.
Now that the disclaimer is out of the way, I’ll mention that this blog entry highlights a simple approach to predicting the effectiveness of a drug used in clinical patient trials, using sample data in CSV format and the Splunk Machine Learning Toolkit. It is assumed that you have access to relevant data and the Splunk MLTK to replicate the results here. If you have not used the Splunk MLTK, here’s some background information to get you started.
For the purposes of this example, I have used the following fields to represent patient data. Of course, in real life, there may be many more features. Please also note that all data presented here is strictly fabricated for the purpose of illustration.
The Effective field is the actual result of whether the drug worked. What our fictitious, but educational, trial wants to see is which fields in the data help predict whether the drug will be effective. By effective, I mean that a positive treatment result is achieved. The Daily_Dosage field is an artificial dosage given in mg. All other fields, besides Age, are binary. What we are interested in doing is categorizing the Effective field as Yes or No based on the values of the other fields. In my sample data, I purposely skewed the sample set so that Effective equals Yes when the patient is young (less than 50) and a non-smoker. All other fields have random values. In real life, the data would not be skewed this way, as the results would come from actual observations.
The first thing I did was use logistic regression to create a model that predicts the Effective field’s value, using 80 percent of the data to build the model and 20 percent to test it. I kept the data set to 100 records to make it easy to follow. Here are the results using the experiment portion of the Splunk MLTK:
Although all the fields were used, you’ll notice 93% precision in the results because, as I mentioned earlier, the data is skewed to align well with the Age and Smoker fields. If we run a search with our fit command to see how many times the Effective field’s value did not match the predicted(Effective) field’s value, we’ll see that it only failed 10 times, which is roughly in line with our precision number.
| inputlookup Drug_Trials1.csv | fit LogisticRegression fit_intercept=true "Effective" from "Age" "Daily_Dosage" "Immune_Deficient" "Overweight" "Sex" "Smoker" "Ventilator" into Trial_One_LR | rename predicted(Effective) as pred | where pred!=Effective
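For readers without a Splunk instance handy, the logic behind this fit-then-compare search can be sketched in plain Python. Everything below is fabricated for illustration: the coefficients are hand-picked stand-ins for what `fit LogisticRegression` would actually learn from the training data, chosen so the decision boundary roughly matches the skew described above.

```python
import math
import random

random.seed(42)

# Fabricated records mirroring the blog's skew: Effective is "Yes"
# only when the patient is under 50 and a non-smoker.
def make_record():
    age = random.randint(20, 80)
    smoker = random.choice([0, 1])
    effective = "Yes" if age < 50 and smoker == 0 else "No"
    return {"Age": age, "Smoker": smoker, "Effective": effective}

records = [make_record() for _ in range(100)]

# Hypothetical coefficients standing in for a learned model; the
# intercept puts the decision boundary near Age 45 for non-smokers,
# so a few borderline records will mismatch.
WEIGHTS = {"Age": -0.2, "Smoker": -5.0}
INTERCEPT = 9.0

def predict(rec):
    z = INTERCEPT + sum(WEIGHTS[f] * rec[f] for f in WEIGHTS)
    p = 1.0 / (1.0 + math.exp(-z))  # sigmoid -> probability of "Yes"
    return "Yes" if p > 0.5 else "No"

# Mirror of `where pred!=Effective`: keep only the mispredictions.
mismatches = [r for r in records if predict(r) != r["Effective"]]
print(f"{len(mismatches)} mismatches out of {len(records)}")
```

Because the labels follow the age/smoker rule so closely, only a handful of borderline records mismatch, which is the same effect the MLTK experiment reports as high precision.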
Now, let’s apply this model to a randomly generated data set with 100 records. Because there is no rhyme or reason to this new data set, the number of mismatches will be large, at roughly 50%.
| inputlookup Drug_Trials2.csv | apply Trial_One_LR | rename predicted(Effective) as pred | where pred!=Effective | fields - Effective
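The ~50% mismatch rate is exactly what we should expect, and a quick Python sketch shows why. The labels below are random and independent of the features, so any fixed model, here a hypothetical stand-in for the rule the first model effectively learned, agrees with them about half the time by pure chance.

```python
import random

random.seed(1)

# Second data set: both the features and the Effective labels are
# random, so predictions should disagree with the label about half
# the time no matter how good the model is.
records = [{
    "Age": random.randint(20, 80),
    "Smoker": random.choice([0, 1]),
    "Effective": random.choice(["Yes", "No"]),  # random label
} for _ in range(100)]

# Stand-in for `apply Trial_One_LR`: the rule the first model
# effectively learned from the skewed training data.
def predict(rec):
    return "Yes" if rec["Age"] < 50 and rec["Smoker"] == 0 else "No"

mismatches = sum(1 for r in records if predict(r) != r["Effective"])
print(f"{mismatches} mismatches out of 100")  # hovers around 50
```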
Although the prediction on this “non-sample” data is only 50% correct, it may still be useful in times of epidemics, when this particular drug shows at least some chance of being effective.
A Random Forest algorithm can predict a categorical value with higher precision than logistic regression (depending on the data size) because it randomly divides the data among decision trees and combines their results, so that no one part of the data can strongly influence the whole. Putting the Splunk MLTK into action with a search that uses Random Forest and fit to find mismatches in our sample data set, we get a nearly 99% match, as only one prediction is wrong.
| inputlookup Drug_Trials1.csv | fit RandomForestClassifier "Effective" from "Age" "Daily_Dosage" "Immune_Deficient" "Overweight" "Sex" "Smoker" "Ventilator" into Trial_One_RF | rename predicted(Effective) as pred | where pred!=Effective
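The "divide and vote" idea behind a random forest can be sketched in miniature. This toy version uses single-split decision stumps instead of full trees (the MLTK's RandomForestClassifier builds real trees via scikit-learn under the hood), but the majority-vote mechanism, where no single tree dominates the outcome, is the same.

```python
import random

random.seed(0)

# Toy stand-in for a random forest: each "tree" is a decision stump
# built on one randomly chosen feature with a random threshold.
def make_stump():
    feature = random.choice(["Age", "Smoker"])
    if feature == "Age":
        t = random.randint(40, 60)  # random split point on Age
        return lambda r: "Yes" if r["Age"] < t else "No"
    return lambda r: "Yes" if r["Smoker"] == 0 else "No"

forest = [make_stump() for _ in range(25)]

# The forest's prediction is a majority vote over all stumps, so no
# single stump can strongly influence the result.
def predict(rec):
    votes = [stump(rec) for stump in forest]
    return "Yes" if votes.count("Yes") > len(votes) / 2 else "No"

print(predict({"Age": 30, "Smoker": 0}))  # young non-smoker -> Yes
print(predict({"Age": 70, "Smoker": 1}))  # older smoker -> No
```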
Now, we have two models, Trial_One_LR and Trial_One_RF. Let’s combine them into one search.
Combining Logistic Regression with Random Forest
In our models, we see that each model is effectively a series of weights assigned to fields (features) by its respective algorithm. These weights are used to predict future values of our Effective field.
Now, what if we combined both models and weighted the result of each to come up with a cumulative score? We’ll give Random Forest far more importance in our search by assigning it a value of 100, and give logistic regression a value of 10, every time each predicts that the drug is effective (Effective=Yes). We’ll also subtract 10% times the patient’s age from the total score, as we know beforehand, for some reason, that this drug may not be as effective for older patients. The total score when both models report Yes would be 100 + 10 - 0.1 * Age. We’ll then use a threshold to give us some confidence: if the score is greater than the threshold, we are confident the prediction is probably correct. In this case, we require the score (100 + 10 - 0.1 * Age) to be greater than 105, our threshold. Let’s see the Splunk search.
| inputlookup Drug_Trials2.csv | fields - Effective | apply Trial_One_LR as LogReg_Effective | apply Trial_One_RF as RF_Effective
| eval priorityscore_Effective = if(LogReg_Effective="Yes",10,0) + if(RF_Effective="Yes",100,0) - 0.1*'Age'
| sort - priorityscore_Effective | where priorityscore_Effective>105 | table priorityscore_Effective, LogReg_Effective, RF_Effective, "Age", "Daily_Dosage", "Immune_Deficient", "Overweight", "Sex", "Smoker", "Ventilator"
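The scoring logic in the eval line above is easy to verify by hand; here it is as a small Python function mirroring the same weights (100 for Random Forest, 10 for logistic regression, minus 0.1 per year of age) and the 105 threshold.

```python
def priority_score(logreg_pred, rf_pred, age):
    """Mirror of the SPL eval: Random Forest carries a weight of 100,
    logistic regression a weight of 10, minus 0.1 per year of age."""
    score = 10 if logreg_pred == "Yes" else 0
    score += 100 if rf_pred == "Yes" else 0
    score -= 0.1 * age
    return score

THRESHOLD = 105

# Both models agree on a 30-year-old: 100 + 10 - 3 = 107 -> confident.
print(priority_score("Yes", "Yes", 30) > THRESHOLD)  # True
# Both models agree on a 60-year-old: 100 + 10 - 6 = 104 -> rejected.
print(priority_score("Yes", "Yes", 60) > THRESHOLD)  # False
```

Note that with these weights, Random Forest alone can never clear the threshold (100 at most), so only patients both models agree on, and who are young enough, survive the `where` filter.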
Notice that because we are now relying on three factors (the two models plus the patient’s age), only 41 predictions in our totally random second data set clear the threshold as ones we think are correct. In a real trial that used accurate values for the model and more sample records, we would have gotten good results for future trial data, provided multiple fields did in fact influence the outcome.
By using classification algorithms, we can build models to predict the effectiveness of a drug based on the demographics of the patient. Even if this is just a paper exercise, it shows how to build confidence in the results when multiple features may be involved in the effectiveness of the drug. The ML approach provides scientific backing for the prediction and may give valuable insight into which fields influence the outcome. As I said in the beginning, mathematics is an exact discipline, but there are many other factors that lead to confident results. You can use your own data and the Splunk MLTK as another tool for your analysis within the context of your own methodology.