Machine Learning To Predict Drug Effectiveness

Simple Machine Learning to Predict Drug Effectiveness (Based on Patient Demographics)

Note: The inspiration for this blog entry comes from the Splunk ML team, who wrote a great document on how to demo the Splunk MLTK. This approach is based on their findings.

Introduction

A couple of years ago, I was tinkering with a Splunk Machine Learning Toolkit demo when it dawned on me that the simple but powerful ideas behind logistic regression and random forest algorithms can predict many categorical outcomes without requiring a Ph.D. in data science. As healthcare was, and still is, on everyone’s minds, I chose to predict which fields (features) may contribute to the results of drug trials using a machine learning approach. Here’s the disclaimer: neither I nor Splunk claims expertise in running drug trials or interpreting their results. Mathematics is itself a precise discipline, but the inclusion or exclusion of data sets, the inclusion or exclusion of fields in the data, bias in the data itself, and flawed analysis methodologies are beyond the scope of this blog entry. Use this blog entry as an idea for testing your dataset for predictability, but only within the context of your own proven methodology.

Now that the disclaimer is out of the way, I’ll mention that this blog entry highlights a simple approach to predicting the effectiveness of a drug in clinical patient trials, using sample data in CSV format and the Splunk Machine Learning Toolkit. It is assumed that you have access to relevant data and the Splunk MLTK to replicate the results here. If you have not used the Splunk MLTK, here’s some background information to get you started.

For the purposes of this example, I have used the following fields to represent patient data. Of course, in real life, there may be many more features. Please also note that all data presented here is strictly fabricated for the purpose of illustration.

Effective,Age,Smoker,Ventilator,Overweight,Sex,Daily_Dosage,Immune_Deficient
No,70,Yes,Yes,Yes,M,5,Yes
Yes,21,No,No,No,F,15,Yes
Yes,66,No,Yes,No,M,5,Yes

The Effective field is the actual result of whether the drug worked. What our fictitious, but educational, trial wants to see is which fields in the data help predict whether the drug will be effective. By effective, I mean that a positive treatment result is achieved. Daily_Dosage is an artificial dosage given in mg. All other fields, besides Age, are binary. What we are interested in doing is classifying the Effective field as Yes or No based on the values of the other fields. In my sample data, I purposely skewed the sample set so that Effective equals Yes when the patient is young (less than 50) and a non-smoker. All other fields have random values. In real life, the data would not be skewed this deliberately, because the results would be based on actual observations.
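Before fitting any model, it’s worth confirming that the skew is really there. Here’s a minimal sketch, assuming the sample data lives in a lookup named Drug_Trials1.csv as used throughout this post, that tabulates outcomes against the two skewed features:

| inputlookup Drug_Trials1.csv | eval Young=if(Age<50,"Yes","No") | stats count by Effective Young Smoker | sort - count

If the skew is what we intended, the largest buckets will be Effective=Yes for young non-smokers and Effective=No everywhere else.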

Logistic Regression

The first thing I did was use logistic regression to create a model to predict the Effective field’s value, using 80 percent of the data to build the model and 20 percent to test it. I kept the data set to 100 records to make it easy to follow. Here are the results from the Experiment view of the Splunk MLTK:
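The Experiment view handles the train/test split for you, but you can approximate the same 80/20 split in plain SPL. This is a sketch using the MLTK sample command; I’m assuming its default behavior of adding a partition_number field when partitions is set, so verify against your MLTK version:

| inputlookup Drug_Trials1.csv | sample partitions=5 | where partition_number < 4 | fit LogisticRegression fit_intercept=true "Effective" from "Age" "Daily_Dosage" "Immune_Deficient" "Overweight" "Sex" "Smoker" "Ventilator" into Trial_One_LR

Partitions 0 through 3 (roughly 80 percent of the records) train the model; the held-out partition can then be scored with apply to measure test accuracy.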

Although all the fields were used, you’ll notice roughly 93% precision in the results because, as I mentioned earlier, the data is skewed to align with the Age and Smoker fields. If we run a search with our fit command to see how many times the Effective field’s value did not match the predicted(Effective) field’s value, we’ll see that it failed only 10 times out of 100 records, which is in line with our precision number.

| inputlookup Drug_Trials1.csv | fit LogisticRegression fit_intercept=true "Effective" from "Age" "Daily_Dosage" "Immune_Deficient" "Overweight" "Sex" "Smoker" "Ventilator" into Trial_One_LR | rename "predicted(Effective)" as pred | where pred!=Effective
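To turn that mismatch list into a single accuracy number, you can score the whole data set and aggregate. Here’s a minimal sketch, assuming Trial_One_LR has already been fitted as above:

| inputlookup Drug_Trials1.csv | apply Trial_One_LR | rename "predicted(Effective)" as pred | eval match=if(pred=Effective,1,0) | stats sum(match) as correct count as total | eval accuracy_pct=round(100*correct/total,1)

With 90 of 100 predictions matching, accuracy_pct comes out to 90, consistent with the 10 mismatches above.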

Now, let’s apply this model to a randomly generated data set with 100 records. Because there is no rhyme or reason to this new data set, the number of mismatches will be large, around 50%.

| inputlookup Drug_Trials2.csv | apply Trial_One_LR | rename "predicted(Effective)" as pred | where pred!=Effective | fields - Effective

Although the prediction on this “non-sample” data is only about 50% correct, it may still be useful in times of epidemics, when even a chance that this particular drug is effective is worth knowing.

Random Forest

A random forest algorithm can predict a categorical value with higher precision than logistic regression (depending on the data size) because it randomly divides the data across many decision trees and combines their results, so that no one part of the data can strongly influence the whole. Putting the Splunk MLTK into action with a search that uses RandomForestClassifier and fit to find the mismatches in our sample data set, we get a nearly 99% match, as only one prediction is wrong.

| inputlookup Drug_Trials1.csv | fit RandomForestClassifier "Effective" from "Age" "Daily_Dosage" "Immune_Deficient" "Overweight" "Sex" "Smoker" "Ventilator" into Trial_One_RF | rename "predicted(Effective)" as pred | where pred!=Effective

Now, we have two models, Trial_One_LR and Trial_One_RF. Let’s combine them into one search.

Combining Logistic Regression with Random Forest

In our models, each model is effectively a series of weights assigned to the fields (features) by its respective algorithm. These weights are used to predict future values of our Effective field.
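If you want to inspect those weights directly, the MLTK summary command reports what a saved model learned. Here’s a minimal sketch; for a LogisticRegression model it returns per-feature coefficients, while the output for tree-based models varies by MLTK version, so check your toolkit’s documentation:

| summary Trial_One_LR

The sign and magnitude of each coefficient hint at which features push a prediction toward Yes and which push it toward No.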

Now, what if we combined both models and weighted the results of each model to produce a cumulative score? We’ll give random forest much more importance in our search by assigning it a value of 100, and give the logistic regression result a value of 10, every time each predicts that the drug is effective (Effective=Yes). We’ll also subtract 0.1 times the patient’s age from the total score, as we know beforehand, for some reason, that this drug may not be as effective for older patients. The total score when both models report Yes would be 100 + 10 - 0.1*Age. We’ll then use a threshold to add some confidence, so that if our score is greater than the threshold, we are confident the prediction is probably correct. In this case, we require the score (100 + 10 - 0.1*Age) to be greater than 105, our threshold; with both models agreeing, that works out to patients younger than 50. Let’s see the Splunk search.

| inputlookup Drug_Trials2.csv | fields - Effective | apply Trial_One_LR as LogReg_Effective | apply Trial_One_RF as RF_Effective

| eval priorityscore_Effective = if(LogReg_Effective="Yes",10,0) + if(RF_Effective="Yes",100,0) - 0.1*'Age'

| sort - priorityscore_Effective | where priorityscore_Effective>105 | table priorityscore_Effective LogReg_Effective RF_Effective "Age" "Daily_Dosage" "Immune_Deficient" "Overweight" "Sex" "Smoker" "Ventilator"

Notice that because we are now relying on three factors (two models plus the patient’s age), only 41 predictions in our totally random second data set clear the threshold. In a real trial that used accurate values for the model and more sample records, this approach would yield good results on future trial data, provided multiple fields did in fact influence the outcome.
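To reproduce that count, append a stats command to the combined search. Here’s a minimal sketch using the same score and threshold as above:

| inputlookup Drug_Trials2.csv | fields - Effective | apply Trial_One_LR as LogReg_Effective | apply Trial_One_RF as RF_Effective | eval priorityscore_Effective = if(LogReg_Effective="Yes",10,0) + if(RF_Effective="Yes",100,0) - 0.1*Age | where priorityscore_Effective>105 | stats count as confident_predictions

Raising the 105 threshold trades coverage for confidence: a higher threshold keeps only the younger patients on whom both models agree.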

Conclusion

By using classification algorithms, we can build models to predict the effectiveness of a drug based on the demographics of the patient. Even if this is just a paper exercise, it adds confidence to the results when multiple features may be involved in the effectiveness of the drug. The ML approach provides scientific backing for the prediction and may offer valuable insight into which fields influence the outcome. As I said in the beginning, mathematics is a precise discipline, but there are many other factors that lead to confident results. You can use your own data and the Splunk MLTK as another tool for your analysis, in the context of your own methodology.
