
January 03, 2013


Predicting Missing Data

By Splunk

Teach Splunk to predict missing field values in your data! With the brand new Splunk Predict App, you can predict and fill in the values of missing fields in your data, using training sets in which those fields are populated. The app builds Naive Bayes models to predict field values. In some test sets, the models often predicted values correctly at least 99.95% of the time.

From customers who fill out their gender, you can predict the gender of customers who have not, perhaps based on writing style, word choice, or other features.
From events that list a host name, you can predict the host name for events that are missing it.
From customers who explain why they unsubscribed from a mailing list, you can predict why others left even if they didn’t say.

If you already have the actual field value, compare the predicted value against it to determine whether the value is unexpected. Does the event’s data look like it belongs in this data source, or is it suspicious?
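
One way to picture this check, as a minimal Python sketch: given events that carry both a recorded value and a predicted one, keep the events where the two disagree. The event dictionaries, the `predicted_host` field name, and the `flag_unexpected` helper are hypothetical, introduced only for illustration; they are not part of the app.

```python
def flag_unexpected(events, field, predicted_field):
    """Return the events whose recorded value differs from the predicted value."""
    flagged = []
    for event in events:
        # Skip events that lack either value; there is nothing to compare.
        if field not in event or predicted_field not in event:
            continue
        if event[field] != event[predicted_field]:
            flagged.append(event)
    return flagged


# Hypothetical events: "host" is the recorded value, "predicted_host" the model's guess.
events = [
    {"host": "web01", "predicted_host": "web01"},
    {"host": "web02", "predicted_host": "db01"},
    {"predicted_host": "web01"},
]
suspicious = flag_unexpected(events, "host", "predicted_host")
```

Only the second event is flagged here; the third is skipped because it has no recorded host to compare against.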

Suppose you have a dataset with missing or questionable values. You can now predict the missing values based on other values. For example, in human-entered data or social media data (e.g., Twitter), imagine predicting political or demographic information based on ZIP code, first name, salary, etc. Alternatively, you may have one dataset in which a field is filled out and another in which that field is missing or sporadic.

Lastly, you can use the Predict app for sentiment analysis. For example, you can have a small training set of emails, each marked up with “angry=10” or “angry=1”, and have it learn to recognize angry emails. Angry emails can get directly routed to a manager.

App Details

This app includes four search commands:

train to train the model to predict a field value
guess to fill in missing field values
reset to delete a trained model
icluster to cluster data based on its information similarity. For example, are two emails written by the same user using different accounts?

For details on the parameters of each of these commands, see the search typeahead, which lists all the defaults. Make sure to click More in the typeahead instructions.

Examples

For example, to learn gender from names, you might train it with:

gender=* | fields name, gender | train name2gender from gender

If you don’t limit the fields to “name” and “gender”, it will use all fields to predict gender. If you have an inkling of which fields can predict other fields, limit the input to those; otherwise, don’t bother, and it will figure it out.

You can have it predict “gender” for events that don’t have a gender field specified.

* | guess name2gender into gender
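
The app's internals are not shown in this post. As a rough illustration of how a Naive Bayes classifier could learn a field like gender from names and then fill it in, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `train` and `guess` functions only borrow the command names, the features (the last one and two letters of the name) are an arbitrary choice, and the training names are made up.

```python
import math
from collections import Counter, defaultdict


def name_features(name):
    """Turn a name into simple features: its last letter and last two letters."""
    n = name.lower()
    return [f"last1={n[-1:]}", f"last2={n[-2:]}"]


def train(examples):
    """Count labels and per-label feature occurrences from (name, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for name, label in examples:
        label_counts[label] += 1
        for feat in name_features(name):
            feature_counts[label][feat] += 1
            vocab.add(feat)
    return label_counts, feature_counts, vocab


def guess(model, name):
    """Return the most probable label under Naive Bayes with add-one smoothing."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Start from the log prior, then add one log likelihood per feature.
        score = math.log(count / total)
        denom = sum(feature_counts[label].values()) + len(vocab)
        for feat in name_features(name):
            score += math.log((feature_counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: events that already have a gender value.
model = train([
    ("anna", "f"), ("maria", "f"), ("julia", "f"),
    ("john", "m"), ("mark", "m"), ("peter", "m"),
])

# Fill in the missing value for names that have no gender recorded.
filled = {name: guess(model, name) for name in ["sofia", "oliver"]}
```

Summing log probabilities rather than multiplying raw probabilities avoids numeric underflow when many features are used, and the add-one smoothing keeps a feature never seen with a label from zeroing out that label's score.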

As another example, predict the sourcetype from the _raw text of events. First, train a model:

index=_internal | train getsrctype from sourcetype

Then use that model to guess sourcetypes and compare each guess to the real sourcetype value to measure accuracy:

index=_internal | rename sourcetype as real_sourcetype | fields real_sourcetype | guess getsrctype into sourcetype | top sourcetype,real_sourcetype
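
To make the accuracy measurement concrete, the following Python sketch tallies (predicted, actual) pairs, much as the final `top` step ranks combinations of the two fields, and then computes the fraction of matching pairs. The `tally` helper and the sample pairs are hypothetical, made up for illustration; they are not output from the app.

```python
from collections import Counter


def tally(pairs):
    """Count (predicted, actual) pairs and compute the fraction that match."""
    counts = Counter(pairs)
    total = sum(counts.values())
    correct = sum(n for (predicted, actual), n in counts.items() if predicted == actual)
    accuracy = correct / total if total else 0.0
    # most_common() ranks the pairs from most to least frequent.
    return counts.most_common(), accuracy


# Hypothetical (predicted, actual) sourcetype pairs.
pairs = [
    ("splunkd", "splunkd"),
    ("splunkd", "splunkd"),
    ("splunkd_access", "splunkd_access"),
    ("splunkd", "splunk_web_access"),
]
ranked, accuracy = tally(pairs)
```

Rows in which the two values differ are the misclassifications; in this made-up sample, three of the four pairs match.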

----------------------------------------------------
Thanks!
David Carasso
