Predicting Missing Data

Teach Splunk to predict missing field values in your data! With the brand new Splunk Predict App, you can predict and fill in the values of missing fields, using training sets in which those fields have values. The app builds Naive Bayes models to predict field values. On some test sets, the model predicted values correctly 99.95% of the time or better.

  • From customers who filled in their gender, you can predict the gender of customers who did not, perhaps based on writing style, word choice, or other features.
  • From events that list a host name, you can predict the host name for events that are missing it.
  • From customers who explained why they unsubscribed from a mailing list, you can predict why others left even if they didn’t say why.

If you already have the actual field value, compare the predicted value against it to determine whether the actual value is unexpected. Does the event’s data look like it belongs in this source of data, or is it suspicious?

Suppose you have a dataset with missing or questionable values. You can now predict the missing values based on other values. For example, in human-entered data or social media data (e.g., Twitter), imagine predicting political or demographic information based on ZIP code, first name, salary, etc. Alternatively, you might have one dataset in which a field is filled out and another in which that field is missing or sporadic.

Lastly, you can use the Predict app for sentiment analysis. For example, you can have a small training set of emails, each marked up with “angry=10” or “angry=1”, and have it learn to recognize angry emails. Angry emails can get directly routed to a manager.

App Details

This app includes four search commands:

  • train to train the model to predict a field value
  • guess to fill in missing field values
  • reset to delete a trained model
  • icluster to cluster data based on its information similarity. Are two emails written by the same user, using different accounts?

For details on the parameters of each of these commands, see the search typeahead, which lists all the defaults. Make sure to click More in the typeahead instructions.


For example, to learn gender from names, you might train it with:

    gender=* | fields name, gender | train name2gender from gender

If you don’t limit the fields to “name” and “gender”, it will use all fields to predict gender. If you have an inkling of which fields can predict other fields, limit the fields accordingly; otherwise, don’t bother, and it will figure it out.

You can then have it predict “gender” for events that don’t have a gender field specified:

    * | guess name2gender into gender
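
The same train-then-guess pattern should carry over to the angry-email example mentioned earlier. The following is only a sketch: the model name email2angry is made up for illustration, and it assumes the emails in your training set already carry an “angry” field.

    angry=* | train email2angry from angry

    * | guess email2angry into angry | search angry=10

The final search keeps only the events whose “angry” value, marked up or predicted, is 10, which is the set you might route to a manager.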

As another example, you can predict the sourcetype from the _raw text of events. First train a model:

    index=_internal | train getsrctype from sourcetype

Then use that model to guess sourcetypes and compare them to the real sourcetype values to measure accuracy:

    index=_internal | rename sourcetype as real_sourcetype | fields real_sourcetype
    | guess getsrctype into sourcetype | top sourcetype,real_sourcetype
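
To boil that comparison down to a single number, you could append the standard eval and stats commands. This is a sketch that reuses the getsrctype model from above; the field names correct and accuracy are made up for illustration, and accuracy comes out as a fraction between 0 and 1:

    index=_internal | rename sourcetype as real_sourcetype | fields real_sourcetype
    | guess getsrctype into sourcetype
    | eval correct=if(sourcetype==real_sourcetype, 1, 0) | stats avg(correct) as accuracy

To list only the events whose predicted sourcetype disagrees with the real one, which may be the ones worth a look as suspicious, you could end the search with a where clause instead:

    index=_internal | rename sourcetype as real_sourcetype | fields real_sourcetype
    | guess getsrctype into sourcetype | where sourcetype!=real_sourcetype
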
Posted by David Carasso
