Let’s Talk About Text, Baby

The following is a guest post from Nathan Worsham, IS Security Administrator at Pinnacol Assurance.

Machine data is a private language all to itself and a tough nut to crack; luckily Splunk has made it accessible for almost everyone. Human or natural language has its own problems, however. It’s like a junk drawer filled with random objects (and some sort of sticky substance that’s best not to think about). Which is to say, natural language is especially difficult for machines to deal with on the best of days. 

The field of Natural Language Processing (or NLP) has had a resurgence in recent years, very much due to the rise of machine learning. NLP is a branch of artificial intelligence that attempts to use and understand language. Alternatively, text analytics encompasses varying techniques used to investigate, explore and break down data that is composed of text. The NLP Text Analytics app built on the Splunk Machine Learning Toolkit (MLTK) takes some of the processes that live under the text analytics umbrella and uses Splunk to carry out the work and display it. Although I should probably mention here that this umbrella is full of holes and you’ll probably get wet...

Of course, when you think of text data, you might not necessarily think of Splunk. However, Splunk supports natural language more than you might think. Possible sources could include service tickets, surveys, comments and really any sort of free text fields in web forms (e.g., text buried inside a log message). Using Splunk DB Connect is another great way to find other text sources that are seeking asylum inside of databases, bringing them into Splunk.

I work in cybersecurity, but I also have a master’s degree in data science. I built this app because of my interest in Splunk, machine learning and language (and I also needed a project to do for my master’s degree practicum). At the time, there didn’t seem to be enough tools in Splunk to deal with textual data and it felt like a ripe area to explore (see here for an in-depth write-up concerning the app’s creation). Since then, the app has continued to grow.

The NLP Text Analytics app is strawberry jam-packed with custom commands, additional machine learning algorithms not found in the Splunk MLTK (although they have been contributed to Splunk’s GitHub repo for sharing algorithms, mltk-algo-contrib), sample datasets and a series of dashboards. The design is meant to hold a user’s hand through the lifecycle of a data science project centered on text. Since Splunk is such a transparent platform, the user can easily see the SPL behind any process by simply clicking on the magnifying glass in the lower right corner of a panel. Dashboards are such a powerful part of Splunk, though they may not always get the recognition they deserve.

Text analytic processes that are covered in the app include:

  • Text Exploratory Data Analysis (EDA): Using lemmatization, ngrams and parts-of-speech tagging, simple word counts may help to extract or mine meaning and concepts out of the text.

  • Document Classification: Using machine learning algorithms to predict document target labels.

  • Named Entity Analysis: Using parts-of-speech tagging and concurrency to find named entity connections and relationships.

  • Sentiment Analysis: Using a rule-based (lexicon) sentiment analyzer—note that often sentiment analysis is performed with ML algorithms and treated as a classification problem.
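
To make the "ngrams plus simple word counts" idea from the EDA item a little more concrete, here is a minimal Python sketch. It is not the app's code; the `ngrams` helper and the two sample sentences are made up purely for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample documents, standing in for ticket or survey text.
docs = [
    "the server rejected the login request",
    "the login request timed out",
]

# Count bigrams (n=2) across all documents.
bigram_counts = Counter()
for doc in docs:
    tokens = doc.lower().split()  # naive whitespace tokenization
    bigram_counts.update(ngrams(tokens, 2))

# The most frequent pairs hint at recurring concepts in the text.
for pair, count in bigram_counts.most_common(2):
    print(" ".join(pair), count)
```

Counting pairs rather than single words keeps a little context around each word, which is often enough to surface phrases like "login request" that single word counts would split apart.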

The common denominator in all of these dashboards minus sentiment analysis is the cleantext command that comes with the app.

This command automates the process of (you guessed it) making the text squeaky clean. Using Python’s Natural Language Toolkit (NLTK), it performs a series of transformations on the text, many of which could certainly be accomplished in native Splunk already. However, the command simplifies the process and performs other operations that are non-native to Splunk, mainly parts-of-speech tagging and lemmatization. To lemmatize text means to break words down to their base form. A simple example would be that most plural words become their singular version. The command has much-o options, so I would recommend checking the documentation. As one example, the cleaned text is returned by default as a multivalued field (mv=true), which is good for use with the stats command. If you’re sending the result through the TFIDF preprocessor, though, you’ll want to set this to false.
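
For a rough feel of what this kind of cleaning involves, here is a toy Python sketch. To be clear, this is not the cleantext implementation and it does not use NLTK: `toy_clean`, `naive_base_form` and the tiny stopword list are all hypothetical, and the plural-stripping rule is only a crude stand-in for real lemmatization. The `mv` argument simply mimics the idea of returning a list of tokens versus a single string.

```python
import re

# Tiny illustrative stopword list (an assumption, not the app's list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "was", "were"}


def naive_base_form(word):
    """Crude stand-in for lemmatization: strip common plural endings.

    A real lemmatizer uses a dictionary and parts-of-speech tags; this
    rule will get plenty of words wrong (e.g. "status").
    """
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def toy_clean(text, mv=True):
    """Lowercase, keep alphabetic tokens, drop stopwords, reduce plurals.

    mv=True returns a list of tokens (handy for counting);
    mv=False returns one space-joined string (handy for a vectorizer).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    cleaned = [naive_base_form(t) for t in tokens if t not in STOPWORDS]
    return cleaned if mv else " ".join(cleaned)


sample = "The tickets were closed, and the policies updated!"
print(toy_clean(sample))
print(toy_clean(sample, mv=False))
```

The list form lends itself to per-token counting, while the single-string form is what a TF-IDF style vectorizer typically expects as input, which is the same trade-off the mv option exposes.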

The last point I want to close on is that text data is massive due to the tokenization of each word. Consider it fair warning that text operations are going to be slower than what you may be used to.

Finally, if you have an interest in text and Splunk, you should definitely check it out. I always welcome feedback from the community. Building Splunk apps is a great learning experience (and a lot of fun too), which I highly recommend. 

Nathan Worsham is a cybersecurity administrator, which gives him no right to be a data scientist or software developer, but he doesn’t let that stop him. He holds an undergraduate degree in music, a master’s in data science, and GSEC, GCFA and Splunk Admin certifications, as well as an unusual middle name. When he is not busy pushing Splunk on everyone he meets, his kids are busy ignoring him as he recites old Simpsons quotes.
