Splunkbot: Node.js, IRC, and the JavaScript SDK for Fun and Profit pt. 1

Okay, so possibly not so much for profit, but I have developed a really cool IRC bot that logs to Splunk and provides a web interface for searching those logs.  Several weeks before I started with Splunk, I was winding down my old job and looking for something I could do that was Splunk related.  I’m a total IRC geek.  I’ve been hanging out on #macintosh on Undernet for over 17 years, and I discovered while browsing Splunk’s website that there was an official IRC channel for Splunk, #splunk on EFnet.  I spent the first week lurking and slowly introducing myself to the regulars, and then since I had some downtime, the idea occurred to me that it would be really fun to take the traffic from the IRC channel, log it into Splunk, and provide a web GUI to search that traffic.

I had also been reading a lot of buzz about Node.js and event-loop, non-blocking programming, and given that Node had been mentioned numerous times as an excellent way to implement networked applications, I figured I should implement my project in Node.  Splunk at that time had also recently released a preview of our JavaScript SDK, so all the stars seemed to be aligned to learn a new technology and play with some new functionality from Splunk.

First things first, let’s get some data

So, before I could develop the awesome Web GUI for displaying IRC logs that was lurking in my head, I needed to get data in.  To do that, I needed a resident IRC program which would take all the IRC protocol traffic and translate it into a format that would easily allow Splunk to provide field extractions.  Luckily, Node has a rich environment where the community contributes back open source libraries which are easily installed via a package manager called npm.  In npm, I found an excellent IRC library that cut down development time significantly.  The IRC protocol is relatively trivial to implement, but still, having someone build it and prove it in advance is certainly easier.
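To give a feel for what "translating IRC traffic into a Splunk-friendly format" involves, here is a minimal sketch that parses one raw IRC protocol line by hand and flattens it into an event object.  This is not Splunkbot's actual code (Splunkbot uses an IRC library from npm); the function names parseIrcLine and toEvent are hypothetical, and the field names (action, nick, to, text) are chosen to line up with the fields used in the searches in this post.

```javascript
// Sketch only: parse a raw IRC line such as
//   ":nick!user@host PRIVMSG #channel :some text"
// Function and field names are illustrative assumptions.
function parseIrcLine(line) {
  let rest = line.trim();
  let nick = null;

  // Optional prefix ":nick!user@host " identifies the sender.
  if (rest.startsWith(':')) {
    const end = rest.indexOf(' ');
    nick = rest.slice(1, end).split('!')[0];
    rest = rest.slice(end + 1);
  }

  // The trailing parameter starts with " :" and may contain spaces.
  let trailing = null;
  const trailingAt = rest.indexOf(' :');
  if (trailingAt !== -1) {
    trailing = rest.slice(trailingAt + 2);
    rest = rest.slice(0, trailingAt);
  }

  const [command, ...params] = rest.split(' ').filter(Boolean);
  return { command, nick, params, trailing };
}

// Flatten a parsed line into a simple object that is easy to log.
function toEvent(line) {
  const msg = parseIrcLine(line);
  if (!msg.command) return null; // empty or unparseable line
  if (msg.command === 'PRIVMSG') {
    return { action: 'message', nick: msg.nick, to: msg.params[0], text: msg.trailing };
  }
  return {
    action: msg.command.toLowerCase(),
    nick: msg.nick,
    params: msg.params,
    text: msg.trailing,
  };
}
```

Calling toEvent on a channel message yields a flat object with one key per field, which is the shape that makes field extraction on the Splunk side straightforward.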

Once I could read IRC properly, it was time to get the data into Splunk.  Since I work with both Splunk Enterprise and Splunk Storm, I implemented logging for both.  For Enterprise, I log via a syslog-like TCP connection, and for Storm I log via a REST-based output.  For both, I log in JSON format.  Originally, I logged in an autokv-friendly key=value format, but since I’m logging user-generated data, escaping quotes and commas was very difficult, and autokv didn’t handle the escaped values well.  Since JSON serialization and deserialization take care of escaping, it was easiest to log in JSON format.  (Screenshot from Storm showing how the data shows up in Splunk.)
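The escaping problem is easy to see in a few lines of Node.  The sketch below is an illustration rather than Splunkbot's real logging code: toKeyValue, toJsonLine and sendEvent are hypothetical names, and the host and port passed to sendEvent would be whatever TCP input you have configured in Splunk.

```javascript
const net = require('net');

// Naive key=value formatting: a quote or comma inside the text
// makes the field boundaries ambiguous.
function toKeyValue(event) {
  return Object.entries(event)
    .map(([key, value]) => `${key}="${value}"`)
    .join(', ');
}

// JSON formatting: JSON.stringify escapes quotes, backslashes and newlines,
// so the event can be parsed back exactly.
function toJsonLine(event) {
  return JSON.stringify(event) + '\n';
}

// Hypothetical sender: writes one JSON event per line to a TCP input.
function sendEvent(event, host, port) {
  const socket = net.connect(port, host, () => {
    socket.end(toJsonLine(event));
  });
  socket.on('error', (err) => console.error('log send failed:', err.message));
  return socket;
}

const event = {
  action: 'message',
  nick: 'Coccyx',
  to: '#splunk',
  text: 'he said "hi", then left',
};
console.log(toKeyValue(event)); // the embedded quotes and comma are ambiguous
console.log(toJsonLine(event)); // the embedded quotes are escaped
```

Opening a new connection per event keeps the sketch short; a long-running bot would more likely keep one socket open and reconnect on error.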

You can find the library for logging to Storm here, and it can easily be adapted to log to Splunk’s receivers/simple REST endpoint as well (they are basically identical).  This is very nice because you can log straight to a Splunk indexer and specify index, host, source and sourcetype when you log, without having to configure Splunk to assign them.  Also, here is the logging library which logs syslog from Node to Splunk, in case that might be useful for your project.
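As a rough idea of what logging to receivers/simple looks like, the sketch below builds an HTTPS POST whose query string carries the index, host, source and sourcetype.  It is not the library linked above: buildSimpleReceiverRequest and postEvent are hypothetical helpers, and the hostname, port, credentials and metadata values you pass in are placeholders for your own deployment.

```javascript
const https = require('https');

// Build request options for the receivers/simple endpoint.
// The metadata travels in the query string, so Splunk needs no
// extra configuration to assign it.
function buildSimpleReceiverRequest({ host, port, auth, meta }) {
  const query = new URLSearchParams(meta).toString();
  return {
    method: 'POST',
    host,
    port,
    path: `/services/receivers/simple?${query}`,
    auth, // "user:password", sent as HTTP basic auth
  };
}

// Send one event as the request body and report the HTTP status code.
function postEvent(options, event, callback) {
  const req = https.request(options, (res) => {
    res.resume(); // discard the response body
    res.on('end', () => callback(null, res.statusCode));
  });
  req.on('error', callback);
  req.end(JSON.stringify(event));
}

// Placeholder values, for illustration only.
const options = buildSimpleReceiverRequest({
  host: 'splunk.example.com',
  port: 8089,
  auth: 'admin:changeme',
  meta: {
    index: 'splunkbot',
    host: 'ircbot',
    source: 'splunkbot',
    sourcetype: 'splunkbot_logs',
  },
});
```

Keeping the option building separate from the network call makes it easy to inspect the request path before anything is sent.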

What can we learn

With the data in Splunk, what can we learn?  I set out to create some simple dashboards that would tell us things we didn’t previously know about the IRC channel:

The first widget is a very simple timechart of activity.  It told us something those of us who frequent the channel already know: we’re primarily active during work hours.  The other two charts are more interesting, though.  The first of them tells us who is the most active, using this search:

index=splunkbot sourcetype=splunkbot_logs | spath | search to=#* | top nick

Note the spath command: this is what allows Splunk to extract the fields from the JSON-formatted data.  In most cases, you’ll probably want to tell spath which fields to extract, but in this case I want to extract everything, so I specify no fields.  The search command that follows keeps only traffic destined for IRC channels (channel names here begin with #), and top then gives us the most active nicks.
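If you think better in code than in search pipelines, the filtering and counting half of that search can be approximated in a few lines of JavaScript.  This is only an analogy to help read the search, not how Splunk implements it; topNicks is a hypothetical function, and the sample events are made up.

```javascript
// Rough equivalent of: search to=#* | top nick
function topNicks(events, limit = 10) {
  const counts = new Map();
  for (const e of events) {
    // to=#* : keep only events addressed to a channel
    if (typeof e.to === 'string' && e.to.startsWith('#')) {
      counts.set(e.nick, (counts.get(e.nick) || 0) + 1);
    }
  }
  const total = [...counts.values()].reduce((a, b) => a + b, 0);
  // top: most frequent values first, with a count and a percentage
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([nick, count]) => ({ nick, count, percent: (100 * count) / total }));
}

// Made-up sample events.
const sample = [
  { nick: 'Coccyx', to: '#splunk', text: 'morning all' },
  { nick: 'alice', to: '#splunk', text: 'hi' },
  { nick: 'Coccyx', to: '#splunk', text: 'anyone around?' },
  { nick: 'bob', to: 'Coccyx', text: 'private message, not counted' },
];
console.log(topNicks(sample));
```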

The second of those charts gives us the most mentioned people on the channel.  On IRC, in a channel that often has multiple conversations going at once, it’s common to prefix your message with the nickname of the person you’re addressing, like “Coccyx: I think you’re the most awesome Splunker ever!”  This search looks for people’s nicknames in the channel text and builds a chart of the most referenced nicknames:

index=splunkbot sourcetype=splunkbot_logs | spath | search action=message | rex field=text mode=sed "s/://g" | rex field=text mode=sed "s/,//g" | makemv delim=" " text | mvexpand text | rename text as nick| join nick [ search index="*" sourcetype="splunkbot_logs" action=names | makemv delim=" " names | mvexpand names | rename names as nick ] | top nick

This search is pretty complicated, so it’s useful to break it down.  The two rex commands use mode=sed, which allows rex to replace text in a field based on a regular expression.  Here, we’re removing the , and : characters, which are commonly appended to a nick to indicate we’re addressing that person, as in the earlier example.  Those would mess up our matches, so we delete all instances of them.  The next command, makemv, takes the text field, which in our logs contains the text of IRC messages Splunkbot has logged, and turns it into a multi-value field.  This breaks the text up into tokens that we can match, some of which we hope are nicknames.  The mvexpand command takes a multi-value field and makes a new event for every value in it.  Next we rename the text field to nick so that it has the same name as the field produced by the subsearch we’re about to run.

Now we come to a command I don’t often see used, but which is very powerful: join.  The join command works like you’d expect a join to work in SQL, taking two searches and joining them together on one or more matching fields.  In this case, we want join to work like an inner join: it takes our main search, which thanks to the commands above contains an event for every word spoken in the channel over the last 7 days, and keeps only the events that match the subsearch.  The subsearch references a particular log entry Splunkbot makes, the names entry, which it gets on a regular basis as people join and leave the channel.  This gives us a list of nicknames to match against.  Finally, we pipe the results of the join to top to get the most referenced nicknames.
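To make the whole pipeline concrete, here is the same logic written as plain JavaScript over an array of logged events.  It is a sketch for reading the search, not a description of how Splunk executes it: mostMentioned is a hypothetical function, the events are made up, and it assumes the names field is a space-separated string, as the makemv in the subsearch implies.

```javascript
// Sketch of the "most mentioned" search, step by step.
function mostMentioned(events, limit = 10) {
  // Subsearch: collect every nick that appears in an action=names event.
  const known = new Set();
  for (const e of events) {
    if (e.action === 'names') {
      for (const name of e.names.split(' ').filter(Boolean)) known.add(name);
    }
  }

  const counts = new Map();
  for (const e of events) {
    if (e.action !== 'message') continue;
    // rex mode=sed: delete every ":" and ","
    const cleaned = e.text.replace(/:/g, '').replace(/,/g, '');
    // makemv + mvexpand: one token per word
    for (const token of cleaned.split(' ').filter(Boolean)) {
      // inner join on nick: keep only tokens that are known nicknames
      if (known.has(token)) counts.set(token, (counts.get(token) || 0) + 1);
    }
  }

  // top nick: most frequent first
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([nick, count]) => ({ nick, count }));
}

// Made-up sample events.
const logged = [
  { action: 'names', names: 'Coccyx alice bob' },
  { action: 'message', nick: 'alice', text: 'Coccyx: I think you are right' },
  { action: 'message', nick: 'bob', text: 'alice, Coccyx said the same thing' },
];
console.log(mostMentioned(logged));
```

Using a Set for the known nicknames mirrors the inner join: a word only survives if it also appears on the subsearch side, and ordinary words like "think" or "right" simply drop out.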

Wrapping up

So, getting our IRC log data into Splunk has already provided some interesting insights.  Next week, I’ll cover using the JavaScript SDK to provide a really killer web interface to Splunkbot which can be accessed by anyone, not just those who have access to Splunk.  All the code for Splunkbot can be found on GitHub, so feel free to head over there and browse through the code for ideas about your own projects.

Posted by Clint Sharp

