Over the years, some of you may have downloaded my reference implementation from Splunkbase for using scripted input to index RSS feeds, or have read about the topic. The idea is that this input is very low in daily volume (possibly KBs/day as opposed to MBs/day), yet it presents many different correlation opportunities from the same Splunk console. The original was written in Python and used the publicly available feedparser.py to download and parse the RSS feed. The issues I have heard over time are that some people are not allowed to install Python on a forwarder machine, have a version of Python that may not work with feedparser.py, or simply have issues with the return values and want to edit the code but do not know Python. In response to all this, I rewrote this scripted input from scratch as another reference implementation in Java that you can download from Splunkbase.
The only requirement for deploying this add-on is a Java runtime, version 1.6 or later. If you want to recompile the code, you’ll also need a JDK.
In this Java-based edition, everything about the scripted input RSS feed should be open source. First, I downloaded a third-party RSS feed parser from Subinkrishna G, as it was very easy to use. If you do not want to use this parser, you are free to plug in your own. Next, I wrote the code that uses the RSS parser to grab the RSS feeds, which the user lists in a file, every N minutes, and you are free to modify it as you please. Finally, since this was written from scratch with no existing users, I put in an environment variable that you can set to true in the calling shell script (or bat file on Windows) to check for duplicates.
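To make the moving parts concrete, here is a minimal sketch of how the start-up side of such a scripted input might read its settings. It is not the add-on's actual code: the variable name RSS_CHECK_DUPLICATES, the default file name feeds.txt, and the one-URL-per-line format with # comments are all assumptions made up for illustration. The sketch sticks to language features available in Java 1.6.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class RssInputConfig {
    // Hypothetical variable name; the add-on's real variable may differ.
    static final String DEDUP_ENV_VAR = "RSS_CHECK_DUPLICATES";

    // Duplicate checking is on only when the variable is set to "true".
    static boolean isDedupEnabled(String rawValue) {
        return rawValue != null && rawValue.trim().equalsIgnoreCase("true");
    }

    // Assumed file format: one feed URL per line, blank lines and # comments skipped.
    static List<String> parseFeedUrls(List<String> lines) {
        List<String> urls = new ArrayList<String>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.length() == 0 || trimmed.startsWith("#")) {
                continue;
            }
            urls.add(trimmed);
        }
        return urls;
    }

    // Reads the whole feed list file into memory; it is expected to be small.
    static List<String> readLines(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String feedsFile = args.length > 0 ? args[0] : "feeds.txt"; // assumed default name
        boolean dedup = isDedupEnabled(System.getenv(DEDUP_ENV_VAR));
        List<String> urls = parseFeedUrls(readLines(feedsFile));
        // Diagnostics go to stderr so that stdout carries only events to index.
        System.err.println("feeds=" + urls.size() + " dedup=" + dedup);
    }
}
```

With something like this in place, the calling shell script (or bat file) would only need to set the variable to true before launching Java, and Splunk's scripted input interval would take care of the "every N minutes" part.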
How does it check for duplicates? I used the Beta version of the Splunk Java SDK to check whether the link field delivered by the RSS feed already exists in the Splunk index within the last 24 hours. If it does not exist in Splunk, I send the RSS entry to standard output so that it gets indexed. This approach is probably not recommended in most use cases, but since the volume of RSS feeds is relatively low, doing a needle-in-the-haystack search over the last 24 hours, every hour or half hour, for about 50 to 100 entries is probably not going to slam the Splunk indexer(s). You are free to change the earliest search time line in the code to less than 24 hours. (I also adopted this approach in the existing Python RSS Scripted Input on Splunkbase, but since that code already had established users, I wrote a new program that checks for duplicates using the Splunk REST API, so as not to disturb existing deployments, and included it as part of the add-on.)
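The duplicate check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the add-on's code: the IndexLookup interface is a hypothetical stand-in for the search made through the Splunk Java SDK, InMemoryLookup is a stub that lets the sketch run without a Splunk server, and both the search string and the key=value event format are guesses at what such a check might use.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for "is this link already in the index for the last 24 hours?".
interface IndexLookup {
    boolean isIndexed(String link);
}

// In-memory stub for demonstration only; the add-on asks Splunk instead.
class InMemoryLookup implements IndexLookup {
    private final Set<String> links = new HashSet<String>();

    InMemoryLookup(String... knownLinks) {
        for (String link : knownLinks) {
            links.add(link);
        }
    }

    public boolean isIndexed(String link) {
        return links.contains(link);
    }
}

class RssEntry {
    final String title;
    final String link;

    RssEntry(String title, String link) {
        this.title = title;
        this.link = link;
    }
}

class DuplicateFilter {
    // Assumed search string; lower the earliest value to narrow the window.
    static String buildSearch(String link) {
        String escaped = link.replace("\\", "\\\\").replace("\"", "\\\"");
        return "search link=\"" + escaped + "\" earliest=-24h | head 1";
    }

    // Keeps only the entries whose link the lookup has not seen.
    static List<RssEntry> newEntries(List<RssEntry> fetched, IndexLookup lookup) {
        List<RssEntry> fresh = new ArrayList<RssEntry>();
        for (RssEntry entry : fetched) {
            if (!lookup.isIndexed(entry.link)) {
                fresh.add(entry);
            }
        }
        return fresh;
    }

    // Assumed event layout: key="value" pairs on a single line.
    static String toEvent(RssEntry entry) {
        return "title=\"" + entry.title + "\" link=\"" + entry.link + "\"";
    }

    public static void main(String[] args) {
        List<RssEntry> fetched = new ArrayList<RssEntry>();
        fetched.add(new RssEntry("Old post", "http://example.com/old"));
        fetched.add(new RssEntry("New post", "http://example.com/new"));
        IndexLookup lookup = new InMemoryLookup("http://example.com/old");
        // Whatever is written to standard output is what Splunk indexes.
        for (RssEntry entry : newEntries(fetched, lookup)) {
            System.out.println(toEvent(entry));
        }
    }
}
```

Keeping the lookup behind a small interface means the same filtering logic could be backed by the SDK, by the REST API, or by a stub for testing. The cost is one search per entry, which is tolerable only because the entry count per run is small.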
Speaking of the Splunk Java SDK, this is my first time using a Splunk SDK for an add-on since 2009, when I wrote the “Everybody Splunk with the SDK” article. Things have changed a lot in the Splunk world since then, and we now have a very active SDK development effort underway. When the Splunk Java SDK becomes GA, I will update this add-on with the new splunk.jar file to do the search.
I wrote this little rap three years ago about Everybody Splunk with the SDK and it really has remained unfinished. In closing, I would like to add to it.
Everyone say hey.
Find the needle in the hay.
Let Splunk show you the way.
Listen to your data.
Hear what it has to say.
Use the web, the CLI, or
the Splunk SDK.
Find the Gold in the bits.
No time for hissy fits.
Do it before it’s too late.