bash.org is a natural dataset for splunking. It’s a huge blob of loosely structured text data, and it’s made of win.
To play with a live instance, go to bash.splunklabs.com (login: guest, password: guest).
Of course, Splunk duplicates the functionality of the site itself. We can find, for example, the top 100 IRC quotes:
Splunk lets us do considerably more, though. What are the top one-liners?
How many more quotes mention “girlfriend” than “boyfriend”, i.e. exactly how bad is this sausage party?
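For a rough cross-check outside Splunk, the same question can be asked of the plain-text dump built further down. This is only a sketch: count_quotes_mentioning is a made-up helper name, it assumes every quote starts with a header line like "#12345 +(678)- [X]", and it matches substrings, so "girlfriends" counts as a hit for "girlfriend".

```shell
# Hypothetical helper: count quotes (not lines) that mention a word.
# $1 = word to look for (case-insensitive); the dump is read from stdin.
count_quotes_mentioning() {
  awk -v word="$1" '
    BEGIN { word = tolower(word) }
    # A header line closes the previous quote and starts a new one.
    /^[ \t]*#[0-9]+ \+\(/ { n += hit; hit = 0 }
    # Any line containing the word marks the current quote as a hit.
    index(tolower($0), word) { hit = 1 }
    END { print n + hit }
  '
}

# Usage, assuming the dump lives at /tmp/bash.txt:
#   count_quotes_mentioning girlfriend < /tmp/bash.txt
#   count_quotes_mentioning boyfriend  < /tmp/bash.txt
```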
Are there any commonly quoted individuals?
Are there any interesting trends in quote scores over time? Take a look at high quote scores vs. quote ID:
It seems likely that older quotes, especially good ones, benefit from disproportionately more views (the rich getting richer, so to speak); this might explain why the peaks in the low-quote-ID ranges are higher than the peaks for more recent quotes. Or maybe the internet just doesn’t produce the same quality of LOLs that it once did.
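The "peaks" comparison can also be approximated straight from the text dump, without Splunk. The sketch below is an assumption-laden stand-in for the chart: peak_score_by_bucket is a hypothetical helper, the bucket width is arbitrary, and it only looks at header lines of the form "#12345 +(678)- [X]".

```shell
# Hypothetical helper: print the highest score seen in each quote-ID bucket.
# $1 = bucket width in quote IDs; the dump is read from stdin.
peak_score_by_bucket() {
  # Pull "id score" pairs out of the header lines.
  sed -nE 's/.*#([0-9]+) \+\((-?[0-9]+)\)- \[X\].*/\1 \2/p' |
    awk -v width="$1" '
      {
        b = int($1 / width) * width              # lower edge of the bucket
        s = $2 + 0                               # force a numeric score
        if (!(b in peak) || s > peak[b]) peak[b] = s
      }
      END { for (b in peak) print b, peak[b] }
    ' | sort -n
}

# Usage, assuming the dump lives at /tmp/bash.txt:
#   peak_score_by_bucket 100000 < /tmp/bash.txt
```

Each output line is a bucket's starting quote ID followed by the best score in that bucket, so falling peaks show up as a shrinking second column.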
To try this yourself, add the following to props.conf:
[bash]
BREAK_ONLY_BEFORE = (#[0-9]* \+)|([0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+)
REPORT-bash = bash
and the following to transforms.conf:
[bash]
REGEX = #([0-9]+) \+\((-?[0-9]+)\)- \[X\]
FORMAT = $0 bash_quote_id::$1 bash_quote_score::$2
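To see what that REGEX captures, here is a quick sketch that applies the same pattern with sed. The helper name bash_header_fields and the sample header lines are made up for illustration; the first capture group corresponds to bash_quote_id and the second to bash_quote_score.

```shell
# Hypothetical helper: apply the transforms.conf pattern to lines on stdin
# and print "id score" for every header line that matches.
bash_header_fields() {
  sed -nE 's/.*#([0-9]+) \+\((-?[0-9]+)\)- \[X\].*/\1 \2/p'
}

# Made-up sample headers in the format the lynx dump is assumed to produce.
printf '%s\n' '#12345 +(678)- [X]' | bash_header_fields   # prints: 12345 678
printf '%s\n' '#99 +(-5)- [X]'     | bash_header_fields   # prints: 99 -5
```

The optional minus sign in the second group is what lets negative scores come through.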
Then, get a static copy of bash.org. You can grab the one I’ve created here, or you can generate it yourself:
$ curl -o '#1.html' 'http://bash.org/?browse&p=[001-409]'
$ for cur in *.html ; do lynx -dump -nonumbers "./$cur" >> /tmp/bash.txt ; done
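Before loading the file, it can be worth checking how many quote headers the dump actually contains. A minimal sketch, assuming headers look like what the first alternative of BREAK_ONLY_BEFORE matches; count_quote_headers is a made-up name.

```shell
# Hypothetical helper: count lines on stdin that match the first
# alternative of the BREAK_ONLY_BEFORE pattern.
count_quote_headers() {
  grep -cE '#[0-9]* \+'
}

# Usage, assuming the dump lives at /tmp/bash.txt:
#   count_quote_headers < /tmp/bash.txt
```

The number should land in the same ballpark as the event count Splunk reports for the bash sourcetype. It need not match exactly, since the second alternative of the pattern can trigger breaks as well.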
Finally, push the data into Splunk:
$ splunk add tail -source /tmp/bash.txt -sourcetype bash