It’s time for a Boxee-ing match with Splunk!

And now for something completely different! In working with some interesting data generated by the Boxee media center software, I found that we could use Splunk as a “Ratings Reporting Engine”. Additionally, as Boxee is open source, I thought it might be handy to give their developers realtime access to my log data as it’s being generated.
Background: Boxee is a cool, open source media center software package that runs on AppleTV, Linux, MacOS X & Windows (soon). It allows you to watch movies, internet video content, even Netflix. Boxee itself generates some interesting log data. Boxee also allows a viewer to automatically send a message to Twitter when a program is being viewed.

What could we do with this?

Using Boxee’s own Logs:

  •   Detect errors so that the developers can see them live
  •   Calculate viewing duration on a local Boxee instance and some other cool reports

Using information in Twitter:

  •   Create reports that show most watched shows and most active users

     View all of this live right now in my public Splunk server

Local Boxee Logs 

In my setup, I have Boxee running on AppleTV. Splunk is also running on AppleTV. Splunk monitors the data and forwards the logs up to my public Splunk server over TCP. Send yours up if you want! When I looked at the Boxee log data in Splunk, there were a few events that piqued my interest.

When the “DVDplayer” program opens a file to be viewed, it records an event, and the same goes for when it closes a file. Hmm... Makes me think that by using Splunk’s “transaction” search operator, I could tie them together AND calculate the duration of viewing. Smells kinda fun. How does that work?
Here’s the search command that’ll make this one work:

dvdplayer (opening OR "closing video") NOT SQLite | rex "Downloads\/Boxee\/(?<title>[^\/]+)\/" | rex "Movies\/(?<title>[^\.]+)\." | rex "file\/get\/(?<title>[^\.]+)\." | transaction startswith="eventtype=\"open-movie\"" endswith="eventtype=\"movie-closing\"" maxpause=-1 maxspan=-1 | eval duration = duration / 60 | timechart max(duration) by title usenull=f


The search relies on two event types, so create them first:

  1. Create an event type called “open-movie” for any events that match this search: “search = dvdplayer opening”
  2. Create an event type called “movie-closing” for any events that match this search: “search = dvdplayer closing NOT audio”


In English it is: “Find the dvdplayer opening or closing events, and get rid of the ones that have SQLite in them, because there are some errors happening, (pipe to rex) to extract the title of the program from the filename, (pipe to rex) to get more program titles because I have movies in two different directories (yeah, you can overload a field), (then pipe it to “transaction”), define the transaction as beginning with the event type “open-movie” and ending with the event type “movie-closing”, setting the pause and span as “-1” so built-in rules don’t get in the way. Transaction will create a duration (showing number of seconds), so we’d better divide that by 60 so we can get it in “minute resolution”, and then (pipe to “timechart”) to look at the maximum duration viewed by title.” This way, we’ll know what movies are popular locally, even if they’re watched multiple times. (Breathe, you weren’t supposed to repeat that whole paragraph in one breath!)
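For anyone who wants to see the open/close pairing idea outside of Splunk, here is a minimal Python sketch. It is not how Splunk implements “transaction”; it only illustrates the logic described above. The event tuples, dates and movie names are made up for illustration, and the title is assumed to come from the “open” event.

```python
from datetime import datetime

# Hypothetical (timestamp, kind, title) tuples standing in for the Boxee
# events after field extraction; the real log lines look different.
events = [
    (datetime(2009, 1, 10, 20, 0, 0), "open", "Movie A"),
    (datetime(2009, 1, 10, 21, 30, 0), "close", "Movie A"),
    (datetime(2009, 1, 11, 19, 0, 0), "open", "Movie B"),
    (datetime(2009, 1, 11, 19, 45, 0), "close", "Movie B"),
]


def pair_transactions(events):
    """Pair each "open" event with the next "close" event, in time order."""
    transactions = []
    start = None
    for timestamp, kind, title in sorted(events):
        if kind == "open":
            start = (timestamp, title)
        elif kind == "close" and start is not None:
            opened_at, opened_title = start
            # Duration in seconds, divided by 60 for "minute resolution".
            minutes = (timestamp - opened_at).total_seconds() / 60
            transactions.append((opened_title, minutes))
            start = None
    return transactions


def max_minutes_by_title(transactions):
    """Keep the longest viewing per title, like max(duration) by title."""
    longest = {}
    for title, minutes in transactions:
        longest[title] = max(minutes, longest.get(title, 0.0))
    return longest


viewing = max_minutes_by_title(pair_transactions(events))
print(viewing)
```

A close event with no preceding open is simply ignored here, which is a simplification chosen for the sketch.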

Additionally, I created some reports that will allow the open source developers of Boxee to look at “where the errors are coming from”. I extract some info from the events:

error OR failed OR severe | rex "ERROR: (?<error_source>[^\:]+)\:" | rex "ERROR: \[(?<error_source>[^\]]+)\]" | top limit=10 error_source

In English: find errors, (pipe to rex) to create a field called “error_source” (do it again because there are two types of errors in Boxee), then (pipe) to a top graph by error source, and then save it to a dashboard, but display it as “TABLE”. Kinda handy so the devs can see that most of my errors come from some “CGUIBoxeeViewState” objects. The SQLite errors are also quite annoying.
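The same “two patterns, one field, then count” idea can be sketched in Python. The log lines below are invented for illustration; only the “ERROR: name:” and “ERROR: [name]” shapes come from the search above. The sketch assumes the second pattern replaces the field when both match, which is how the “overloaded” field is treated here.

```python
import re
from collections import Counter

# Hypothetical log lines, made up for this sketch.
log_lines = [
    "12:00:01 ERROR: CGUIBoxeeViewState: could not load view",
    "12:00:02 ERROR: [SQLite] query failed",
    "12:00:03 ERROR: CGUIBoxeeViewState: could not load view",
    "12:00:04 INFO: playback started",
]

COLON_STYLE = re.compile(r"ERROR: (?P<error_source>[^:]+):")
BRACKET_STYLE = re.compile(r"ERROR: \[(?P<error_source>[^\]]+)\]")


def error_source(line):
    """Return the error source for a line, or None if neither pattern fits."""
    source = None
    for pattern in (COLON_STYLE, BRACKET_STYLE):
        match = pattern.search(line)
        if match:
            # Assumption: a later pattern overrides an earlier one.
            source = match.group("error_source")
    return source


sources = [error_source(line) for line in log_lines]
# Count only lines where a source was found, then keep the ten most common.
top_sources = Counter(s for s in sources if s is not None).most_common(10)
print(top_sources)
```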

Boxee Data on Twitter

If you’re asking yourself “what’s Twitter”, you are clearly not hip enough to be using “rex” or “transaction”. Assuming you already know what it is, I’ll bet you didn’t know Twitter has a search engine (they bought it from Summize). Twitter Search indexes all “Tweets” and lets you retrieve results. Why? Well, if you don’t listen to what people are saying publicly, you should start! What are people saying about Splunk right now? See, that’s why Twitter is so valuable (not the “I’m sitting down to have Sabra with Amrit & David” posts most people do).

You can set up Boxee to “tweet” what you’re watching, and when you do, this happens:

 A message like this is posted to Twitter:   “jlarkins: watching Inherit the Halibut on Boxee” – about 1 hour ago.    

Pretty simple, and they’re all like that. Every message has the word “watching” followed by the title, followed by “on Boxee”. It also has a timestamp, which Splunk really likes. If we run a search on Twitter and ask it for “watching * on boxee”, we should get nearly all of those messages. Notice in the upper right of the Twitter Search page, there’s a “feed for this query” link. If we run this search*+on+boxee  we’ll get back an Atom feed, which is like RSS but technically better. (Follow me kids, this is going somewhere cool.)

 The results of that search yield an Atom feed with XML for every Twitter message that looks like this:

<link type="text/html" rel="alternate" href=""/>
<title>watching The Onion Movie on Boxee. check it out at</title>
<content type="html">&lt;b&gt;watching&lt;/b&gt; The Onion Movie &lt;b&gt;on&lt;/b&gt; &lt;b&gt;Boxee&lt;/b&gt;. check it out at &lt;a href=""&gt;&lt;/a&gt;</content>
<link type="image/png" rel="image" href=""/>
<name>kiranboxee (kiranboxee)</name>



 Look at all that data, there’s the “author’s name”, there’s a timestamp, there’s the Title of the movie as well.. Or rather there’s that “watching The Onion Movie on Boxee” message in there.
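To make that concrete, here is a minimal Python sketch that pulls the link, the username and the movie title out of one entry. It uses Python’s standard XML parser plus one regex, which is a stand-in for illustration and not the extraction used on the Splunk side. The entry is modelled on the snippet above; the URLs are placeholders, and the `<author>` wrapper around `<name>` is assumed from the Atom format.

```python
import re
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical entry; the example.com URLs are placeholders.
entry_xml = """
<entry xmlns="http://www.w3.org/2005/Atom">
  <link type="text/html" rel="alternate" href="http://example.com/statuses/1"/>
  <title>watching The Onion Movie on Boxee. check it out at http://example.com/x</title>
  <author><name>kiranboxee (kiranboxee)</name></author>
</entry>
"""

# Every message is "watching <title> on Boxee", so a lazy match is enough.
TITLE_PATTERN = re.compile(r"watching (?P<title>.+?) on Boxee", re.IGNORECASE)


def extract_fields(xml_text):
    """Return href, username and title for a single Atom entry."""
    entry = ET.fromstring(xml_text.strip())
    link = entry.find(f"{ATOM}link[@rel='alternate']")
    name = entry.findtext(f"{ATOM}author/{ATOM}name", default="")
    message = entry.findtext(f"{ATOM}title", default="")
    match = TITLE_PATTERN.search(message)
    return {
        "href": link.get("href") if link is not None else None,
        # The name looks like "user (user)"; keep the first token.
        "username": name.split()[0] if name else None,
        "title": match.group("title") if match else None,
    }


fields = extract_fields(entry_xml)
print(fields)
```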

Splunk Comes In Handy
Indexing that stuff: Using Erik Swan’s “Web Page Monitor (webping)” application on SplunkBase, I’ve configured my Splunk server to eat the output of this URL*+on+boxee. I have it set up to ping that URL every 300 seconds (5 min). Since Twitter search is only going to give me back about a page full of results, and those results change a lot, I decided every 5 minutes was fine. It turns out that might be too frequent, and you’ll see why soon. I did have to configure props.conf to know where to break events (BREAK_ONLY_BEFORE=\<entry\>), but once I had that done, the XML/RSS events that show each Twitter post on movie viewing were indexed by Splunk. If you didn’t know, we have a Python search operator called “xmlkv” which will actually take those XML elements and turn them into fields. For my purposes, I won’t be using that operator.
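The event-breaking idea is easy to picture with a small Python sketch: start a new event whenever a line matches the break pattern. This only illustrates the concept behind breaking before `<entry>`; it is not Splunk’s line-merging code, and the feed lines are invented.

```python
import re

# Same pattern as the props.conf setting, minus the backslash escapes.
BREAK_BEFORE = re.compile(r"<entry>")


def break_events(lines):
    """Group lines into events, starting a new one before each match."""
    events, current = [], []
    for line in lines:
        if BREAK_BEFORE.search(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events


# Hypothetical feed content, one XML line per list item.
feed_lines = [
    "<entry>",
    "<title>watching Movie A on Boxee</title>",
    "</entry>",
    "<entry>",
    "<title>watching Movie B on Boxee</title>",
    "</entry>",
]
events = break_events(feed_lines)
print(len(events))
```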

Searching – If we run the search source="*+on+boxee" over a 7 day period, we get way more than 50k results. Why? Because we’re indexing a search engine, so there’s a good chance we have a lot of duplicates in there (if I back off my ping time, I might have fewer).

Sidebar:  every Twitter message has a unique number & URL for it. Look up there.. See “href” item in the “link” element–that’s it.

 Another Splunk search operator you probably didn’t know about is called “dedup” which will take search results and de-duplicate them based on the contents of a field.  This search:

source="*+on+boxee" | dedup href

Yields only 321 unique results in the past 7 days… That’s more like it! By using some field extraction with multiline regex searching, we’re pulling out “username” and “title” and then graphing them.
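As a rough picture of what de-duplicating on a field means, here is a minimal Python sketch that keeps the first result seen for each value of a field. The results and URLs are placeholders, and skipping results that lack the field is an assumption made for this sketch, not a statement about every dedup option.

```python
def dedup(results, field):
    """Keep the first result for each distinct value of `field`."""
    seen = set()
    unique = []
    for result in results:
        key = result.get(field)
        if key is None or key in seen:
            # Assumption: results without the field are dropped.
            continue
        seen.add(key)
        unique.append(result)
    return unique


# Hypothetical results: the same tweet picked up by two different pings.
results = [
    {"href": "http://example.com/statuses/1", "username": "kiranboxee"},
    {"href": "http://example.com/statuses/1", "username": "kiranboxee"},
    {"href": "http://example.com/statuses/2", "username": "jlarkins"},
]
unique = dedup(results, "href")
print(len(unique))
```

Polling the same feed every 5 minutes produces exactly this kind of repetition, which is why the raw count is so much larger than the de-duplicated one.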

Boxee Rating Reporting

In my Splunk server I have a “Boxee” dashboard, consisting of a few saved searches that reveal statistics about user activity gleaned from Twitter.  Check in from time to time, and you may see more.

Top programs viewed in past 7 days – via Twitter:  source="*+on+boxee" | dedup href | timechart count(title) by title useother=f usenull=f

Top 10 Viewers in the past 7 days – via Twitter:  source="*+on+boxee" | dedup href | top limit=10 username

If you hadn’t figured it out, I’m a pretty big fan of Splunk. It’s just so darn useful versus a lot of other tools that deal with IT data.
So what did we learn (other than that Wilde uses Twitter)? OK, seriously, what did we learn:

Splunk Search language commands

  1. Transaction
  2. Dedup
  3. Timechart count
  4. Timechart max
  5. Eval

Splunk Applications:

  • Web Page Monitor (Webping)
    It appears, in my application of webping, I probably could back off my ping time to like once an hour because I have a lot of dupes.

Do something cool with Splunk. It causes you to read the docs and learn stuff you didn’t think you needed to know. Got questions? Let me know, I’m happy to help.

Disclaimer:  In regards to what may appear as the viewing of copyrighted material, any and all names, characters, places, locations, locales, business establishments, organizations, associations, groups, entities, dominions, states, nations, governments, beliefs, circumstances, conditions, and events portrayed in this story, text, writing, symbol, image, or illustration are either fictitious or fictitiously used. Any resemblance to real or actual persons (living or dead) are pure coincidence. Any resemblance to real or actual character, characters, place, places, location, locations, locale, locales, business establishment, business establishments, organization, organizations, association, associations, group, groups, entity, entities, dominion, dominions, state, states, nation, nations, government, governments, belief, beliefs, circumstance, circumstances, condition, conditions, event, or events that exist, exists, existed, have existed, or will exist are pure coincidence. Any resemblance to reality is pure coincidence.


Posted by Michael Wilde