Splunking pitchfork album reviews

One of my favorite sites is the record review and music news site pitchfork media. On the site they have a bunch of interesting statistics like top record for each decade/year but these are obviously a more subjective list than if they crunched the raw stats. For example their #1 album of the nineties is Radiohead’s “Ok Computer” (rated 10.0) and the #15 is “The Bends” by Radiohead ( which isn’t reviewed on the site at all ). I was interested in crunching the data provided by their wealth of reviews. So I downloaded all the record reviews using a simple python script. And parsed out the description, rating, label, reviewer, release year, title and artist using the following regex :

.*?<h2 class=”fn”>\s*(.*?):<br />([^\n]*)\n.*?<div class=”info”>\n\[([^<;]*);?\s*(\d*)\]?.*?<span class=”rating”>(.*?)<.*?<div class=”content description”>(.*?)</div>.*? – <span class=”reviewer”><span class=”vcard”><span class=”fn”>(.*?)</span>.*?title=”\d+”>(.*?)<

I can now run some interesting queries :

  • * | chart avg(rating) by releaseYear

    Which graphs the average rating per calendar year of the release.
  • *| stats count(title), avg(rating) by artist | search “count(title)”>2| sort “avg(rating)” d | head 10

    This shows the top rated artists that have a least 3 reviews on pitchfork
  • * rating<=10 rating>0 | stats avg(rating) as avg_rating, count(title) as title_count by label | search title_count>3 | sort avg_rating | head 10

    This shows that Invisible Records are the worst reviewed label on Pitchfork.
  • * | stats count(title), avg(rating) by reviewer | search “count(title)”>4 “avg(rating)”>7.5 | sort “avg(rating)” d

    This search finds all the reviewers that have at least 5 reviews and on average score higher than 7.5. So if you want a good review on pitchfork you’re better off with Luke Buckman :)
  • * | eventstats count(title) as titleCount by reviewer | search eventtype=7_dirty_words titleCount>3 | stats count(title) as ct ,max(titleCount) as mf by reviewer | eval blue_index=ct*1.0/mf | sort blue_index d

    This is my personal favourite, it’s a list of reviewers most likely to use the one of George Carlin’s seven dirty words (nsfw). The mf column is the count of reviews with one of the words and the ct row is the review count for that reviewer. The blue_index is the mf/ct.

So there you go : Splunk > it’s not just for logs.

Brian Murphy
Senior Software Splunker
Splunk Inc.

Posted by


Join the Discussion