You don’t often think about big data and romance at the same time unless you really love analytics, and nobody has ever died of a broken heart chart. However, Valentine’s Day is upon us once more and, across the world, data scientists are looking at analytics around the sales of roses, chocolates and cards with cuddling, fluffy bunnies on them. I’ve written in the past about using data to improve your success at dating and also why big data is like falling in love, but this year I felt I needed to analyze Valentine’s Day from a much more practical perspective.
I’ve been married fifteen years and have three children, and working at Splunk keeps me busy. When Valentine’s Day falls on a Tuesday (AKA “a school night”) and you can’t find a babysitter, the evening tends to consist of getting back late from work, waiting for one of the kids to wake up and probably watching a RomCom (in between the aforementioned child-based interruptions).
Love Actually is guaranteed to be on one of the TV channels, but there are only so many times you can watch Hugh Grant as a dancing Prime Minister, so I thought I’d try to use Splunk to analyze a dataset from MovieLens (*citation below) and figure out which romantic movie to watch. It is a pretty amazing dataset – it includes about 27,000 movies and over 100,000 reviews. It’s worth noting that I had to spend some time carefully checking for and removing “adult” movies, as the dataset was pretty complete!
I started off by adding the MovieLens movie and ratings dataset to Splunk. This automatically gave me a lookup table for the movieID, Title and Genre:
I then created a new table view between this lookup table and the Splunk index that had the ratings data. This gave me a new dataset table view:
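For readers who want to try the same idea outside Splunk, here is a minimal Python sketch of joining a movie lookup table to ratings data. It assumes the standard MovieLens CSV layout (`movies.csv` with `movieId,title,genres` and `ratings.csv` with `userId,movieId,rating,timestamp`). The inline sample rows, movie IDs and rating values are invented for illustration, and the function names `load_lookup` and `enrich_ratings` are hypothetical; this is not how the Splunk table view itself is built.

```python
import csv
import io

# Tiny inline stand-ins for the MovieLens movies.csv and ratings.csv files.
# The IDs and rating values below are made up for illustration only.
MOVIES_CSV = """movieId,title,genres
1,Top Gun (1986),Action|Romance
2,Pretty Woman (1990),Comedy|Romance
3,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
"""

RATINGS_CSV = """userId,movieId,rating,timestamp
10,1,4.0,0
11,1,3.5,0
10,2,4.5,0
12,3,5.0,0
"""


def load_lookup(movies_text):
    """Build a movieId -> {title, genres} lookup table."""
    reader = csv.DictReader(io.StringIO(movies_text))
    return {
        row["movieId"]: {
            "title": row["title"],
            "genres": row["genres"].split("|"),
        }
        for row in reader
    }


def enrich_ratings(ratings_text, lookup):
    """Attach title and genres to each rating row, like a lookup join."""
    enriched = []
    for row in csv.DictReader(io.StringIO(ratings_text)):
        movie = lookup.get(row["movieId"])
        if movie is None:
            continue  # skip ratings for movies missing from the lookup
        enriched.append(
            {
                "movieId": row["movieId"],
                "title": movie["title"],
                "genres": movie["genres"],
                "rating": float(row["rating"]),
            }
        )
    return enriched


lookup = load_lookup(MOVIES_CSV)
rows = enrich_ratings(RATINGS_CSV, lookup)
```

Each entry in `rows` now carries the title and genre list alongside the rating, which is roughly the shape of data a pivot or dashboard would need.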
I then created a Pivot table from the dataset table and turned that into a dashboard that shows the movies with the highest individual ratings if you’re looking for Romance, Romantic Drama, Romantic Comedy or Romantic Action (the words could have been more carefully chosen there, I admit). Click the picture to enlarge the dashboard:
There are some interesting findings from the data. There are some films I’d heard of but a lot I hadn’t. Also, there were movies from the 1920s right up to the modern day. Interestingly, Top Gun came out as a romantic action movie. I can work with that…
I then thought a bit more about the results and realized that there may only have been one or two people who rated a movie really highly, so I went on to create the second part of the dashboard. I took the combined ratings of all movies and then the average rating to see if that shed any more light on which film it should be. First up was all the movies as a Pivot table and then a dashboard element:
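The reasoning here (a single five-star review can beat a well-loved film with many reviews) can be sketched in a few lines of Python. This is only an illustration under assumptions: it treats "combined rating" as the sum of all ratings for a movie, and the titles, rating values and the helper names `summarize` and `top_n` are invented.

```python
from collections import defaultdict

# (title, rating) pairs; the values are invented for illustration.
ratings = [
    ("Movie A", 5.0),                                       # one enthusiastic reviewer
    ("Movie B", 4.0), ("Movie B", 4.5), ("Movie B", 3.5),   # several reviewers
    ("Movie C", 3.0), ("Movie C", 4.0),
]


def summarize(ratings):
    """Return combined (summed) rating, review count and average per title."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for title, rating in ratings:
        totals[title] += rating
        counts[title] += 1
    return {
        title: {
            "combined": totals[title],
            "count": counts[title],
            "average": totals[title] / counts[title],
        }
        for title in totals
    }


def top_n(summary, n=10):
    """Rank titles by combined rating, highest first, and keep the top n."""
    ranked = sorted(
        summary.items(), key=lambda item: item[1]["combined"], reverse=True
    )
    return ranked[:n]


summary = summarize(ratings)
top = top_n(summary)
```

Ranked by highest individual rating, "Movie A" would win on a single review; ranked by combined rating, "Movie B" comes first because more people rated it, and its average is shown alongside for context.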
This helped a little bit. It gave me a list of the combined ratings for the top ten movies and their average. Not sure Pulp Fiction or Star Wars are going to create the right kind of atmosphere. I took the top ten table and added the filter to make sure “Romance” was included in the genre. This gave me slightly better results:
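The genre filter is simple to express in Python as well. The sketch below assumes the MovieLens convention of pipe-separated genre strings; the combined rating numbers are invented, and `romance_only` and `top_n` are hypothetical helper names rather than anything Splunk provides.

```python
# (title, genres, combined rating) rows; the rating numbers are invented.
movies = [
    ("Pulp Fiction (1994)", "Comedy|Crime|Drama|Thriller", 300.0),
    ("Forrest Gump (1994)", "Comedy|Drama|Romance|War", 290.0),
    ("Titanic (1997)", "Drama|Romance", 200.0),
    ("Pretty Woman (1990)", "Comedy|Romance", 180.0),
]


def romance_only(movies, genre="Romance"):
    """Keep only rows whose pipe-separated genre list contains the genre."""
    return [row for row in movies if genre in row[1].split("|")]


def top_n(movies, n=10):
    """Sort rows by combined rating, highest first, and keep the top n."""
    return sorted(movies, key=lambda row: row[2], reverse=True)[:n]


shortlist = [title for title, _, _ in top_n(romance_only(movies))]
```

Splitting the genre string before testing membership avoids accidental substring matches, and filtering before taking the top ten means non-romantic films cannot crowd out the shortlist.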
I still wasn’t convinced that Forrest Gump was the film of choice (I don’t really want to be encouraging my wife to “run Forrest run”). I’m not really feeling the love from True Lies or Speed. Keanu Reeves and Arnold Schwarzenegger aren’t really my romantic companions of choice.
At least I’ve ended up with a realistic shortlist – looks like my Valentine’s Day evening is either Pretty Woman or Titanic. Perhaps if I’m at my most charming I can see if I can get away with Top Gun…
Have a great Valentine’s Day. As always thanks for reading.
* F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872