*For legal reasons, neither the author nor Splunk Inc. can guarantee victory in Fantasy Football.
Aside from featuring Jonah Hill’s breakout turn as a dramatic actor, the critically acclaimed 2011 film Moneyball is notable in that it covers perhaps the most famous collision between data analysis and sports. It chronicles how, using a data-driven approach, Billy Beane (GM of MLB’s Oakland Athletics) was able to remove biases, uncover undervalued baseball stars and, on a shoestring budget, assemble a team which went on to win 20 consecutive games.
The film has always resonated with me - but, unfortunately, the closest I’m ever likely to come to managing a major sports team is through Fantasy Premier League, an online game where you construct a virtual team of players and score points based on their real-life performances. So, after a disappointing 6th place finish in the Splunk office league last season and with the September 12th start date of the new season fast approaching, I decided now was the time to utilise Splunk to emulate Billy.
Here is part one of my guide on how to use Splunk and the Machine Learning Toolkit to win your FPL league by not only identifying undervalued players to add to your team but also (in part two) using machine learning to project future performance.
Any data science investigation lives and dies by the quality of the data available and, fortunately, there’s plenty available for FPL. Whilst Splunk-ingestible raw JSON data is available from FPL’s API, there’s also a regularly updated historical archive of data from the API in CSV format available on GitHub.
After simply uploading the CSV archive into Splunk, the contents of the archive can be visualised. Although there are many different event types, the most useful for the goal of finding undervalued players (and predicting future week-by-week player scores) is gws, which contains detailed performance information on every player for every game week going back to 2015 - although I’ll only be using data from the 2019/20 season.
By visualising this game week data, we can easily chart player performance throughout last season and uncover deeper insights than each player’s total points alone. For example, the below column chart shows that after round 24 the points of top-scoring Manchester City player Kevin De Bruyne were roughly level with those of Manchester United’s mid-season signing Bruno Fernandes, with the latter midfielder outscoring the former in 7 of the 14 game weeks - suggesting he may represent an overlooked, comparable alternative to the expensive De Bruyne in the upcoming season.
Whilst the data does contain information on goals scored, minutes played and even metrics such as “creativity” and “threat” which will prove interesting to explore with the algorithms bundled in the MLTK - it lacks information on pricing (this is Moneyball after all) and team context.
Fortunately, using the |outputlookup and |lookup commands, this information can be extracted from players_raw.csv and teams.csv to enrich the game week data, offering visualisations such as the following scatter graph, which is hugely useful in finding the players who offer the best point-scoring value for money - represented by the nodes closest to the top left.
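To make the enrichment step concrete, here is a minimal sketch of the same idea in plain Python rather than SPL. It is not the search used to build the chart: the column names (element, total_points, id, web_name, now_cost, team) are assumptions modelled on the CSV archive, now_cost is assumed to be stored in tenths of a million, and the sample rows are invented.

```python
from collections import defaultdict

# Invented sample rows; the column names are assumptions about the CSV archive.
gw_rows = [
    {"element": 1, "round": 1, "total_points": 6},
    {"element": 1, "round": 2, "total_points": 2},
    {"element": 2, "round": 1, "total_points": 9},
    {"element": 2, "round": 2, "total_points": 12},
]
players_raw = [
    {"id": 1, "web_name": "Player A", "now_cost": 45, "team": 10},
    {"id": 2, "web_name": "Player B", "now_cost": 120, "team": 20},
]
teams = [{"id": 10, "name": "Team X"}, {"id": 20, "name": "Team Y"}]


def enrich_and_score(gw_rows, players_raw, teams):
    """Join game week rows to price and team, then rank by points per million."""
    # These two dictionaries play the role of the lookup tables.
    player_by_id = {p["id"]: p for p in players_raw}
    team_name_by_id = {t["id"]: t["name"] for t in teams}

    # Sum each player's points across all game weeks.
    totals = defaultdict(int)
    for row in gw_rows:
        totals[row["element"]] += row["total_points"]

    table = []
    for player_id, points in totals.items():
        player = player_by_id.get(player_id)
        if player is None:
            continue  # no pricing information for this player, so skip them
        price_m = player["now_cost"] / 10  # assumed to be tenths of a million
        table.append({
            "name": player["web_name"],
            "team": team_name_by_id.get(player["team"], "unknown"),
            "price_m": price_m,
            "total_points": points,
            "points_per_million": round(points / price_m, 2),
        })
    # Best value first.
    return sorted(table, key=lambda r: r["points_per_million"], reverse=True)


value_table = enrich_and_score(gw_rows, players_raw, teams)
```

Sorting by points per million is only one way to express "value for money"; the scatter graph shows the same trade-off visually by plotting price against points.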
Whilst the scatter graph presents squad candidates such as Burnley Midfielder Ashley Westwood and Brighton Goalkeeper Matthew Ryan as being amongst the best points/value options, there’s still some data wrangling to do before the data mining stage.
Firstly, as well as having new prices for the upcoming season, some players have had their positions updated. The new positions can easily be added to the GW events with a lookup to the 2020-21/players_raw.csv file, but because a player’s position affects the points awarded for in-game events (e.g. goals and clean sheets), last season’s points need to be recalculated for the new position using the official scoring rules on the FPL website.
After doing so, we can see just how drastically this affects certain players - for example in his new midfield position Pierre-Emerick Aubameyang would’ve been the 2nd highest scoring player as opposed to the 6th, potentially meaning he goes undervalued heading into the new season.
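The position adjustment can be sketched roughly as follows. This is an illustrative simplification, not the actual search: it only re-scores goals and clean sheets, using the per-position values as I understand the FPL rules (worth checking against the official scoring page), and it ignores other position-dependent items such as goals-conceded deductions. The function name and the example numbers are hypothetical.

```python
# Points per event by position, as I understand the FPL scoring rules;
# verify against the official rules before relying on them.
GOAL_POINTS = {"GK": 6, "DEF": 6, "MID": 5, "FWD": 4}
CLEAN_SHEET_POINTS = {"GK": 4, "DEF": 4, "MID": 1, "FWD": 0}


def rescore_for_position(total_points, goals, clean_sheets, old_position, new_position):
    """Re-score a season total as if the player had been listed in new_position.

    Simplification: only goals and clean sheets are adjusted.
    """
    goal_delta = GOAL_POINTS[new_position] - GOAL_POINTS[old_position]
    clean_sheet_delta = (
        CLEAN_SHEET_POINTS[new_position] - CLEAN_SHEET_POINTS[old_position]
    )
    return total_points + goals * goal_delta + clean_sheets * clean_sheet_delta


# Hypothetical forward reclassified as a midfielder (numbers are invented).
adjusted = rescore_for_position(
    200, goals=20, clean_sheets=8, old_position="FWD", new_position="MID"
)
```

In this invented case the player gains one point per goal and one per clean sheet, which is why a forward moved into midfield can climb the rankings without anything changing on the pitch.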
Secondly, for the purposes of this initial lineup optimization, I want to lean towards the players who are in better form and finished the season strongly. I can compute this by adding a join search on game week events with a round greater than 29 (the week in which the season paused due to lockdown regulations), and adjusting the points based on new position as discussed earlier.
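As a rough illustration of the form filter (again in Python rather than SPL, with assumed column names and invented rows), the post-resumption points per game could be computed like this. Averaging over every row after round 29, including games where a player did not feature, is my assumption; the real search may count games differently.

```python
def form_points_per_game(gw_rows, restart_round=29):
    """Average points per game week for rounds strictly after restart_round."""
    totals = {}  # player id -> (points, games)
    for row in gw_rows:
        if row["round"] <= restart_round:
            continue  # only keep the post-resumption game weeks
        points, games = totals.get(row["element"], (0, 0))
        totals[row["element"]] = (points + row["total_points"], games + 1)
    return {pid: points / games for pid, (points, games) in totals.items()}


# Invented sample rows; "element" is assumed to be the player id column.
sample_rows = [
    {"element": 1, "round": 29, "total_points": 10},  # excluded: not after 29
    {"element": 1, "round": 30, "total_points": 6},
    {"element": 1, "round": 31, "total_points": 9},
    {"element": 2, "round": 30, "total_points": 2},
    {"element": 2, "round": 31, "total_points": 3},
    {"element": 2, "round": 32, "total_points": 4},
]
form = form_points_per_game(sample_rows)
```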
Now that each player has an updated position, price and post-resumption points/game, the next step is to generate an optimised lineup.
This is assembled by introducing into Splunk a Python script which solves a variation on the Knapsack Problem. It is given a budget of 83.5 million (the 100 million maximum minus the minimum cost of the 4 bench players who won’t play in this hypothetical “set and forget” team) and picks, from all of the valid FPL formations, the starting eleven with the highest possible points per game.
Doing so produces the above team in a 3-4-3 formation, which scored an astonishing 83.5 points a game week since June and would project to 3082 points over a 38-week season (assuming top scorer Raheem Sterling was captained every week to double his individual points), a total which would’ve topped the global FPL league by a substantial margin.
While a useful exercise in identifying undervalued/in-form players for the upcoming season, the Moneyball XI is unlikely to score 3000 points this season. For one, it doesn’t take into account the team changes and injuries that selected players Doherty, Willian and Otto will have to deal with come the start of the season, and for another, it relies on players like Sterling continuing their form from June-July (where he scored 10 goals in 9 games) through a complete fixture list.
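For readers curious what a knapsack-style lineup picker might look like, here is a small self-contained sketch. It is not the script used to produce the team above: the dynamic programme, the Player type, the best_lineup function and the toy pool are all my own illustration, prices are assumed to be in tenths of a million, and FPL's limit of three players per club is deliberately left out to keep it short.

```python
from typing import NamedTuple

POSITIONS = ("GK", "DEF", "MID", "FWD")
# Formation limits for a starting eleven: 1 GK, 3-5 DEF, 2-5 MID, 1-3 FWD.
MIN_COUNT = {"GK": 1, "DEF": 3, "MID": 2, "FWD": 1}
MAX_COUNT = {"GK": 1, "DEF": 5, "MID": 5, "FWD": 3}


class Player(NamedTuple):
    name: str
    position: str
    price: int   # tenths of a million, so 55 means 5.5m (an assumption)
    ppg: float   # points per game


def best_lineup(players, budget, size=11):
    """0/1 knapsack over players with per-position count constraints."""
    # Each state maps (players picked per position, money spent) to the
    # best (total points per game, chosen names) found so far.
    states = {((0, 0, 0, 0), 0): (0.0, ())}
    for p in players:
        i = POSITIONS.index(p.position)
        # Iterate over a snapshot so each player is added at most once.
        for (counts, spent), (points, names) in list(states.items()):
            if counts[i] >= MAX_COUNT[p.position] or sum(counts) >= size:
                continue
            if spent + p.price > budget:
                continue
            key = (counts[:i] + (counts[i] + 1,) + counts[i + 1:], spent + p.price)
            candidate = (points + p.ppg, names + (p.name,))
            if key not in states or candidate[0] > states[key][0]:
                states[key] = candidate

    def is_valid(counts):
        # Maximums are enforced while building, so only size and minimums remain.
        return sum(counts) == size and all(
            n >= MIN_COUNT[pos] for pos, n in zip(POSITIONS, counts)
        )

    finished = [
        (points, spent, names)
        for (counts, spent), (points, names) in states.items()
        if is_valid(counts)
    ]
    if not finished:
        return None  # no valid eleven fits the budget
    points, spent, names = max(finished, key=lambda s: s[0])
    return {"points_per_game": points, "cost": spent, "players": names}


# Invented toy pool: 2 goalkeepers, 4 defenders, 4 midfielders, 3 forwards.
pool = [
    Player("GK1", "GK", 50, 5.0), Player("GK2", "GK", 45, 4.0),
    Player("D1", "DEF", 55, 6.0), Player("D2", "DEF", 50, 5.0),
    Player("D3", "DEF", 45, 4.5), Player("D4", "DEF", 40, 3.0),
    Player("M1", "MID", 120, 9.0), Player("M2", "MID", 80, 7.0),
    Player("M3", "MID", 60, 6.0), Player("M4", "MID", 50, 4.0),
    Player("F1", "FWD", 110, 8.0), Player("F2", "FWD", 70, 6.0),
    Player("F3", "FWD", 60, 5.0),
]
lineup = best_lineup(pool, budget=740)
```

With this toy pool the cheapest defender cannot be the one left out, because the budget would then be exceeded, so the solver instead drops a slightly more expensive midfielder. The same trade-off between price and points is what drives the real lineup, just across a much larger pool of players.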
For a more accurate and contextual projection of player attainment, keep your eyes peeled for part 2 of this blog where I’ll show how to use the Splunk Machine Learning Toolkit to build on the data preparation staged here and forecast week by week performance for the upcoming season.
In the meantime,