Scrape the web for football play-by-play data, part 1
02 Sep 2013
UPDATE x3: part three introduces the R package pbp, which contains the most up-to-date version of this software.
UPDATE x2: part two is up, where we parse the detailed play-by-play data for one game. Next up: compiling a database of plays from many games.
UPDATE: part 1.5 uses XPath to traverse the webpage for the relevant data, rather than running regular expressions directly on the HTML, as I’ve done here. The code in the linked post supersedes this one.
##Football’s back, everyone!
My fascination (obsession, Kate might say) with sports is not a minority position nationally, let alone in Madison, Wisconsin. Yet within my social milieu, it stands out. In order to clothe it in respectability, I’m forced to turn it into that most euphemistic of bores: a learning experience.
So if my friends want to learn from my blog how to scrape text data from the Web and analyze it in R, then dammit they’re going to have to hear about football in the bargain.
##The lay of the land
So. You’re hoping to start a career analyzing college football advanced statistics, and you’re going to need data. Happily, scorekeeping seems fairly well standardized across college football. Sadly, the play-by-play is pretty rudimentary: for each play we get the location, the down and distance to go, the passer and/or ball carrier, the yardage gained, and sometimes the tackler. Time on the clock is recorded only at the beginning of each drive. Short of a herculean effort like the one Football Outsiders is putting in as they re-watch every professional game ever played, that’s the data that’s available.
With something like 100 games happening each week, you’re not going to enter your data by hand, are you? No! You, dear reader, are much too smart for that. You’re going to scrape ESPN’s website for the play-by-play data and parse it with regular expressions.
##Aw shit, he’s processing text in R
The web is made of text, and I already told you that we’re going to process the webpages with regular expressions, so obviously we’ll write our play-by-play processing script in Python or Perl, right? I wish I could tell you that living in the year 2013 meant scripting everything in Python, Junior, but life’s not going to just serve up pure buttered Win on a platter. We’ll parse the webpage in R because we want to eventually analyze our data in R.
While it would be possible to use Python to parse the webpages into R-friendly tables, I’m going to use R for everything. Again, it’s a learning experience.
The script is below. Some of the R manpages I used while creating this script are:
- `regex` (generic R regex) - All the details you don’t find in the other manpages are probably here.
- `gregexpr` - Match a pattern multiple times on a single string, returning the indexes and lengths of each matched substring and of any captured groups.
- `substr` - Extract a substring from a longer string using the indexes of the substring’s first and last characters.
- `adist` - The ‘distance’ (generalized Levenshtein edit distance) between two strings, which I use to decipher the various team name abbreviations.
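To give a feel for how those functions fit together, here is a minimal sketch. It is not the script from this post: the play text, the pattern, the abbreviations, and the team names are all invented for illustration. It pulls the down and distance out of a string with `gregexpr` and `substr`, then uses `adist` to guess which full team name an abbreviation stands for.

```r
# Hypothetical play-by-play text, made up for this example
# (not copied from any real page).
line <- paste("1st and 10 at WISC 25: Smith rush for 7 yards.",
              "2nd and 3 at WISC 32: Jones rush for 12 yards.")

# Two capture groups: the down and the yards to go.
pattern <- "([1-4])(?:st|nd|rd|th) and ([0-9]+)"

# With perl = TRUE, gregexpr records where each capture group starts and
# how long it is, in the "capture.start" and "capture.length" attributes
# (one row per match, one column per group).
m <- gregexpr(pattern, line, perl = TRUE)[[1]]
starts  <- attr(m, "capture.start")
lengths <- attr(m, "capture.length")

# substr needs first and last character indexes, so convert
# (start, length) into (start, stop). The text is repeated once per match
# because substr returns one result per element of its first argument.
text <- rep(line, nrow(starts))
down <- substr(text, starts[, 1], starts[, 1] + lengths[, 1] - 1)
togo <- substr(text, starts[, 2], starts[, 2] + lengths[, 2] - 1)

# Guess the team behind each abbreviation: compute the edit distance from
# every abbreviation (rows) to every full name (columns) and keep the
# closest name. Abbreviations and names here are illustrative assumptions.
abbrevs <- c("WISC", "MICH", "PUR")
teams   <- c("Wisconsin", "Michigan", "Purdue")
d <- adist(abbrevs, teams, ignore.case = TRUE)
best <- teams[apply(d, 1, which.min)]

print(data.frame(down = down, togo = togo))
print(data.frame(abbrev = abbrevs, team = best))
```

One caveat on the `adist` step: plain edit distance is dominated by the difference in string lengths, so a short abbreviation can end up closer to a short wrong name than to a long right one. Treat the nearest match as a guess to check, not a guarantee.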