# Scrape the web for football play-by-play data, part 2

04 Sep 2013

UPDATE: Part three of this series introduces the R package pbp, which contains the most up-to-date version of this software.

Last things first: here’s an extremely quick look at the distribution of rushing gains by Wisconsin’s running backs in that game, based on the script we’re developing in this series:

This is part two in a series. It will make more sense if you begin with part one, or at least part 1.5.

####The story thus far

At this point we have a list in R called plays, with an entry for each play in a given college football game. Each item in the list is itself a list, with the elements poss, indicating possession; down for the down; togo for the yards to go; dist, the distance to the goal line; time, the approximate game time remaining (in seconds); and pbp, the narrative play-by-play text for this play.
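To make that structure concrete, here is a minimal sketch of what a single entry might look like. The element names come from the description above and the pbp text is the example string used in this post; the values for poss, down, togo, dist and time are illustrative placeholders, not taken from the real game data.

```r
# A one-play stand-in for the `plays` list described above.
# All numeric values here are placeholders chosen for illustration.
plays <- list(
  list(
    poss = "UMass",  # team in possession (placeholder)
    down = 1,        # down (placeholder)
    togo = 10,       # yards to go (placeholder)
    dist = 65,       # distance to the goal line (placeholder)
    time = 3000,     # approximate seconds remaining (placeholder)
    pbp  = "Stacey Bedell rush for no gain, fumbled, forced by Brendan Kelly, recovered by Wisc Ethan Armstrong at the UMass 35."
  )
)

# Pull one field out of every play, e.g. the down for each play.
downs <- sapply(plays, function(p) p$down)
```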

An example play-by-play string:

Stacey Bedell rush for no gain, fumbled, forced by Brendan Kelly, recovered by Wisc Ethan Armstrong at the UMass 35.


Obviously, that’s information we want to be able to analyze, but the computer is dumb and can’t understand a simple, non-grammatical sentence like that one.

Once again, we turn to regular expressions. We’ll divide all possible football plays into a few types and compare the play-by-play text for each play to a regex for each type. When the play matches a type, we can extract the roles that are relevant to that play type (e.g. pass plays have a passer and a receiver, but rush plays only have a ball-carrier). I’ve chosen to break plays into these categories (each bullet point will get its own regular expression):

#####Special teams:

• kickoff
• punt
• extra point (PAT)
• field goal

#####Scrimmage plays:

• rush
• pass
• interception

#####Results:

• fumble
• penalty
• touchdown
• first down

#####Other:

• timeout

In each case, we’re going to use the utility function regex from the earlier post to extract named groups matching the play’s roles.
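As a rough illustration of the idea, the sketch below pulls named groups out of a play-by-play string using only base R. The helper extract_named is a hypothetical stand-in, not the regex utility from the earlier post, and the rush pattern is a simplified assumption rather than the pattern used in the actual script.

```r
# Hypothetical helper: return the named capture groups of `pattern` in `text`
# as a named character vector, or NULL when the text does not match.
extract_named <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (as.integer(m) == -1L) return(NULL)
  starts  <- as.vector(attr(m, "capture.start"))
  lengths <- as.vector(attr(m, "capture.length"))
  groups  <- substring(text, starts, starts + lengths - 1)
  names(groups) <- attr(m, "capture.names")
  groups
}

# Simplified, assumed pattern for a rush play: a ball-carrier and a gain.
rush_regex <- "^(?<carrier>[A-Za-z'. -]+?) rush for (?<gain>no gain|a loss of \\d+ yards?|\\d+ yards?)"

pbp <- "Stacey Bedell rush for no gain, fumbled, forced by Brendan Kelly, recovered by Wisc Ethan Armstrong at the UMass 35."

rush <- extract_named(rush_regex, pbp)          # named vector: carrier, gain
not_a_rush <- extract_named(rush_regex, "Timeout UMass")  # NULL, no match
```

A play that matches none of the scrimmage or special-teams patterns returns NULL for each, which is how the play type can be decided by trying the patterns one after another.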

Note that college football scores a sack as a rush, which is silly. But negative rush plays are not uncommon, so in order to reclassify sacks as pass plays we need to figure out who the quarterbacks are and then call any quarterback run for negative yardage a sack. For this purpose, I’ve chosen to call any player who throws at least two passes in a game a quarterback.
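The rule is simple enough to sketch on toy data. The field names type, passer, carrier, gain and sack below are assumptions made for this example, and the players and yardages are invented; the real script may store parsed plays differently.

```r
# Toy parsed plays (invented data; field names are assumptions).
parsed <- list(
  list(type = "pass", passer  = "QB One", gain = 12),
  list(type = "pass", passer  = "QB One", gain = 0),
  list(type = "rush", carrier = "QB One", gain = -7),  # should become a sack
  list(type = "rush", carrier = "RB One", gain = -2),  # ordinary loss on a run
  list(type = "pass", passer  = "RB One", gain = 20)   # one trick-play pass
)

# Count passes thrown by each player; NULLs from non-pass plays are dropped.
passers <- unlist(lapply(parsed, function(p) if (p$type == "pass") p$passer))
pass_counts <- table(passers)

# Anyone with at least two passes in the game counts as a quarterback.
quarterbacks <- names(pass_counts)[pass_counts >= 2]

# Reclassify negative-yardage quarterback rushes as sacks on pass plays.
parsed <- lapply(parsed, function(p) {
  if (p$type == "rush" && p$carrier %in% quarterbacks && p$gain < 0) {
    p$type   <- "pass"
    p$sack   <- TRUE
    p$passer <- p$carrier
  }
  p
})
```

With the two-pass threshold, the running back who throws a single trick-play pass is not treated as a quarterback, so his two-yard loss stays a rush.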

Some scorers record tacklers but most don’t. I haven’t bothered trying to catch tacklers here.

Here’s the code. It should work if appended to the code from part 1.5:

Of course, this data is just for one game. For more detailed analysis, we’ll need to create a database of plays from several games. Stay tuned.