# Scrape the web for football play-by-play data, part 1.5

03 Sep 2013

UPDATE x2: Part three of this series introduces the R package pbp, which contains the most up-to-date version of this software.

UPDATE: part two is up, where we parse the detailed play-by-play data for one game. Next up: compiling a database of plays from many games.

####Mea Culpa

After yesterday’s post (part one in this series), some folks asked why I was applying regular expressions directly to HTML rather than using a tool like XPath to navigate the DOM. The sad truth is that, prior to fielding those questions, I had no idea that XPath even existed, let alone that it had an R implementation in the XML package.

I’ve rewritten the example to use XPath, but we don’t get to unleash its full power here because the play-by-play data we’re scraping (see this example) is stored in a flat table. It seems to me that XPath will be most useful for highly structured data.

Nevertheless, pulling out just the text of each table cell with XPath, which leaves all the HTML behind, makes it easier to write the regular expressions that will interpret the narrative play-by-play text (e.g., Joel Stave pass complete to Jared Abbrederis for 65 yards for a TOUCHDOWN.)

####A helper function

One thing we’re going to do a lot of is run a regular expression against a string and then extract all the named capturing groups. To make life easier, I’ve written a utility function for that purpose. You pass in the pattern and the string to match; it returns a table where each row contains the complete set of named capturing groups (unmatched optional groups are returned as NA).
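A minimal sketch of what such a helper might look like, using only base R. The function name `regex_named`, the example pattern, and the second sentence of the sample string are assumptions made up for illustration; this is not necessarily how the original utility was written. It relies on the `capture.start`, `capture.length`, and `capture.names` attributes that `gregexpr()` returns when `perl = TRUE`.

```r
# Hypothetical helper: one row per match, one column per named capturing group.
regex_named <- function(pattern, string) {
  stopifnot(is.character(string), length(string) == 1L)
  m <- gregexpr(pattern, string, perl = TRUE)[[1]]

  # Keep only the groups that actually have a name.
  nms <- attr(m, "capture.names")
  named <- which(nzchar(nms))
  if (length(named) == 0L) stop("pattern has no named capturing groups")

  if (m[1] == -1L) {
    # No match at all: return a zero-row table with the right columns.
    out <- rep(list(character(0)), length(named))
  } else {
    starts <- attr(m, "capture.start")[, named, drop = FALSE]
    lens <- attr(m, "capture.length")[, named, drop = FALSE]
    out <- lapply(seq_along(named), function(j) {
      s <- starts[, j]
      val <- substring(string, s, s + lens[, j] - 1L)
      # A start position below 1 means the optional group did not take part.
      val[s < 1L] <- NA_character_
      val
    })
  }
  names(out) <- nms[named]
  data.frame(out, stringsAsFactors = FALSE, check.names = FALSE)
}

# The first sentence is the example from this post; the second is invented.
plays <- paste(
  "Joel Stave pass complete to Jared Abbrederis for 65 yards for a TOUCHDOWN.",
  "Joel Stave pass complete to Jared Abbrederis for 7 yards."
)
pattern <- paste0(
  "(?<passer>[A-Z][a-z]+ [A-Z][a-z]+) pass complete to ",
  "(?<receiver>[A-Z][a-z]+ [A-Z][a-z]+) for (?<yards>\\d+) yards?",
  "(?: for a (?<score>TOUCHDOWN))?"
)
res <- regex_named(pattern, plays)
print(res)
```

Every column comes back as character, so numeric fields such as `yards` would still need `as.integer()` before any arithmetic.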

####Part 1 revisited

With that done, here’s the code to duplicate yesterday’s effort using XPath. The next part will extract detailed information for each play.
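As a rough illustration of the XPath approach, here is a sketch that runs against a small inline HTML fragment rather than a live page. The fragment, including its down-and-distance cell, is a made-up stand-in; the real page's markup and the XPath expression it needs may well differ. The calls `htmlParse()`, `xpathSApply()`, and `xmlValue()` come from the XML package.

```r
library(XML)  # provides htmlParse(), xpathSApply(), xmlValue()

# Hypothetical stand-in for a play-by-play page: a flat table of cells.
html <- paste0(
  "<html><body><table>",
  "<tr><td>1st and 10 at WIS 35</td>",
  "<td><b>Joel Stave</b> pass complete to Jared Abbrederis ",
  "for 65 yards for a TOUCHDOWN.</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML, then select every table cell with XPath.
doc <- htmlParse(html, asText = TRUE)

# xmlValue() returns the text content of each cell, with nested tags dropped.
cells <- trimws(xpathSApply(doc, "//table//tr/td", xmlValue))
print(cells)
```

Because the table is flat, the XPath expression does little more than say "give me every cell"; the regular expressions still do the real interpretive work on the plain text that comes out.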