Gathering RealClearPolitics Polling Trends with XML

Now that the election is over, you may want to use polling data in a model of the campaign. Simon Jackman has thoughtfully made his daily state-by-state predictions available for download, but another commonly used dataset is the RealClearPolitics polling average.

As you can see when you go to RCP, they have a nice HTML5 graph (screenshot above), over which you can hover with your mouse to reveal daily point estimates. Unfortunately, the numbers behind those point estimates are a little tricky to tease out (at least, they were tricky for me). Still, I managed to wrangle out the Romney vs. Obama daily averages, which you can download here [CSV].

Better yet, RCP stores its time series data in XML, meaning that the method I used to get those Romney vs. Obama numbers can be used to collect any RCP data, such as from this comparison of Obama & Bush Job Approval. Just view source, [CTRL-F] for “xml,” and try to identify the XML file from which the graph is drawing data:

In this case, the file appears to be o_vs_b6.xml, which we can find listed in this directory of all RCP XML files and graph-drawing code.

From there, you can just use the R package XML and the following code as a guide for neatly folding the XML data into a data.frame. It will take a little effort on your part (i.e., it’s not just “CTRL-A, CTRL-R”), but the XML should be consistently formatted, and thus not too difficult to parse.
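
Here is a minimal sketch of that folding step. It assumes a chart-style layout in which dates sit in category nodes and each candidate's values sit in a dataset node. The element and attribute names in the inline sample (chart, categories, category, dataset, set, name, seriesname, value) and the numbers themselves are made up for illustration and are not taken from RCP's files, so check the real file and adjust the XPath expressions to match.

```r
library(XML)

# Hypothetical sample mimicking a chart-style XML file. The element names,
# attribute names, and values are assumptions for illustration only.
sample_xml <- '<chart>
  <categories>
    <category name="10/01/2012"/>
    <category name="10/02/2012"/>
    <category name="10/03/2012"/>
  </categories>
  <dataset seriesname="Obama">
    <set value="49.1"/>
    <set value="49.0"/>
    <set value="49.3"/>
  </dataset>
  <dataset seriesname="Romney">
    <set value="45.3"/>
    <set value="45.6"/>
    <set value="46.0"/>
  </dataset>
</chart>'

# Parse the string; for a downloaded file, pass its path and drop asText.
doc <- xmlParse(sample_xml, asText = TRUE)

# One category node per day: pull the date label from each.
dates <- xpathSApply(doc, "//categories/category", xmlGetAttr, "name")

# One dataset node per series: pull its name and its daily values.
series_nodes <- getNodeSet(doc, "//dataset")
series_names <- sapply(series_nodes, xmlGetAttr, "seriesname")
series_values <- lapply(series_nodes, function(node) {
  as.numeric(xpathSApply(node, "./set", xmlGetAttr, "value"))
})
names(series_values) <- series_names

# Fold everything into one data.frame: a date column plus one column per series.
polls <- data.frame(
  date = as.Date(dates, format = "%m/%d/%Y"),
  series_values,
  stringsAsFactors = FALSE
)
print(polls)
```

This relies on every series having exactly one value per date, in the same order as the dates. If a file has gaps, the columns will be different lengths and data.frame() will stop with an error, which is a useful signal to look at the XML more closely.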

By d-sparks

Tags: rstats XML lubridate