lme4 and its cousin arm are extremely useful for a huge variety of modeling applications (see Gelman and Hill’s book), but today we’re going to do something a little frivolous with them. Namely, we’re going to extend our Denver Debate analysis to include some sense of error.
Instead of the term-frequency scatter plot seen in the previous post, this code fits the most basic possible partially-pooled model predicting which of the two candidates, Obama or Romney, spoke a given term. This allows us to get a slightly better idea of which candidate “owned” a term on the night, and simultaneously accounts for volume of usage (evidenced by narrower confidence intervals).
Anyway, we will almost certainly return to lmer() at some point in the future, but this code offers some ideas as to how best to translate a model object into a data frame amenable to plotting.
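To make the idea concrete, here is a minimal sketch of that workflow. It is not the Gist itself: the data are simulated, the column names `term` and `obama` are hypothetical, and fitting with `glmer()` and a binomial family is my assumption about what "the most basic possible partially-pooled model" looks like for a two-candidate outcome. It also assumes a reasonably recent lme4, where `ranef()` output has an `as.data.frame()` method.

```r
library(lme4)

set.seed(1)

# Simulated stand-in for the debate data: one row per use of a term,
# with obama = 1 if Obama said it and 0 if Romney did (hypothetical names).
n_terms <- 20
terms <- paste0("term", seq_len(n_terms))
usage <- sample(5:40, n_terms, replace = TRUE)   # how often each term was used
true_logodds <- rnorm(n_terms, 0, 1.5)           # made-up term-level leanings

d <- data.frame(term = rep(terms, times = usage))
d$obama <- rbinom(nrow(d), 1, plogis(rep(true_logodds, times = usage)))

# Partially-pooled model: a varying intercept for each term
fit <- glmer(obama ~ 1 + (1 | term), data = d, family = binomial)

# Turn the model object into a plotting-friendly data frame:
# one row per term, with the estimate and its conditional standard deviation
re <- as.data.frame(ranef(fit, condVar = TRUE))
re$lower <- re$condval - 1.96 * re$condsd
re$upper <- re$condval + 1.96 * re$condsd
re <- re[order(re$condval), ]

head(re[, c("grp", "condval", "lower", "upper")])
```

The resulting `re` data frame can go straight into a dot-and-interval plot, with `grp` on one axis and `condval`, `lower`, and `upper` on the other. Terms with more usage should generally show narrower intervals.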
Today’s Gist takes the CNN transcript of the Denver Presidential Debate, converts paragraphs into a document-term matrix, and does the absolute most basic form of text analysis: a raw word count.
There are quite a few steps in this process, though the tm vignette makes them easier to follow. Before running the code, you would do well to update R, re-install the relevant packages, and make sure you have a recent version of Java installed on your computer: this code has lots of dependencies.
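The core of that process can be sketched in a few lines. This is an illustration rather than the Gist: the two "paragraphs" are invented stand-ins for the transcript, the cleaning steps are my own choice, and the code assumes a recent version of tm (one that provides `VCorpus()` and `content_transformer()`).

```r
library(tm)

# Two made-up paragraphs standing in for the debate transcript
paragraphs <- c(
  "We need to create jobs and grow the economy.",
  "The economy needs jobs, and jobs need small business."
)

# Build a corpus with one document per paragraph
corpus <- VCorpus(VectorSource(paragraphs))

# Basic cleaning: lower-case, strip punctuation, drop English stopwords
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Document-term matrix: rows are paragraphs, columns are terms
dtm <- DocumentTermMatrix(corpus)

# Raw word count across all paragraphs, most frequent first
term_counts <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(term_counts)
```

For a real transcript you would tag each paragraph with its speaker before counting, so that the totals can be split by candidate.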
Please keep in mind that this Gist is intended only to illustrate the basic functionality of the tm package. Text analysis is difficult to do well, and a term-frequency scatter plot does not qualify as “done well.” At least it’s not a Wordle (the mullet of the internet?).
The zoo package is designed for use with (potentially irregular) time series data. It is widely used for any number of applications, but among its most frequently useful functions are the roll* functions, such as rollmean, rollmedian, rollmax, rollapply, etc.
Today’s Gist shows you how to use these rolling functions to summarize time series data across a moving window. That is, you can calculate any function on a 5-day (or second, or year) basis, across the length of the entire vector. This is certainly something that could be done with a simple loop, but the roll* functions make it easy and fast.
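As a quick sketch of what that looks like (with a made-up ten-day series rather than the Gist's data), the example below computes a 5-day rolling mean two ways and then uses `rollapply()` for an arbitrary function. The dates and values are purely illustrative.

```r
library(zoo)

# Hypothetical daily series: ten observations on consecutive dates
dates <- as.Date("2012-10-01") + 0:9
x <- zoo(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), order.by = dates)

# 5-day rolling mean. By default the window is centered and incomplete
# windows are dropped, so the result is shorter than the input.
roll_mean <- rollmean(x, k = 5)

# Same statistic, but aligned to the last day of each window and
# padded with NA so the result has the same length as x.
roll_mean_right <- rollmean(x, k = 5, fill = NA, align = "right")

# rollapply() takes any function; here, the range within each 5-day window
roll_range <- rollapply(x, width = 5,
                        FUN = function(w) max(w) - min(w),
                        fill = NA, align = "right")

roll_mean
roll_mean_right
roll_range
```

The `align` and `fill` arguments are worth setting explicitly: a centered window uses future observations, which is usually not what you want when the rolling statistic feeds a forecast.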