Possibly slightly better text analysis with lme4


lme4 and its cousin arm are extremely useful for a huge variety of modeling applications (see Gelman and Hill’s book), but today we’re going to do something a little frivolous with them. Namely, we’re going to extend our Denver Debate analysis to include some sense of error.

Instead of the term-frequency scatter plot seen in the previous post, this code fits the most basic possible partially pooled model predicting which of the two candidates, Obama or Romney, spoke a given term. This gives us a slightly better idea of which candidate “owned” a term on the night, and it accounts for volume of usage: terms that were used more often get narrower confidence intervals.

Anyway, we will almost certainly return to lmer() at some point in the future, but this code offers some ideas as to how best to translate a model object into a data frame amenable to plotting.
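To give a rough idea of that model-to-data-frame step, here is a minimal sketch on simulated data rather than the debate transcript. The column names (`term`, `obama`), the simulated counts, and the 1.96 multiplier for approximate 95% intervals are all assumptions made for this example, not the Gist's actual code.

```r
library(lme4)

set.seed(2012)

# Simulated stand-in for the debate data: one row per utterance of a term,
# with obama = 1 if Obama said it and 0 if Romney did (hypothetical columns).
n_terms <- 20
uses    <- sample(5:60, n_terms, replace = TRUE)  # how often each term occurs
p_obama <- plogis(rnorm(n_terms, 0, 1))           # each term's true "ownership"
words <- data.frame(
  term  = factor(rep(paste0("term", seq_len(n_terms)), times = uses)),
  obama = rbinom(sum(uses), 1, rep(p_obama, times = uses))
)

# Most basic partially pooled model: a varying intercept for each term.
fit <- glmer(obama ~ 1 + (1 | term), data = words, family = binomial)

# Translate the model object into a plotting-friendly data frame:
# one row per term, with the conditional mode and its standard deviation.
re <- as.data.frame(ranef(fit, condVar = TRUE))
term_est <- data.frame(
  term     = re$grp,
  estimate = re$condval,
  lower    = re$condval - 1.96 * re$condsd,
  upper    = re$condval + 1.96 * re$condsd,
  n        = as.vector(table(words$term)[as.character(re$grp)])
)
term_est <- term_est[order(term_est$estimate), ]
```

A data frame like `term_est` can go straight into ggplot2, for example with geom_pointrange(). Terms with a larger `n` should generally show narrower intervals.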


Text analysis made too easy with the tm package

Today’s Gist takes the CNN transcript of the Denver Presidential Debate, converts paragraphs into a document-term matrix, and does the absolute most basic form of text analysis: a raw word count.

There are actually quite a few steps in this process, though the tm vignette makes them easier to follow. This code has lots of dependencies, so you would do well to update R, re-install the relevant packages, and make sure you have a recent version of Java installed on your computer.
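For a feel of the core step, here is a minimal sketch of going from paragraphs to a document-term matrix to raw word counts. The two "paragraphs" are made up for illustration, and the control options shown are one plausible choice, not necessarily the ones the Gist uses.

```r
library(tm)

# Two made-up "paragraphs" standing in for the debate transcript.
paragraphs <- c("Jobs and taxes and jobs.",
                "Taxes, the economy, and jobs.")

# Each paragraph becomes one document in the corpus.
corpus <- VCorpus(VectorSource(paragraphs))

# Rows are documents, columns are terms; drop punctuation and stopwords.
dtm <- DocumentTermMatrix(
  corpus,
  control = list(removePunctuation = TRUE, stopwords = TRUE)
)

# Raw word count: sum each term's column across all documents.
term_counts <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
```

Converting to a dense matrix with as.matrix() is fine for a single transcript, but it would not scale well to a large corpus.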

Please keep in mind that this Gist is intended only to illustrate the basic functionality of the tm package. Text analysis is difficult to do well, and a term frequency scatter plot does not qualify as “done well.” At least it’s not a Wordle (the mullet of the internet?)

Everything is a Network, featuring the sna package

We’ve gotten some requests, through the Ask us anything page, to do some plotting of networks. We may come back to this later, but today’s Gist shows how you can plot pretty much literally anything as a network.

First, we go back to our well-worn folder of flag PNGs from GoSquared, and load data for each pixel of each flag. Then, we binarize the dissimilarity matrix of these flags, with a cutoff chosen to ensure that the entire graph is a single connected component (this is done just for the purposes of this example; in Real Life, you are likely to have an actual network you want to plot).

Then, we plot the network conventionally, using gplot() from sna, but save the vertex coordinates. Finally, we replot the graph edges but overplot the vertices with the flag rasters that we have come to know and love.
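Here is a minimal sketch of the binarize-then-plot idea, using random points instead of flag pixels. The data, the loop for choosing the cutoff, and the use of text() in place of flag rasters are assumptions for illustration only.

```r
library(sna)

set.seed(1)

# Ten made-up items with four features each, standing in for the flags.
items <- matrix(rnorm(40), ncol = 4)
d <- as.matrix(dist(items))

# Try cutoffs from smallest to largest and stop at the first one
# that leaves the whole graph as a single connected component.
for (cutoff in sort(unique(d[upper.tri(d)]))) {
  adj <- (d <= cutoff) * 1
  diag(adj) <- 0
  if (is.connected(adj)) break
}

# gplot() draws the graph and invisibly returns the vertex coordinates.
coords <- gplot(adj, gmode = "graph", jitter = FALSE)

# Overplot something at the saved coordinates (labels here, rasters in the Gist).
text(coords[, 1], coords[, 2], labels = seq_len(nrow(coords)), col = "white")
```

With real images, the same `coords` matrix could be used to position each raster with rasterImage().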

Fun “fact”: the flag of the Seychelles has the highest eigenvector centrality, while the flag of the Vatican City has the lowest!

By d-sparks

Tags: rstats png sna AdventCalendaR

Fuzzy clustering with fanny()

This is kind of a fun example, and you might find the fuzzy clustering technique useful, as I have, for exploratory data analysis. In this Gist, I use the unparalleled breakfast dataset from the smacof package, derive dissimilarities from breakfast item preference correlations, and use those dissimilarities to cluster foods.

Fuzzy clustering with fanny() is different from k-means and hierarchical clustering, in that it returns probabilities of membership for each observation in each cluster. Here, I ask for three clusters, so I can represent probabilities in RGB color space, and plot text in boxes with the help of this StackOverflow answer.

The colors and the MDS configuration highlight the three primary clusterings of breakfast items into what we’ll call a muffin group, a bread group, and a sweet group. Of course, cluster identification is a subjective exercise, made even more so by use of probabilistic membership, but I’m pretty happy with this breakfast analysis.
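The membership-to-color trick can be sketched as follows. To keep it self-contained, this uses correlations between the variables in the built-in mtcars data as a stand-in for the breakfast preferences, so the clusters themselves mean nothing here.

```r
library(cluster)

# Dissimilarities from correlations, here between mtcars variables
# (a stand-in for the breakfast preference data).
d <- as.dist(1 - cor(mtcars))

# Fuzzy clustering: every item gets a membership in each of the 3 clusters.
fit <- fanny(d, k = 3)
m   <- fit$membership   # one row per item, one column per cluster

# Treat the three memberships as red, green and blue intensities.
item_cols <- rgb(m[, 1], m[, 2], m[, 3])

# Classical MDS configuration for positioning the labels.
xy <- cmdscale(d, k = 2)
plot(xy, type = "n", xlab = "", ylab = "")
text(xy, labels = rownames(xy), col = item_cols)
```

If every label comes out the same muddy grey, the memberships are all close to 1/3. Lowering fanny()'s `memb.exp` argument makes the clustering crisper.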

Multidimensional metric unfolding with SMACOF

SMACOF stands for “Scaling by MAjorizing a COmplicated Function,” and it is a multidimensional scaling algorithm for metric unfolding of, among other things, rectangular ratings matrices.

One neat Political Science application of MDS is inferring ideology from survey thermometer ratings. The 2008 ANES featured 43 different thermometer stimuli, and today’s Gist shows how to use SMACOF to simultaneously scale survey respondents and thermometer stimuli in the same space, and to compare this measure of inferred ideology across partisans.
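As a rough sketch of the scaling step, the code below unfolds simulated thermometer ratings rather than the ANES data. The simulated respondents and stimuli are made up, and flipping ratings into dissimilarities with `100 - rating` is an assumption about preprocessing. Recent versions of smacof call the function unfolding(); older versions call it smacofRect().

```r
library(smacof)

set.seed(2008)

# Simulated thermometer ratings (0-100): 100 respondents by 6 stimuli,
# generated from made-up one-dimensional positions plus noise.
resp_pos <- rnorm(100)
stim_pos <- seq(-2, 2, length.out = 6)
ratings  <- 100 - 20 * abs(outer(resp_pos, stim_pos, "-")) + rnorm(600, 0, 5)
ratings  <- pmin(pmax(ratings, 0), 100)
dimnames(ratings) <- list(paste0("r", 1:100), paste0("stimulus", 1:6))

# Unfolding expects dissimilarities, so flip: a warm rating is a short distance.
delta <- 100 - ratings

# Scale respondents (rows) and stimuli (columns) into the same 2-D space.
fit <- unfolding(delta, ndim = 2)
respondents <- fit$conf.row
stimuli     <- fit$conf.col
```

Because `respondents` and `stimuli` share one space, respondent positions can then be compared across groups such as partisans.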

I’ve also got a little piece of code that replaces numeric axis labels with names of the stimuli, which I think might be better, as the numbers don’t really mean much except in comparison with the stimuli. Let me know what you think!
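The axis-label swap can be sketched like this, with made-up stimulus names and positions. It simply passes the stimulus positions as breaks and their names as labels, which may or may not match what the Gist does.

```r
library(ggplot2)

set.seed(43)

# Hypothetical scaled positions: three stimuli and 200 respondents.
stimuli <- data.frame(name = c("Stimulus A", "Stimulus B", "Stimulus C"),
                      x    = c(-1.0, 0.2, 1.1))
respondents <- data.frame(x = rnorm(200))

# Put axis ticks at the stimulus positions and label them by name,
# so the axis reads in terms of stimuli instead of arbitrary numbers.
p <- ggplot(respondents, aes(x = x)) +
  geom_density() +
  scale_x_continuous(breaks = stimuli$x, labels = stimuli$name)
```

Rotating the labels with theme(axis.text.x = element_text(angle = 90)) helps when many stimuli sit close together.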

By d-sparks

Tags: ggplot2 smacof rstats AdventCalendaR

US State Maps using map_data()


Today’s short post will show how to make a simple map using map_data().

Let’s assume you have data in a CSV file that may look like this:


Notice the lower-case state names; they will make merging the data much easier. The variable of interest we’re going to plot is the relative incarceration rate by race (whites and blacks) in each of the fifty states (we’ll remove DC once we load the data). Using the map_data("state") command, we can load a data.frame called “all_states”, shown below:


Merging that data with the data frame we have as a CSV produces:


We can then plot each state and shade it by our variable of interest:
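The load, merge, and plot steps can be sketched roughly as follows. The `ratio` values are random numbers standing in for the CSV, and the column name is hypothetical. Note that map_data("state") needs the maps package installed and covers only the 48 contiguous states plus DC.

```r
library(ggplot2)  # map_data() also needs the maps package installed

set.seed(50)

# Polygon outlines for the states, with lower-case names in `region`.
all_states <- map_data("state")

# Made-up values standing in for the CSV; `ratio` is a hypothetical column.
rates <- data.frame(region = unique(all_states$region))
rates$ratio <- runif(nrow(rates), min = 1, max = 10)
rates <- rates[rates$region != "district of columbia", ]

# Merge on the lower-case state name, then restore the drawing order,
# since merge() can shuffle rows and scramble the polygons.
merged <- merge(all_states, rates, by = "region")
merged <- merged[order(merged$order), ]

# One polygon per group, shaded by the variable of interest.
p <- ggplot(merged, aes(x = long, y = lat, group = group, fill = ratio)) +
  geom_polygon(colour = "white") +
  coord_fixed(1.3)
```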


Full code is below:

By use-r-friendly

Tags: graphics rstats ggplot2 AdventCalendaR

Anonymous said: Can you please post the R code for making that beautiful Advent CalendarR? Pretty please. I've been trying to get it right but no luck :(

I’m glad you like it! The code is a simple loop, drawing 24 open circles, some filled circles, and plotting numbers inside of those. Stripping out all of the axes and labels leaves us with a white field full of dots.
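A minimal base-graphics sketch of that idea is below. The 6 x 4 grid layout, the `today` cutoff for filled circles, and the sizes are guesses for illustration, not the original code.

```r
days  <- 1:24
today <- 10   # days up to this one get a filled circle (made-up value)

# Assumed layout: a 6 x 4 grid, filled row by row from the top.
pos <- data.frame(day = days,
                  x   = (days - 1) %% 6 + 1,
                  y   = 4 - (days - 1) %/% 6)

# An empty plotting region with no axes or labels: just a white field.
plot.new()
plot.window(xlim = c(0.5, 6.5), ylim = c(0.5, 4.5), asp = 1)

# The simple loop: draw a circle for each day, then the number inside it.
for (i in days) {
  filled <- i <= today
  points(pos$x[i], pos$y[i], pch = if (filled) 19 else 1, cex = 5)
  text(pos$x[i], pos$y[i], labels = i, col = if (filled) "white" else "black")
}
```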

By d-sparks

Tags: AdventCalendaR

"Economics-style" graphs with bezier() from Hmisc

So, I really think this one is pretty cool. We spend much of our time in R making graphs with data, but what if you have a theory that you’d like to express graphically? Something like what I’ll call “economics-style” graphs, illustrating, for example, the Solow growth model, a production–possibility frontier, or an indifference curve?

Well, rest assured that R can produce those, too, and it’s made simple by the bezier() function from Hmisc (Hmisc does a lot of other interesting things, but this is what you got in today’s Advent CalendaR slot). Bézier curves are a workhorse of vector graphics, and if you’re not familiar with them, I encourage you to become so, with this beautiful interactive demo and with this more detailed interactive demo.

The Gist shows you how to use Bézier curves to replicate Wikipedia’s Supply-and-Demand graph, and is pretty heavily commented, but I’ll add a few notes:

  • Generating a Bézier curve with pre-specified x and y vectors takes some trial-and-error. Fortunately, it is usually a fun puzzle and it’s very quick to test. Just think of each point as “pulling” the curve toward itself.
  • The script defines a hacky little function called approxIntersection(), which takes the (x, y) vectors of two curves and returns their approximate intersection. This probably doesn’t work well in a lot of cases, and I would be interested in hearing of anyone’s less hacky solutions.
  • Earlier drafts of this code required a bit of ggplot2 theme-wrangling, but with the release of ggplot2 0.9.3, theme_classic() now produces the exact look I was going for.
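To make the first two notes concrete, here is a minimal sketch with made-up control points and base graphics. The helper `nearest_crossing()` is a hypothetical stand-in, not the Gist's approxIntersection(). It takes the closest pair of evaluated points as the crossing, which is just as hacky and only works when the curves really do cross once.

```r
library(Hmisc)

# Made-up control points; the middle point "pulls" each curve toward itself.
supply <- bezier(x = c(1, 8, 9), y = c(1, 2, 9), evaluation = 500)
demand <- bezier(x = c(1, 3, 9), y = c(9, 3, 1), evaluation = 500)

# Hypothetical helper: find the closest pair of evaluated points on the
# two curves and report the midpoint between them as the crossing.
nearest_crossing <- function(a, b) {
  d2  <- outer(a$x, b$x, "-")^2 + outer(a$y, b$y, "-")^2
  idx <- which(d2 == min(d2), arr.ind = TRUE)[1, ]
  list(x   = mean(c(a$x[idx[1]], b$x[idx[2]])),
       y   = mean(c(a$y[idx[1]], b$y[idx[2]])),
       gap = sqrt(min(d2)))   # how far apart the two closest points are
}

eq <- nearest_crossing(supply, demand)

plot(supply$x, supply$y, type = "l", xlab = "Quantity", ylab = "Price")
lines(demand$x, demand$y)
points(eq$x, eq$y, pch = 19)
```

The `gap` value is a quick sanity check: if it is not small relative to the spacing of the evaluated points, the curves probably do not cross at all.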

Handling missing data with Amelia

So, what if you have data, but some of the observations are missing? Many statistical techniques assume no missingness, so we might want to “fill in” or rectangularize our data, by replacing missing observations with plausible substitutes. There are many ways of going about this, but one of the most robust and accessible is through the Amelia package.

Today’s Gist applies multiple imputation to some sample ANES survey data, and compares listwise-deleted regression results to results pooled from the same regression run on ten imputed data sets. Amelia makes this imputation, modeling, and recombination straightforward, and I’ve thrown in a nice coefficient plot (using position_dodge!) to illustrate the differences between missing data approaches.
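The impute, model, and recombine workflow can be sketched as follows, on simulated data rather than the ANES sample. The variables and the amount of missingness are made up. mi.meld() is Amelia's helper for combining estimates and standard errors across imputed data sets.

```r
library(Amelia)

set.seed(42)

# Simulated data with some values knocked out (a stand-in for survey data).
n  <- 300
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
dat <- data.frame(y, x1, x2)
dat$x1[sample(n, 60)] <- NA
dat$x2[sample(n, 60)] <- NA

# Listwise deletion: lm() silently drops every row with a missing value.
listwise <- lm(y ~ x1 + x2, data = dat)

# Ten imputed data sets (p2s = 0 suppresses the progress output).
imp <- amelia(dat, m = 10, p2s = 0)

# Run the same regression on each imputed data set.
fits <- lapply(imp$imputations, function(d) lm(y ~ x1 + x2, data = d))

# Stack coefficients and standard errors (one row per imputation), then pool.
q  <- t(sapply(fits, coef))
se <- t(sapply(fits, function(f) sqrt(diag(vcov(f)))))
pooled <- mi.meld(q = q, se = se)
```

`pooled$q.mi` and `pooled$se.mi` hold the combined estimates and standard errors, which can be set beside `coef(listwise)` for a coefficient plot.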

By d-sparks

Tags: rstats Amelia ggplot2 AdventCalendaR

Evaluating term popularity with twitteR

I really wanted to put something together for this series on the twitteR package. Unfortunately, at the moment the number of interesting things that can be done with twitteR, as opposed to through API calls and RCurl, is limited. Regardless, I have Yet Another Invented Application to illustrate a pretty typical use-case for twitteR: grabbing Tweets by search term.

I’ve done this before, for sentiment analysis of Tweets about Republican presidential primary candidates, and indeed, despite its limitations, the searchTwitter() function can be useful. Since the number of Tweets one can grab appears to be limited to 1000, this Gist attempts to infer term popularity by frequency — with only minor success, as you can see in the plot below.
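One way to read "popularity by frequency" when every search is capped at the same number of Tweets is to compare how quickly each term fills that cap. The sketch below uses made-up timestamps instead of live searchTwitter() results, and the helper `tweets_per_hour()` is hypothetical, so treat it as an illustration of the idea, not the Gist's method.

```r
# Hypothetical helper: Tweets per hour over the time span the sample covers.
tweets_per_hour <- function(created) {
  span_hours <- as.numeric(difftime(max(created), min(created), units = "hours"))
  length(created) / span_hours
}

# Made-up timestamps for two capped samples of 1000 Tweets each.
now     <- as.POSIXct("2012-12-20 12:00:00", tz = "UTC")
popular <- now - seq(0, by = 36,   length.out = 1000)  # one Tweet every 36 seconds
niche   <- now - seq(0, by = 3600, length.out = 1000)  # one Tweet every hour

rates <- c(popular = tweets_per_hour(popular),
           niche   = tweets_per_hour(niche))
```

A popular term fills the cap within hours, while a niche term may need weeks, so the rate separates them even though both samples contain the same number of Tweets.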