Latent Class Analysis with poLCA

On an airplane the other day, I learned of a method called latent class (transition) analysis, and it sounded like an interesting thing to try in R. Of course, as with everything R, There is a Package for That, called poLCA, written by none other than Drew Linzer (of Votamatic fame) and Jeffrey Lewis.

I wasn’t able to think of a good application for transition analysis specifically, but I did use Christopher’s ANES data to estimate latent “types” of respondents. The example model illustrates a four-class model, and I’ll leave it as an exercise for the interested reader to assign subjective names to each class.

This Gist also attempts to improve on the default plot, both by eschewing the 3-D effect and by putting classes, rather than variables, in direct comparison with one another. Also, for what it’s worth, the plot code shows how to draw a bar plot when you have already computed counts or proportions: use stat = "identity".
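For readers without the Gist to hand, here is a minimal sketch of the idea, not the original code. It assumes the election data that ships with poLCA as a stand-in for the ANES data, fits a four-class model to six candidate ratings, and stacks the estimated response probabilities into one data frame so that classes sit side by side within each item. The choice of items, the nrep value and the plot layout are all assumptions made for illustration.

```r
library(poLCA)
library(ggplot2)

data(election)

# Six ratings of one candidate; each item has four response categories.
f <- cbind(MORALG, CARESG, KNOWG, LEADG, DISHONG, INTELG) ~ 1

set.seed(1)
fit <- poLCA(f, data = election, nclass = 4, nrep = 3, verbose = FALSE)

# fit$probs is a list with one (class x response) matrix per item.
# Stack them into one long data frame of precomputed proportions.
probs <- do.call(rbind, lapply(names(fit$probs), function(item) {
  m <- fit$probs[[item]]
  data.frame(
    item = item,
    class = factor(rep(seq_len(nrow(m)), times = ncol(m))),
    response = factor(rep(seq_len(ncol(m)), each = nrow(m))),
    prob = as.vector(m)
  )
}))

# The proportions are already computed, so tell geom_bar not to count rows.
p <- ggplot(probs, aes(x = class, y = prob, fill = response)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item)
print(p)
```

Each stacked bar sums to one, so within a panel you can compare how the four classes answer the same item.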

Thanks for celebrating Advent with us, and for your feedback and support. We’re taking a little break after tomorrow’s post, but we’ll be back better than ever next year!

Measuring the Gerrymander with spatstat

Well, to be specific, I mean measuring district compactness (a very interesting subject; see these three articles for starters). There are myriad ways of measuring the “oddness” of a shape, including comparing the area of the district to that of its circumcircle, the moment of inertia of the shape, the probability that a straight line connecting two random points in the district stays inside the polygon, etc.

In today’s Gist, I use the spatstat package to convert Congressional district shapefiles to owin objects. The conversion can be very persnickety, so for our present purposes I have simply skipped districts with overlapping polygons or other owin conversion obstacles. However, spatstat lets us do neat things with owin objects, including the calculation of the area and perimeter of polygons, which I use to compute and then plot a simple Area / Perimeter ratio measure of district compactness.
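To make the ratio concrete, here is a small sketch on two hand-built polygons rather than real district shapefiles. The compactness() helper and both toy shapes are made up for illustration; owin(), area.owin() and perimeter() are spatstat functions.

```r
library(spatstat)

# Hypothetical helper: area-to-perimeter ratio of an owin polygon.
compactness <- function(w) area.owin(w) / perimeter(w)

# Two toy "districts" with the same area (4 square units).
# Polygon vertices are listed anticlockwise, without repeating the first.
square <- owin(poly = list(x = c(0, 2, 2, 0), y = c(0, 0, 2, 2)))
sliver <- owin(poly = list(x = c(0, 8, 8, 0), y = c(0, 0, 0.5, 0.5)))

compactness(square)  # 4 / 8 = 0.5
compactness(sliver)  # 4 / 17, about 0.235
```

The long, thin shape scores lower even though it covers the same area. Note that the ratio has units of length, so it only compares cleanly across shapes measured in the same units.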

As you can see in the guilty-pleasure Spectral palette choropleth below (click it for a larger view), the least compact districts are unsurprisingly typically found in high-population-density areas. Also, you can use this map to find your way from Greensboro to Charlotte, via I-85.

The definitive guide to plotting confidence intervals in R


Here at is.R(), we have produced countless posts that feature plots with confidence intervals, but apparently none of those are easy to find with Google. So, today, for the purposes of SEO, we’ve put “plotting confidence intervals” in the title of our post.

We also cannot resist an earnest plea from our Political Science colleagues, who managed to find our Ask us anything page, and whom we would hate to disappoint. It is worth mentioning that there are some alternatives to, and critiques of, this particular style of coefficient plot, and we may return to the subject at a later date.

But, for representing an arbitrary number of confidence intervals from an arbitrary number of models, this code should work:
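In case the embedded Gist does not load, the following is a rough sketch of the general approach rather than the original code. It assumes two lm() fits on the built-in mtcars data, a hypothetical tidy_ci() helper, and normal-approximation 95% intervals (estimate plus or minus 1.96 standard errors).

```r
library(ggplot2)

# Hypothetical helper: one row per coefficient, with an approximate 95% interval.
tidy_ci <- function(model, label, z = 1.96) {
  est <- summary(model)$coefficients
  data.frame(
    model = label,
    term = rownames(est),
    estimate = est[, "Estimate"],
    lower = est[, "Estimate"] - z * est[, "Std. Error"],
    upper = est[, "Estimate"] + z * est[, "Std. Error"],
    row.names = NULL
  )
}

# Any number of models can go in this list.
models <- list(
  small = lm(mpg ~ wt, data = mtcars),
  large = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

ci <- do.call(rbind, Map(tidy_ci, models, names(models)))
ci <- ci[ci$term != "(Intercept)", ]

# Dodge the intervals so models sharing a term sit side by side.
p <- ggplot(ci, aes(x = term, y = estimate, ymin = lower, ymax = upper,
                    colour = model)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange(position = position_dodge(width = 0.5)) +
  coord_flip()
print(p)
```

Adding a third model is just a matter of adding another entry to the models list.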

Beautiful network diagrams with ggplot2


I don’t usually like describing my own work as “beautiful,” but with your permission I will make an exception today. There have been some requests for scripts illustrating the plotting of network diagrams with ggplot2, and today (for the winter solstice) we’re bringing you a really nice-looking way of doing just that.

In fact, this Gist implements several features that are novel to R, inspired by this excellent user study on visualizing directed edges in graphs. The code is written to allow the use of “tapered-intensity-curved” edges between nodes (see Figure 10 of the linked Holten and van Wijk paper), which were found to be significantly better than the standard arrow representation in a simple graph interpretation task.

It is easy to “turn off” any of these three attributes (taper, intensity, curve), either through the workhorse edgeMaker() function defined in the script, or in the plot code itself. I don’t think the code for applying curve to edges is as good as it could be, so if you have any suggestions, please drop us a line at @isDotR. Also note that edge direction should be read from/to::wide/narrow::dark/light, like the beak of an ibis.
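To give a flavor of the taper and intensity ideas without the full script, here is a stripped-down sketch. It is not the Gist's edgeMaker(): the interpolate_edge() helper, the toy graph and the circular layout are all made up for illustration, the edges are straight (no curve), and the linewidth aesthetic assumes ggplot2 3.4 or later.

```r
library(ggplot2)

# Toy directed graph: node positions on a circle, plus an edge list.
n <- 6
nodes <- data.frame(
  id = seq_len(n),
  x = cos(2 * pi * seq_len(n) / n),
  y = sin(2 * pi * seq_len(n) / n)
)
edges <- data.frame(from = c(1, 1, 2, 3, 4, 5), to = c(2, 4, 5, 6, 6, 3))

# Hypothetical helper: interpolate points along a straight edge;
# `pos` runs from 0 at the source node to 1 at the target node.
interpolate_edge <- function(i, steps = 50) {
  a <- nodes[edges$from[i], ]
  b <- nodes[edges$to[i], ]
  pos <- seq(0, 1, length.out = steps)
  data.frame(
    edge = i,
    pos = pos,
    x = a$x + pos * (b$x - a$x),
    y = a$y + pos * (b$y - a$y)
  )
}
edge_points <- do.call(rbind, lapply(seq_len(nrow(edges)), interpolate_edge))

# Wide and dark at the source, narrow and light at the target.
p <- ggplot(edge_points, aes(x = x, y = y)) +
  geom_path(aes(group = edge, linewidth = 1 - pos, alpha = 1 - pos),
            lineend = "round") +
  geom_point(data = nodes, size = 6, colour = "steelblue") +
  geom_text(data = nodes, aes(label = id), colour = "white") +
  scale_linewidth(range = c(0.2, 3), guide = "none") +
  scale_alpha(range = c(0.15, 1), guide = "none") +
  coord_equal() +
  theme_void()
print(p)
```

Dropping linewidth or alpha from the aes() call turns off taper or intensity, respectively.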

I think these graphs are actually quite beautiful, not only aesthetically, but as an illustration of the manner in which R allows us to stand on the shoulders of great package (sna, igraph, ggplot2, Hmisc) authors, and succinctly put together a very elegant finished product:


Geocoding location data with dismo

Today’s Gist could actually end up being very useful to a number of you. It’s something of a contrived example, but it illustrates in very simple code how to do three interesting things:

  1. Gather Tweets by search term (which we’ve done before), and look up user info for each of the users returned by that search.
  2. Convert textual user location data to approximate latitude & longitude coordinates with the Google geocoding web-service, using a single function, geocode(), from the dismo package. This is a revelation to me, and though there appears to be a daily rate limit, I can imagine so many applications for which this would be useful.
  3. Very easily plot a world map (albeit with a lame projection), and superimpose points indicating the inferred location of #rstats-Tweeting users.

And all in just 29 (+/-) lines. Truly, truly, we are living in a great era for statistical computing.
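The third step can be sketched on its own, without any web service. The block below is an illustration under stated assumptions, not the Gist: the users data frame is a hand-typed stand-in for geocoded output (rounded, well-known city coordinates), and the column names are hypothetical. It uses ggplot2's borders(), which needs the maps package installed.

```r
library(ggplot2)  # borders() also needs the maps package installed

# Stand-in for geocoded output: free-text locations with coordinates filled in
# by hand (rounded city coordinates), not returned by any web service.
users <- data.frame(
  location = c("London", "New York, NY", "Sydney", "Tokyo"),
  lon = c(-0.13, -74.01, 151.21, 139.69),
  lat = c(51.51, 40.71, -33.87, 35.68)
)

# World outline first, then one point per inferred user location.
p <- ggplot(users, aes(x = lon, y = lat)) +
  borders("world", colour = "grey70", fill = "grey95") +
  geom_point(colour = "red", size = 3) +
  coord_quickmap()
print(p)
```

With real data, the lon and lat columns would come from the geocoding step instead of being typed in.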

The Inverse Herfindahl–Hirschman Index as an “Effective Number of” Parties

I learned of the passing of Albert Hirschman on December 11, and while better and more instructive tributes to his life can be read elsewhere, I wanted to focus on a little piece of Hirschman’s work that I use all the time: the (inverse) Herfindahl–Hirschman Index.

The HHI is basically a measure of market concentration, but when inverted, it is an “effective number of” whatever grouping you might be interested in, such as parties. Essentially, this statistic can be interpreted as, “Individuals are distributed across groups in such a way that they are as concentrated as they would be if divided across [inverse HHI value] groups evenly.”

This is perhaps best understood by example, and fortunately, my field of American Politics offers an interesting one. The U.S. South, between Reconstruction and the Civil Rights Act, was commonly known as the “one-party South,” due to the overwhelming dominance of the Democratic Party in Southern Politics. We can see evidence of this dominance by calculating the Effective Number of Parties-in-the-Electorate, using the HHI.

As the graph below illustrates, non-Southern states have consistently featured just over two “effective” parties (Democrats, Republicans, and some Independents/Others), while the South lagged behind in this measure up until the 1980s.

The inverse HHI is an elegant little function (the square of the sum over the sum of the squares), and plyr makes it very easy to calculate for any dataset.
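As a quick illustration of that formula, here is a minimal sketch. The inverse_hhi() name and the toy counts are invented for the example, and base R's tapply() stands in for plyr to keep the block dependency-free.

```r
# Inverse Herfindahl-Hirschman Index: (sum of counts)^2 / sum of squared counts.
inverse_hhi <- function(counts) sum(counts)^2 / sum(counts^2)

inverse_hhi(c(50, 50))      # two equal groups: 2
inverse_hhi(c(100, 0))      # one group holds everything: 1
inverse_hhi(c(45, 45, 10))  # two big groups plus a small one: about 2.41

# Toy counts (invented for illustration) by region and party.
votes <- data.frame(
  region = rep(c("A", "B"), each = 3),
  party = rep(c("Dem", "Rep", "Other"), times = 2),
  n = c(90, 8, 2, 48, 44, 8)
)

# One effective-number-of-parties value per region.
enp <- tapply(votes$n, votes$region, inverse_hhi)
enp
```

Region A, where one party dominates, comes out close to one effective party, while the more evenly split region B comes out above two.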

By d-sparks

Tags: rstats plyr ggplot2 AdventCalendaR

Possibly slightly better text analysis with lme4


lme4 and its cousin arm are extremely useful for a huge variety of modeling applications (see Gelman and Hill’s book), but today we’re going to do something a little frivolous with them. Namely, we’re going to extend our Denver Debate analysis to include some sense of error.

Instead of the term-frequency scatter plot seen in the previous post, this code fits the most basic possible partially-pooled model predicting which of the two candidates, Obama or Romney, spoke a given term. This allows us to get a slightly better idea of which candidate “owned” a term on the night, and simultaneously accounts for volume of usage (evidenced by narrower confidence intervals).

Anyway, we will almost certainly return to lmer() at some point in the future, but this code offers some ideas as to how best to translate a model object into a data frame amenable to plotting.
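One way to do that translation is sketched below. This is not the original Gist: it assumes simulated data in place of the debate transcript (the term names, usage counts and speaker_a indicator are all invented), a logistic glmer() with a varying intercept per term, and a reasonably recent lme4 that provides as.data.frame() for ranef() output.

```r
library(lme4)

# Simulated stand-in for the debate data: one row per use of a term,
# with a 0/1 indicator for which of two speakers said it.
set.seed(42)
terms <- paste0("term", 1:20)
uses <- sample(5:80, length(terms), replace = TRUE)  # how often each term occurs
lean <- rnorm(length(terms), sd = 1.5)               # log-odds of speaker A
d <- data.frame(term = rep(terms, times = uses))
d$speaker_a <- rbinom(nrow(d), 1, plogis(rep(lean, times = uses)))

# Most basic partially pooled model: a varying intercept per term.
fit <- glmer(speaker_a ~ 1 + (1 | term), data = d, family = binomial)

# Model object -> plotting data frame (columns grpvar, term, grp, condval, condsd).
re <- as.data.frame(ranef(fit, condVar = TRUE))
re$lower <- re$condval - 1.96 * re$condsd
re$upper <- re$condval + 1.96 * re$condsd
re$n_uses <- uses[match(as.character(re$grp), terms)]
re <- re[order(re$condval), ]
head(re)
```

From here, re can go straight into a point-and-interval plot, with grp on one axis and condval, lower and upper on the other.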


Text analysis made too easy with the tm package

Today’s Gist takes the CNN transcript of the Denver Presidential Debate, converts paragraphs into a document-term matrix, and does the absolute most basic form of text analysis: a raw word count.
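The core of that pipeline can be sketched in a few lines. This is a toy version, not the Gist: the two "paragraphs" are invented rather than taken from the transcript, and the particular cleaning steps (lowercasing, stripping punctuation, dropping English stopwords) are assumptions about a reasonable minimal setup.

```r
library(tm)

# Toy "paragraphs" (invented, not from the transcript).
paragraphs <- c(
  "Jobs and taxes and jobs.",
  "Taxes on energy; energy jobs!"
)

# Build a corpus and apply basic cleaning steps.
corpus <- VCorpus(VectorSource(paragraphs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# One row per paragraph, one column per term.
dtm <- DocumentTermMatrix(corpus)

# Raw word count: sum each term's column across documents.
word_counts <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
word_counts
```

For a two-speaker comparison, the same column sums can be taken separately over each speaker's rows of the matrix.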

There are actually quite a few steps in this process, though the tm vignette makes them easier to follow. You would do well to update R, re-install the relevant packages, and make sure you have a recent version of Java installed on your computer: this code has lots of dependencies.

Please keep in mind that this Gist is intended only to illustrate the basic functionality of the tm package. Text analysis is difficult to do well, and a term frequency scatter plot does not qualify as “done well.” At least it’s not a Wordle (the mullet of the internet?).

Fuzzy clustering with fanny()

This is kind of a fun example, and you might find the fuzzy clustering technique useful, as I have, for exploratory data analysis. In this Gist, I use the unparalleled breakfast dataset from the smacof package, derive dissimilarities from breakfast item preference correlations, and use those dissimilarities to cluster foods.

Fuzzy clustering with fanny() is different from k-means and hierarchical clustering, in that it returns probabilities of membership for each observation in each cluster. Here, I ask for three clusters, so I can represent probabilities in RGB color space, and plot text in boxes with the help of this StackOverflow answer.
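The membership-to-color trick looks roughly like this. The sketch below is not the Gist: to stay within packages that ship with R, it treats the 11 variables of the built-in mtcars data as the "items" in place of the breakfast data, and the memb.exp value is an arbitrary choice.

```r
library(cluster)

# Stand-in data: treat the 11 mtcars variables as the "items" to cluster.
# Dissimilarity = 1 - correlation, so highly correlated items sit close together.
d <- as.dist(1 - cor(mtcars))

# memb.exp closer to 1 gives crisper memberships (the default is 2).
fit <- fanny(d, k = 3, memb.exp = 1.5)

# One row per item, one column per cluster; each row sums to 1.
round(fit$membership, 2)

# Read the three membership columns as red, green and blue intensities.
item_colours <- rgb(fit$membership[, 1], fit$membership[, 2], fit$membership[, 3])
names(item_colours) <- rownames(fit$membership)
item_colours
```

An item that belongs almost entirely to one cluster comes out close to pure red, green or blue, while an ambiguous item gets a muddier blend.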

The colors and the MDS configuration highlight the three primary clusterings of breakfast items into what we’ll call a muffin group, a bread group, and a sweet group. Of course, cluster identification is a subjective exercise, made even more so by use of probabilistic membership, but I’m pretty happy with this breakfast analysis.

Multidimensional metric unfolding with SMACOF

SMACOF stands for “Scaling by MAjorizing a COmplicated Function,” and it is a multidimensional scaling algorithm for metric unfolding of, among other things, rectangular ratings matrices.

One neat Political Science application of MDS is inferring ideology from survey thermometer ratings. The 2008 ANES featured 43 different thermometer stimuli, and today’s Gist shows how to use SMACOF to simultaneously scale survey respondents and thermometer stimuli in the same space, and to compare this measure of inferred ideology across partisans.
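As a rough sketch of the joint scaling, the block below uses the breakfast rankings bundled with smacof in place of the ANES thermometers. It assumes a smacof version that provides unfolding() (older releases call the same routine smacofRect()), and the data frame names are made up for illustration.

```r
library(smacof)

# breakfast: 42 respondents rank 15 items (1 = most preferred), so the
# ranks already behave like dissimilarities. Thermometer ratings run the
# other way, so they would need flipping first (for example, 100 - rating).
data(breakfast)

# Metric unfolding in two dimensions: rows and columns share one space.
fit <- unfolding(breakfast, ndim = 2)

respondents <- as.data.frame(fit$conf.row)  # one point per respondent
stimuli <- as.data.frame(fit$conf.col)      # one point per item
stimuli$label <- rownames(fit$conf.col)

head(respondents)
stimuli
```

Because both configurations live in the same space, a respondent's position can be read relative to the stimuli nearest to it.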

I’ve also got a little piece of code that replaces numeric axis labels with names of the stimuli, which I think might be better, as the numbers don’t really mean much except in comparison with the stimuli. Let me know what you think!

By d-sparks

Tags: ggplot2 smacof rstats AdventCalendaR