I learned of the passing of Albert Hirschman on December 11, and while better and more instructive tributes to his life can be read elsewhere, I wanted to focus on a little piece of Hirschman’s work that I use all the time: the (inverse) Herfindahl–Hirschman Index.
The HHI is basically a measure of market concentration, but when inverted, it is an “effective number of” whatever grouping you might be interested in, such as parties. Essentially, this statistic can be interpreted as, “Individuals are distributed across groups in such a way that they are as concentrated as they would be if divided evenly across [inverse HHI value] groups.”
This is perhaps best understood by example, and fortunately, my field of American Politics offers an interesting one. The U.S. South, between Reconstruction and the Civil Rights Act, was commonly known as the “one-party South,” due to the overwhelming dominance of the Democratic Party in Southern politics. We can see evidence of this dominance by calculating the Effective Number of Parties-in-the-Electorate, using the inverse HHI.
As the graph below illustrates, non-Southern states have consistently featured just over two “effective” parties (Democrats, Republicans, and some Independents/Others), while the South lagged behind in this measure up until the 1980s.
The inverse HHI is an elegant little function (the square of the sum of the group sizes, divided by the sum of their squares), and plyr makes it very easy to calculate for any dataset.
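To make the formula concrete, here is a minimal sketch in base R. The data are made-up counts, not the survey data behind the graph, and the names inv_hhi, votes, and enp are just for illustration.

```r
# Inverse Herfindahl-Hirschman Index: (sum of x)^2 / sum of x^2.
# x can be counts or shares; the result is the same either way.
inv_hhi <- function(x) {
  x <- x[!is.na(x)]
  sum(x)^2 / sum(x^2)
}

# Hypothetical party identification counts (illustration only, not real data).
votes <- data.frame(
  region = c("South", "South", "South", "Non-South", "Non-South", "Non-South"),
  party  = c("Dem", "Rep", "Other", "Dem", "Rep", "Other"),
  n      = c(80, 15, 5, 45, 45, 10)
)

# One effective number of parties per region.
# With plyr loaded, ddply(votes, "region", summarise, enp = inv_hhi(n))
# should give the same numbers as a data.frame.
enp <- tapply(votes$n, votes$region, inv_hhi)
print(enp)
```

Four equally sized groups give exactly 4, and a single group holding everyone gives exactly 1, which is what makes the “effective number” reading work.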
In a recent post, I illustrated how to add a background geom to your ggplot. While that code worked, and the plot looked fine, it was pointed out to me that I was missing an important aspect of plot layering with ggplot2. Namely, it is not, as I previously claimed, necessary to add extra NULL variables to the background data.frame.
Fortunately, I was put on the right path by the inimitable Hadley Wickham, who pointed out that There is, of course, a Function for That: mutate()
This Gist correctly builds a layered plot, shows how mutate() works, and plots DW-NOMINATE House ideology in two dimensions, by state, with an illustration of what I consider a very useful visualization technique: adding a reference distribution to each plot facet.
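This is not the Gist itself, just a minimal sketch of the two ideas, assuming plyr and ggplot2 are installed. The data are simulated, and the names members, reference, dim1_c, and extreme are hypothetical.

```r
library(plyr)
library(ggplot2)

set.seed(1)
# Simulated stand-in for two-dimensional ideal points (not real DW-NOMINATE data).
members <- data.frame(
  state = rep(c("A", "B", "C"), each = 50),
  dim1  = rnorm(150),
  dim2  = rnorm(150)
)

# mutate() adds columns one at a time, so a later column
# can refer to one created earlier in the same call.
members <- mutate(members,
  dim1_c  = dim1 - mean(dim1),
  extreme = abs(dim1_c) > 1
)

# Reference layer: all points, but without the faceting variable.
# A layer whose data lacks the facet variable is drawn in every panel.
reference <- members[, c("dim1", "dim2")]

p <- ggplot(members, aes(x = dim1, y = dim2)) +
  geom_point(data = reference, colour = "grey80") +  # full distribution, every facet
  geom_point(aes(colour = extreme)) +                # this state's members only
  facet_wrap(~ state)
```

Printing p should show each state's points drawn over the grey cloud of all members, so every facet carries its own reference distribution.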
I really enjoy using the DW-NOMINATE data for examples, as I do here. Sometimes it’s useful to indicate regions in the background of a plot — perhaps two-dimensional regions of interest, perhaps one-dimensional periods in time. It’s not always obvious how to combine data from two data.frames to form one plot in ggplot2, so here is another example.
The trick seems to be that the “second” data.frame needs to include all of the variables you are using from the “first” data.frame, in name at least. That is, if you are plotting variables called “x”, “y”, and “z” from the first data.frame, the second data.frame needs variables named “x”, “y”, and “z” too, even if you’re not plotting with them, and even if they are set to some arbitrary constant, as in df2$z <- 1.
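For a concrete picture, here is a minimal sketch of a background layer drawn from a second data.frame, assuming ggplot2 is installed. The data are simulated and the names series, periods, start, and end are hypothetical. The rectangle layer sets inherit.aes = FALSE, so it does not go looking for the main layer's x and y variables, which is one way to get by without placeholder columns.

```r
library(ggplot2)

set.seed(2)
# Simulated yearly series (not real data).
series <- data.frame(year = 1950:2000, value = cumsum(rnorm(51)))

# Hypothetical periods to shade in the background.
periods <- data.frame(start = c(1960, 1980), end = c(1970, 1990))

p_bg <- ggplot(series, aes(x = year, y = value)) +
  # Background first, so the line is drawn on top of it.
  geom_rect(data = periods,
            aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf),
            inherit.aes = FALSE, fill = "grey85") +
  geom_line()
```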
This Gist has a couple of things going on, so I’ll just list them:
- It downloads the entire history of U.S. House DW-NOMINATE scores from voteview.com.
- It evaluates aggregate statistics, with members grouped by party and congress, with easy weighted functions from Hmisc.
- It does the aggregating and data.frame conversion all in one very easy step, using plyr. I have always done this type of aggregation in other ways (by(), *apply(), etc.), but plyr sure made it easy.
- It makes a pretty nice plot of the distribution of first-dimension ideological ideal points, by party, over time.
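The aggregation step can be sketched roughly as follows, assuming plyr and Hmisc are installed. This is not the Gist: the data are simulated rather than downloaded from voteview.com, and the column names congress, party, dim1, and w (the weights) are hypothetical.

```r
library(plyr)
library(Hmisc)

set.seed(3)
# Simulated member-level scores (not the voteview.com file).
scores <- data.frame(
  congress = rep(c(100, 101), each = 40),
  party    = rep(c("D", "R"), times = 40),
  dim1     = rnorm(80),
  w        = runif(80, 0.5, 1.5)   # hypothetical weights
)

# Split by congress and party, compute weighted summaries,
# and get a data.frame back, all in one call.
# plyr::summarise is spelled out because Hmisc has its own summarize().
agg <- ddply(scores, c("congress", "party"), plyr::summarise,
  mean_dim1   = wtd.mean(dim1, weights = w),
  median_dim1 = as.numeric(wtd.quantile(dim1, weights = w, probs = 0.5))
)
print(agg)
```

The result has one row per congress-party pair, which is the shape ggplot2 wants for plotting a line or ribbon per party over time.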
A long title, but there are a couple of handy things in this Gist. The first, and more obscure, is the conversion of a data.frame of categorical variables into a matrix of dummy/binary/indicator variables, one for each category of each original variable.
It is non-obvious (to me, at least) how to best do this, so the solution comes from “Gavin Simpson” and “fabians” at Stack Overflow.
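One way to do the conversion in base R is sketched below. I am not claiming this is exactly the Stack Overflow answer. The data and the names df, full_contrasts, and dummies are made up for illustration.

```r
# Hypothetical categorical data (illustration only).
df <- data.frame(
  colour = factor(c("red", "blue", "red", "green")),
  size   = factor(c("S", "L", "L", "S"))
)

# Ask for full indicator coding rather than reference coding,
# so that every level of every factor gets its own column.
full_contrasts <- lapply(df, contrasts, contrasts = FALSE)

# "- 1" drops the intercept; the result is a plain 0/1 matrix.
dummies <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)
print(dummies)
```

Each row should then contain exactly one 1 per original variable, so the row sums equal the number of variables.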
The second part of this Gist shows how to construct a table of log odds ratios between each of these indicator variables, which may be a first step in the estimation of something like (but not exactly the same as) multiple correspondence analysis.
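A minimal base R sketch of such a table follows, using a simulated indicator matrix. The function name log_odds_ratios is hypothetical, and adding 0.5 to every cell is my assumption here (a common continuity correction that keeps empty cells from producing infinite values), not necessarily what the Gist does.

```r
set.seed(4)
# Simulated 0/1 indicator matrix with hypothetical column names.
X <- matrix(rbinom(200 * 3, 1, 0.4), ncol = 3,
            dimnames = list(NULL, c("a", "b", "c")))

log_odds_ratios <- function(X, add = 0.5) {
  n11 <- crossprod(X)          # both indicators are 1
  n00 <- crossprod(1 - X)      # both indicators are 0
  n10 <- crossprod(X, 1 - X)   # row indicator 1, column indicator 0
  n01 <- crossprod(1 - X, X)   # row indicator 0, column indicator 1
  # Log odds ratio for every pair of columns; 'add' is the continuity correction.
  log(((n11 + add) * (n00 + add)) / ((n10 + add) * (n01 + add)))
}

lor <- log_odds_ratios(X)
print(round(lor, 2))
```

The table is symmetric, and the diagonal (each indicator against itself) is not meaningful, so it is best ignored or set to NA before going further.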