Reading/Writing Stata (.dta) files with foreign

We often find ourselves collaborating with people who don’t use R, or who prefer to clean and manage their data in Stata. Luckily, the foreign package lets R read and write data in several other formats (SAS, SPSS, Stata, etc.). The documentation can be found on the package’s CRAN page.


In today’s gist, I’ll show the two most basic things you would probably want to do with a Stata (.dta) file: read it into R, and write a data frame from R into a new .dta file. The foreign package makes both very easy. In this code, the functions look for the .dta files in your current working directory, so please set the directory accordingly or specify the complete file path.


The first command you’ll want to use from within R is read.dta(), which loads a Stata dataset. Here I’m using a small subset of the 2010 CCES that I have saved as “stata.dta”.
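For anyone who wants to try the call without the CCES file, here is a minimal sketch of the round trip. The `toy` data frame and its columns are made up for illustration; the sketch writes it to a temporary .dta file first so that read.dta() has something to load.

```r
library(foreign)

# Made-up stand-in for the CCES subset used in the post
toy <- data.frame(
  id     = 1:4,
  gender = factor(c("Male", "Female", "Female", "Male")),
  age    = c(34, 51, 27, 62)
)

# Write to a temporary .dta file so there is something to read
dta_path <- tempfile(fileext = ".dta")
write.dta(toy, dta_path)

# read.dta() returns an ordinary data frame
STATA <- read.dta(dta_path)
str(STATA)
```

With your own data, the only line you need is the read.dta() call, pointed at your file.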

As you can see, six of the seven variables in the data are factors. While factors are sometimes useful, we can avoid some of the frustrations of working with them by using the “convert.factors=” option: when convert.factors is FALSE, R replaces each factor with the underlying numeric values stored in Stata. You can see those values in Stata with the “tab Var, nolab” command:
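A small sketch of the difference, again on made-up data (the `party` column is a placeholder, not a CCES variable). The same file is read twice, once with the default and once with convert.factors = FALSE:

```r
library(foreign)

# Made-up factor column written out as a value-labelled Stata variable
toy <- data.frame(
  party = factor(c("Democrat", "Republican", "Independent", "Democrat"))
)
dta_path <- tempfile(fileext = ".dta")
write.dta(toy, dta_path)

# Default: labelled values come back as factors
with_labels <- read.dta(dta_path)

# convert.factors = FALSE: keep the underlying numeric codes instead
with_codes <- read.dta(dta_path, convert.factors = FALSE)

class(with_labels$party)
class(with_codes$party)
```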

One other useful option within the read.dta() command is “convert.underscore”, which replaces the underscores in Stata variable names with periods:
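Here is a quick sketch of that option. The variable names `birth_year` and `vote_2008` are invented for the example:

```r
library(foreign)

# Made-up data with Stata-style underscore names
toy <- data.frame(
  birth_year = c(1975, 1988, 1960),
  vote_2008  = c(1, 0, 1)
)
dta_path <- tempfile(fileext = ".dta")
write.dta(toy, dta_path)

# Default: names are left as they are in the file
names(read.dta(dta_path))

# convert.underscore = TRUE: underscores become periods
dotted <- read.dta(dta_path, convert.underscore = TRUE)
names(dotted)
```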


Writing data files from R into Stata is also very straightforward. To save your data frame (DF) as a Stata file (fromR.dta), you simply use write.dta(DF, "fromR.dta"). My example below uses the line:

write.dta(STATA, "fromR.dta")

Of course, there are additional options specifying how to deal with factors and dates; those are discussed in the package documentation.
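As one illustration of the factor handling, this sketch writes the same made-up data frame twice: once with the default (factors become value-labelled variables) and once with convert.factors = "string" (factors are written as plain strings). The `DF` columns here are placeholders of my own.

```r
library(foreign)

# Made-up data frame with one factor column
DF <- data.frame(
  party = factor(c("Democrat", "Republican", "Democrat")),
  pid7  = c(1, 7, 2)
)

labelled_path <- tempfile(fileext = ".dta")
string_path   <- tempfile(fileext = ".dta")

# Default: factors are stored as value-labelled variables
write.dta(DF, labelled_path)

# Alternative: factors are stored as string variables
write.dta(DF, string_path, convert.factors = "string")

# Reading each file back shows the difference
class(read.dta(labelled_path)$party)
class(read.dta(string_path)$party)
```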

Once you open the file in Stata you will see it is written by R:
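If you don’t have Stata handy, you can check the same thing from R. As far as I can tell, write.dta() stores that note as the dataset label, and read.dta() hands it back in the “datalabel” attribute. A small sketch with a throwaway data frame:

```r
library(foreign)

# Throwaway data frame, written out and read straight back in
DF <- data.frame(x = 1:3)
dta_path <- tempfile(fileext = ".dta")
write.dta(DF, dta_path)

back <- read.dta(dta_path)

# The dataset label should contain the text "Written by R"
attr(back, "datalabel")
```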

Full code is below, enjoy:

By use-r-friendly

Tags: rstats foreign AdventCalendaR

Plotting letters as shapes in ggplot2

This post is a little more esoteric than most, but I found myself needing to solve this problem, so I’m just passing the solution on to you. The plot above shows the distribution of DW-NOMINATE scores for the 18th Congress, with party indicated by both color and shape. You will notice that there were more parties in 1824 than there are today: so many, in fact, that ggplot2 will resist plotting the seven shapes needed to account for each party (its default shape palette handles at most six discrete values). Note that I am confident that there is a good, peer-reviewed reason for this, so caveat emptor.

One work-around is to plot the initial letter of each party as a text geom, but in this case the legend shows the generic geom_text “a” rather than the letter used for each party. This is suboptimal, particularly if it’s not perfectly clear how the plotted letters map to the party names:

The solution (or possibly, hack) is to use geom_point with a shape aesthetic, but with a custom scale that passes the numeric codes of the letters you want to use. To do so, we just need to choose the initial to use for each party name, convert each to its character code with utf8ToInt(), and pass the codes to scale_shape_manual(). Incidentally, to look up the correspondence between shapes and their numeric codes, just run example(points).
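A minimal sketch of the idea follows. The party names and the random points are placeholders, not the actual 18th Congress data, and the plotting part only runs if ggplot2 is installed.

```r
# Placeholder party names (not the post's data)
parties <- c("Adams", "Crawford", "Federalist", "Jackson")

# Character code of each party's initial, named by party
party_shapes <- vapply(substr(parties, 1, 1), utf8ToInt, integer(1))
names(party_shapes) <- parties
party_shapes

# Random stand-in for the two DW-NOMINATE dimensions
set.seed(1)
members <- data.frame(
  dim1  = rnorm(40),
  dim2  = rnorm(40),
  party = rep(parties, each = 10)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(members, aes(x = dim1, y = dim2, colour = party, shape = party)) +
    geom_point(size = 3) +
    # Each party is drawn with the letter whose code we looked up
    scale_shape_manual(values = party_shapes)
  print(p)
}
```

Because the letters go through the shape scale, the legend shows each party next to its own letter.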

By d-sparks

Tags: foreign ggplot2 devtools graphics rstats

Congressional ideology by state

In a recent post, I illustrated how to add a background geom to your ggplot. While that code worked, and the plot looked fine, it was pointed out to me that I was missing an important aspect of plot layering with ggplot2. Namely, it is not, as I previously claimed, necessary to add extra NULL variables to the background data.frame.

Fortunately, I was put on the right path by the inimitable Hadley Wickham, who pointed out that there is, of course, a Function for That: mutate().

This Gist correctly builds a layered plot, shows how mutate() works, and plots DW-NOMINATE House ideology in two dimensions, by state, with an illustration of what I consider a very useful visualization technique: adding a reference distribution to each plot facet.
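The reference-distribution trick can be sketched roughly like this. It is not the Gist itself: the states and scores are made up, and I build the reference copy with plain subsetting rather than mutate(). The key point is that a layer whose data lacks the faceting variable is drawn in every facet.

```r
# Made-up ideal points for three hypothetical states
set.seed(42)
members <- data.frame(
  state = rep(c("State A", "State B", "State C"), each = 30),
  dim1  = c(rnorm(30, -0.3, 0.2), rnorm(30, 0.1, 0.3), rnorm(30, 0.4, 0.2)),
  dim2  = rnorm(90, 0, 0.3)
)

# Reference copy: same points, but without the faceting variable
reference <- members[, c("dim1", "dim2")]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(members, aes(x = dim1, y = dim2)) +
    # No 'state' column here, so these grey points repeat in every facet
    geom_point(data = reference, colour = "grey80") +
    # Each state's own members are drawn on top
    geom_point(colour = "black") +
    facet_wrap(~ state)
  print(p)
}
```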

By d-sparks

Tags: rstats foreign ggplot2 plyr graphics

Adding a background to your ggplot

I really enjoy using the DW-NOMINATE data for examples, as I do here. Sometimes it’s useful to indicate regions in the background of a plot — perhaps two-dimensional regions of interest, perhaps one-dimensional periods in time. It’s not always obvious how to combine data from two data.frames to form one plot in ggplot2, so here is another example.

The trick seems to be that the “second” data frame needs to include all of the same variables as you are using from the “first” data frame, in name at least. That is, if you are plotting variables called “x”, “y”, and “z” from the “first” data frame, your second data.frame needs to include variables named “x”, “y”, and “z”, even if you’re not plotting with those, and even if they are set to some arbitrary constant, as in df2$z <- 1.
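For comparison, here is a sketch of a different route that avoids the padding altogether: give the background layer its own data and its own aesthetic mapping, and set inherit.aes = FALSE so it does not look for the main layer’s variables. The data and the region boundaries are invented for the example.

```r
# Made-up points for the foreground layer
set.seed(7)
members <- data.frame(
  dim1 = runif(50, -1, 1),
  dim2 = runif(50, -1, 1)
)

# Background regions live in a second data frame with different columns
regions <- data.frame(
  xmin  = c(-1, 0),
  xmax  = c(0, 1),
  ymin  = c(-1, -1),
  ymax  = c(1, 1),
  label = c("left", "right")
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(members, aes(x = dim1, y = dim2)) +
    # The rectangle layer uses only its own data and mapping
    geom_rect(
      data = regions,
      aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = label),
      inherit.aes = FALSE,
      alpha = 0.2
    ) +
    geom_point()
  print(p)
}
```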

The distribution of ideology in the U.S. House (with plyr)

This Gist has a few things going on, so I’ll just list them:

  1. It downloads the entire history of U.S. House DW-NOMINATE scores from voteview.com
  2. It computes aggregate statistics, with members grouped by party and congress, using the easy weighted functions from Hmisc.
  3. It does the aggregating and data.frame conversion all in one very easy step, using plyr. I have always done this type of aggregation in other ways (by(), *apply(), etc.), but plyr sure made it easy.
  4. It makes a really nice plot of the distribution of first-dimension ideological ideal points, by party, over time.
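The aggregation step in items 2 and 3 can be sketched as follows. The scores and the `weight` column are made up (I am not reproducing whatever weights the Gist uses). The base R version always runs; the ddply() plus wtd.mean() version runs only if plyr and Hmisc are installed, and should give the same group means.

```r
# Made-up scores; 'weight' is a hypothetical weighting variable
scores <- data.frame(
  congress = c(1, 1, 1, 1, 2, 2, 2, 2),
  party    = c("A", "A", "B", "B", "A", "A", "B", "B"),
  dim1     = c(-0.4, -0.2, 0.3, 0.5, -0.5, -0.1, 0.2, 0.6),
  weight   = c(1, 3, 1, 1, 2, 2, 1, 3)
)

# Base R: split by congress and party, then take a weighted mean per group
groups <- split(scores, list(scores$congress, scores$party), drop = TRUE)
base_summary <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    congress  = g$congress[1],
    party     = g$party[1],
    mean_dim1 = weighted.mean(g$dim1, g$weight)
  )
}))
base_summary <- base_summary[order(base_summary$congress, base_summary$party), ]
rownames(base_summary) <- NULL
base_summary

# plyr + Hmisc: the same aggregation in one step
if (requireNamespace("plyr", quietly = TRUE) &&
    requireNamespace("Hmisc", quietly = TRUE)) {
  plyr_summary <- plyr::ddply(
    scores, c("congress", "party"), plyr::summarise,
    mean_dim1 = Hmisc::wtd.mean(dim1, weights = weight)
  )
  print(plyr_summary)
}
```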

By d-sparks

Tags: foreign plyr Hmisc ggplot2 graphics rstats