Using R: When using do in dplyr, don’t forget the dot

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

results <- ddply(some_data, "key", function(x) {
  ## do something; return data.frame()

Or maybe, if I felt serious and thought the function would ever be used again, I’d write:

calculate <- function(x) {
  ## do something; return data.frame()
result <- ddply(some_data, "key", calculate)

Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

grouped <- group_by(some_data, key)
result <- do(grouped, calculate(.))

Or once again with magrittr:

some_data %>%
  group_by(key) %>%
  do(calculate(.)) -> result

(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

grouped <- group_by(some_data, key)
## will not work: Error: Results are not data frames at positions ...
try(result <- do(grouped, calculate))

Don’t forget the dot!

It seems dplyr is overtaking correlation heatmaps

(… on my blog, that is.)

For a long time, my correlation heatmap with ggplot2 was the most viewed post on this blog. It still leads the overall top list, but by far the most searched and visited post nowadays is this one about dplyr (followed by it’s sibling about plyr).

I fully support this, since data wrangling and reorganization logically comes before plotting (especially in the ggplot2 philosophy).

But it’s also kind of a shame, because it’s not a very good dplyr post, and the one about the correlation heatmap is not a very good ggplot2 post. Thankfully, there is a new edition of the ggplot2 book by Hadley Wickham, and a new book by him and Garrett Grolemund about data analysis with modern R packages. I’m looking forward to reading them.

Personally, I still haven’t made the switch from plyr and reshape2 to dplyr and tidyr. But here is the updated tidyverse-using version of how to quickly calculate summary statistics from a data frame:


data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

gather(data, response1, response2, value = "value", key = "variable") %>%
  group_by(sex, treatment, variable) %>%
  summarise(mean = mean(value), sd = sd(value))

Row by row we:

1-3: Load the packages.

5-8: Simulate some nonsense data.

10: Transform the simulated dataset to long form. This means that the two variables response1 and response2 get collected to one column, which will be called ”value”. The column ”key” will indicate which variable each row belongs to. (gather is tidyr’s version of melt.)

11: Group the resulting dataframe by sex, treatment and variable. (This is like the second argument to d*ply.)

12: Calculate the summary statistics.

Source: local data frame [8 x 5]
Groups: sex, treatment [?]

    sex treatment  variable        mean        sd
  (dbl)     (dbl)     (chr)       (dbl)     (dbl)
1     1         1 response1 -0.02806896 1.0400225
2     1         1 response2 -0.01822188 1.0350210
3     1         2 response1  0.06307962 1.0222481
4     1         2 response2 -0.01388931 0.9407992
5     2         1 response1 -0.06748091 0.9843697
6     2         1 response2  0.01269587 1.0189592
7     2         2 response1 -0.01399262 0.9696955
8     2         2 response2  0.10413442 0.9417059