Using R: quickly calculating summary statistics from a data frame

A colleague asked: I have a lot of data in a table and I’d like to pull out some summary statistics for different subgroups. Can R do this for me quickly?

Yes, there are several pretty convenient ways. I wrote about this in the recent post on the barplot, but as this is an important part of quickly getting something useful out of R, just like importing data, I’ll break it out into a post of its own. I will present a solution that uses the plyr and reshape2 packages. You can do the same with base R, and there’s nothing wrong with base R, but I find that plyr and reshape2 makes things convenient and easy to remember. The apply family of functions in base R does the same job as plyr, but with a slightly different interface. I strongly recommend beginners to begin with plyr or the apply functions, and not what I did initially, which was nested for loops and hard bracket indexing.

We’ll go through and see what the different parts do. First, simulate some data. Again, when you do this, you usually have a table already, and you can ignore the simulation code. Usually a well formed data frame will look something this: a table where each observation is a unit such as an individual, and each column gives the data about the individual. Here, we imagine two binary predictors (sex and treatment) and two continuous response variables.

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))
head(data)

  sex treatment   response1   response2
1   1         1 -0.15668214 -0.13663012
2   1         2 -0.40934759 -0.07220426
3   1         1  0.07103731 -2.60549018
4   1         2  0.15113270  1.81803178
5   1         1  0.30836910  0.32596016
6   1         2 -1.41891407  1.12561812

Now, calculating a function of the response in some group is straightforward. Most R functions are vectorised by default and will accept a vector (that is, a column of a data frame). The subset function lets us pull out rows from the data frame based on a logical expression using the column names. Say that we want mean, standard deviation and a simple standard error of the mean. I will assume that we have no missing values. If you have, you can add na.rm=T to the function calls. And again, if you’ve got a more sophisticated model, these might not be the standard errors you want. Then pull them from the fitted model instead.

mean(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)$response1)/
  sqrt(nrow(subset(data, sex == 1 & treatment == 1)))

Okay, but doing this for each combination of the predictors and responses is no fun and requires a lot of copying and pasting. Also, the above function calls are pretty messy with lots of repetition. There is a better way, and that’s where plyr and reshape2 come in. We load the packages. The first time you’ll have to run install.packages, as usual.

library(plyr)
library(reshape2)

First out, the melt function from rehape2. Look at the table above. It’s reasonable in many situations, but right now, it would be better if we put both the response variables in the same column. If it doesn’t seem so useful, trust me and see below. Melt will take all the columns except the ones we single out as id variables and put them in the same column. It makes sense to label each row with the sex and treatment of the individual. If we had an actual unit id column, it would go here as well:

melted <- melt(data, id.vars=c("sex", "treatment"))

The resulting ”melted” table looks like this. Instead of the response variables separately we get a column of values and a column indicating which variable the value comes from.

  sex treatment  variable       value
1   1         1 response1 -0.15668214
2   1         2 response1 -0.40934759
3   1         1 response1  0.07103731
4   1         2 response1  0.15113270
5   1         1 response1  0.30836910
6   1         2 response1 -1.41891407

Now it’s time to calculate the summary statistics again. We will use the same functions as above to do the actual calculations, but we’ll use plyr to automatically apply them to all the subsets we’re interested in. This is sometimes called the split-apply-combine approach: plyr will split the data frame into subsets, apply the function of our choice, and then collect the results for us. The first thing to notice is the function name. All the main plyr functions are called something with -ply. The letters stand for the input and return data type: ddply works on a data frame and returns a data frame. It’s probably the most important member of the family.

The arguments to ddply are the data frame to work on (melted), a vector of the column names to split on, and a function. The arguments after the function name are passed on to the function. Here we want to split in subsets for each sex, treatment and response variable. The function we apply is summarise, which makes a new data frame with named columns based on formulas, allowing us to use the column names of the input data frame in formulas. In effect it does exactly what the name says, summarises a data frame. And in this instance, we want to calculate the mean, standard deviation and standard error of the mean, so we use the above function calls, using value as the input. Run the ddply call, and we’re done!

ddply(melted, c("sex", "treatment", "variable"), summarise,
      mean = mean(value), sd = sd(value),
      sem = sd(value)/sqrt(length(value)))

  sex treatment  variable         mean        sd        sem
1   1         1 response1  0.021856280 1.0124371 0.04527757
2   1         1 response2  0.045928150 1.0151670 0.04539965
3   1         2 response1 -0.065017971 0.9825428 0.04394065
4   1         2 response2  0.011512867 0.9463053 0.04232006
5   2         1 response1 -0.005374208 1.0095468 0.04514830
6   2         1 response2 -0.051699624 1.0154782 0.04541357
7   2         2 response1  0.046622111 0.9848043 0.04404179
8   2         2 response2 -0.055257295 1.0134786 0.04532414

10 reaktioner på ”Using R: quickly calculating summary statistics from a data frame”

Terry

26 mars, 2014 kl. 09:28

Using a data.table is another great option.
- mrtnj
  
  26 mars, 2014 kl. 10:09
  
  Yes, I’ve not used it myself, but so I’ve been told! By the way, I think dplyr works with data tables.
Kazuki Yoshida

26 mars, 2014 kl. 20:40

dplyr is also useful and faster for large data frames.

library(dplyr)
data %.%
dplyr::group_by(sex,treatment) %.%
dplyr::summarize(mean1 = mean(response1),
sd1 = sd(response1),
mean2 = mean(response2),
sd2 = sd(response2))

For biomedical folks, my tableone package may be of interest ( https://github.com/kaz-yos/tableone ).

library(tableone)
CreateTableOne(c(”response1″,”response2”), c(”sex”,”treatment”),data)
- mrtnj
  
  26 mars, 2014 kl. 20:47
  
  Thank you for that! Actually, today’s post is going to be about dplyr … However, dplyr is not what I use in my day-to-day work, so I’ll not recommend to others until I’ve gotten used to it 🙂
Daniel

26 mars, 2014 kl. 20:54

What about the ”describe” function in the ”psych” package? I’ve included that function in my package to create HTML table outputs in the viewer pane or web browser (http://strengejacke.wordpress.com/2014/03/04/beautiful-table-outputs-in-r-part-2-rstats-sjplot).
- mrtnj
  
  26 mars, 2014 kl. 22:07
  
  Also seems like a convenient solution for summary statistics!
Pingback: Using R: quickly calculating summary statistics (with dplyr) | There is grandeur in this view of life
samsamg

27 mars, 2014 kl. 23:18

It’s worth to show how use oldish/default functions.
set.seed() allows to reproduce the computation.
Always avoid loops as you said.
HTH

> set.seed(1); data head(data)
sex treatment response1 response2
1 1 1 -0.6264538 -0.88614959
2 1 2 0.1836433 -1.92225490
3 1 1 -0.8356286 1.61970074
4 1 2 1.5952808 0.51926990
5 1 1 0.3295078 -0.05584993
6 1 2 -0.8204684 0.69641761
> aggregate(data, by=list(data$sex, data$treatment), mean)
Group.1 Group.2 sex treatment response1 response2
1 1 1 1 1 -0.027716483 0.04982330
2 2 1 2 1 -0.063763436 0.04657828
3 1 2 1 2 0.004420199 -0.01920523
4 2 2 2 2 0.031239613 -0.01313378
> aggregate(data, by=list(data$sex, data$treatment), sd)
Group.1 Group.2 sex treatment response1 response2
1 1 1 0 0 1.0371566 1.008409
2 2 1 0 0 1.0959981 1.024779
3 1 2 0 0 1.0334586 1.053195
4 2 2 0 0 0.9795697 1.052524
> aggregate(data, by=list(data$sex, data$treatment), function(x) sd(x)/sqrt(length(x)))
Group.1 Group.2 sex treatment response1 response2
1 1 1 0 0 0.04638305 0.04509740
2 2 1 0 0 0.04901452 0.04582949
3 1 2 0 0 0.04621767 0.04710029
4 2 2 0 0 0.04380769 0.04707031
Pingback: It seems dplyr is overtaking correlation heatmaps | On unicorns and genes
gabriela

2 augusti, 2017 kl. 14:16

thaaaaaank youuuuuuuuuuuuuuuuuuuuuuuuuuuuu

Kommentarer är stängda.

Dela detta:

Relaterade

10 reaktioner på ”Using R: quickly calculating summary statistics from a data frame”