# Toying with models: The Game of Life with selection

Conway’s Game of life is probably the most famous cellular automaton, consisting of a grid of cells developing according simple rules. Today, we’re going to add mutation and selection to the game, and let patterns evolve.

The fate of a cell depends on the number cells that live in the of neighbouring positions. A cell with fewer than two neighbours die from starvation. A cell with more than three neighbours die from overpopulation. If a position is empty and has three neighbours, it will be filled by a cell. These rules lead to some interesting patterns, such as still lives that never change, oscillators that alternate between states, patterns that eventually die out but take long time to do so, patterns that keep generating new cells, and so forth.

When I played with the Game of life when I was a child, I liked one pattern called ”virus”, that looked a bit like this. On its own, a grid of four-by-four blocks is a still life, but add one cell (the virus), and the whole pattern breaks. This is a version on a 30 x 30 cell board. It unfolds rather slowly, but in the end, a glider collides with a block, and you are left with some oscillators.

There are probably other interesting ways that evolution could be added to the game of life. We will take a hierarchical approach where the game is taken to describe development, and the unit of selection is the pattern. Each generation, we will create a variable population of patterns, allow them to develop and pick the fittest. So, here the term ”development” refers to what happens to a pattern when applying the rules of life, and the term ”evolution” refers to how the population of patterns change over the generations. This differ slightly from Game of life terminology, where ”evolution” and ”generation” usually refer to the development of a pattern, but it is consistent with how biologists use the words: development takes place during the life of an organism, and evolution happens over the generations as organisms reproduce and pass on their genes to offspring. I don’t think there’s any deep analogy here, but we can think of the initial state of the board as the heritable material that is being passed on and occasionally mutated. We let the pattern develop, and at some point, we apply selection.

First, we need an implementation of the game of life in R. We will represent the board as a matrix of ones (live cells) and zeroes (empty positions). Here is function develops the board one tick in time. After dealing with the corners and edges, it’s very short, but also slow as molasses. The next function does this for a given number of ticks.

## Develop one tick. Return new board matrix.
develop <- function(board_matrix) {
padded <- rbind(matrix(0, nrow = 1, ncol = ncol(board_matrix) + 2),
cbind(matrix(0, ncol = 1, nrow = nrow(board_matrix)),
board_matrix,
matrix(0, ncol = 1, nrow = nrow(board_matrix))),
matrix(0, nrow = 1, ncol = ncol(board_matrix) + 2))
for (i in 2:(nrow(padded) - 1)) {
for (j in 2:(ncol(padded) - 1)) {
if (neighbours < 2 | neighbours > 3) {
new_board[i, j] <- 0
}
if (neighbours == 3) {
new_board[i, j] <- 1
}
}
}
}

## Develop a board a given number of ticks.
tick <- function(board_matrix, ticks) {
if (ticks > 0) {
for (i in 1:ticks) {
board_matrix <- develop(board_matrix)
}
}
board_matrix
}


We introduce random mutations to the board. We will use a mutation rate of 0.0011 per cell, which gives us a mean of a bout one mutation for a 30 x 30 board.

## Mutate a board
mutate <- function(board_matrix, mutation_rate) {
mutated <- as.vector(board_matrix)
outcomes <- rbinom(n = length(mutated), size = 1, prob = mutation_rate)
for (i in 1:length(outcomes)) {
if (outcomes[i] == 1)
mutated[i] <- ifelse(mutated[i] == 0, 1, 0)
}
matrix(mutated, ncol = ncol(board_matrix), nrow = nrow(board_matrix))
}


I was interested in the virus pattern, so I decided to apply a simple directional selection scheme for number of cells at tick 80, which is a while after the virus pattern has stabilized itself into oscillators. We will count the number of cells at tick 80 and call that ”fitness”, even if it actually isn’t (it is a trait that affects fitness by virtue of the fact that we select on it). We will allow the top half of the population to produce two offspring each, thus keeping the population size constant at 100 individuals.

## Calculates the fitness of an individual at a given time
get_fitness <- function(board_matrix, time) {
board_matrix %>% tick(time) %>% sum
}

## Develop a generation and calculate fitness
grow <- function(generation) {
generation$fitness <- sapply(generation$board, get_fitness, time = 80)
generation
}

## Select a generation based on fitness, and create the next generation,
next_generation <- function(generation) {
keep <- order(generation$fitness, decreasing = TRUE)[1:50] new_generation <- list(board = vector(mode = "list", length = 100), fitness = numeric(100)) ix <- rep(keep, each = 2) for (i in 1:100) new_generation$board[[i]] <- generation$board[[ix[i]]] new_generation$board <- lapply(new_generation$board, mutate, mutation_rate = mu) new_generation } ## Evolve a board, with mutation and selection for a number of generation. evolve <- function(board, n_gen = 10) { generations <- vector(mode = "list", length = n_gen) generations[[1]] <- list(board = vector(mode = "list", length = 100), fitness = numeric(100)) for (i in 1:100) generations[[1]]$board[[i]] <- board
generations[[1]]$board <- lapply(generations[[1]]$board, mutate, mutation_rate = mu)

for (i in 1:(n_gen - 1)) {
generations[[i]] <- grow(generations[[i]])
generations[[i + 1]] <- next_generation(generations[[i]])
}
generations[[n_gen]] <- grow(generations[[n_gen]])
generations
}


Let me now tell you that I was almost completely wrong about what happens with this pattern once you apply selection. I thought that the initial pattern of nine stable blocks (36 cells) was pretty good, and that it would be preserved for long, and that virus-like patterns (like the first animation above) would mostly have degenerated around 80. As this plot of the evolution of the number of cells in one replicate shows, I grossly underestimated this pattern. The y-axis is number of cells at time 80, and the x-axis individuals, the vertical lines separating generations. Already by generation five, most individuals do better than 36 cells in this case:

As one example, here is the starting position and the state at time 80 for a couple of individuals from generation 10 of one of my replicates:

Here is how the average cell number at time 80 evolves in five replicates. Clearly, things are still going on at generation 10, not only in the replicate shown above.

Here is the same plot for the virus pattern I showed above, i.e. the blocks but with one single added cell, fixed in the starting population. Prior genetic architecture matters. Even if the virus pattern has fewer cells than the blocks pattern at time 80, it is apparently a better starting point to quickly evolve more cells:

And finally, out of curiosity, what happens if we start with an empty 30 x 30 board?

Not much. The simple still life block evolves a lot. But in my replicate three, this creature emerged. ”Life, uh, finds a way.”

Unfortunately, many of the selected patterns extended to the edges of the board, making them play not precisely the game of life, but the game of life with edge effects. I’d like to use a much bigger board and see how far patterns extend. It would also be fun to follow them longer. To do that, I would need to implement a more efficient way to update the board (this is very possible, but I was lazy). It would also be fun to select for something more complex, with multiple fitness components, potentially in conflict, e.g. favouring patterns that grow large at a later time while being as small as possible at an earlier time.

Code is on github, including functions to display and animate boards with the animation package and ImageMagick, and code for the plots. Again, the blocks_selection.R script is slow, so leave it running and go do something else.

# Toying with models: The Luria–Delbrück fluctuation test

I hope that Genetics will continue running expository papers about their old classics, like this one by Philip Meneely about Luria & Delbrück (1943). Luria & Delbrück performed an experiment on bacteriophage resistance in Escherichia coli, growing bacterial cultures, exposing them to a phage, and then plating and counting the survivors, who have become resistant to the phage. They considered two hypotheses: either resistance occurs adaptively, in response to the phage, or it occurs by mutation some time during the growth of the culture but before the phages are added. They find the latter to be the case, and this is an example of how mutations happen irrespective of their effects of fitness, in a sense at random. Their analysis is based on a model of bacterial growth and mutation, and the aim of this exercise is to explore this model by simulating some data.

First, we assume that mutation happens with a fixed mutation rate $\mu = 2 \cdot 10^{-8}$, which is quite close to their estimated value, and that the mutation can’t reverse. We also assume that the bacteria grow by doubling each generation up to 30 generations. We start a culture from a single susceptible bacterium, and let it grow for a number of generations before the phage is added. (We’re going to use discrete generations, while Luria & Delbrück use a continuous function.) Then:

$n_{susceptible,i+1}= 2 (n_{susceptible,i} - n_{mutants,i})$

$n_{resistant,i+1} = 2 (n_{resistant,i} + n_{mutants,i})$

That is, every generation i, the mutants that occur move from the susceptible to the resistant category. The number of mutants that happen among the susceptible is binomially distributed:

$n_{mutants,i} \sim Binomial(n_{susceptible,i}, \mu)$.

This is an R function to simulate a culture:

culture <- function(generations, mu) {
n_susceptible <- numeric(generations)
n_resistant <- numeric(generations)
n_mutants <- numeric(generations)
n_susceptible[1] <- 1
for (i in 1:(generations - 1)) {
n_mutants[i] <- rbinom(n = 1, size = n_susceptible[i], prob = mu)
n_susceptible[i + 1] &lt;- 2 * (n_susceptible[i] - n_mutants[i])
n_resistant[i + 1] &lt;- 2 * (n_resistant[i] + n_mutants[i])
}
data.frame(generation = 1:generations,
n_susceptible,
n_resistant,
n_mutants)
}
cultures <- replicate(1000, culture(30, 2e-8), simplify = FALSE)


We run a few replicate cultures and plot the number of resistant bacteria. This graph shows the point pretty well: Because of random mutation and exponential growth, the cultures where mutations happen to arise relatively early will give rise to a lot more resistant bacteria than the ones were the first mutations are late. Therefore, there will be a lot of variation between the cultures because of their different histories.

combined <- Reduce(function (x, y) rbind(x, y), cultures)
combined$culture <- rep(1:1000, each = 30) resistant_plot <- qplot(x = generation, y = n_resistant, group = culture, data = combined, geom = "line", alpha = I(1/10), size = I(1)) + theme_bw()  We compare this to what happens under the alternative hypothesis where resistance arises as a consequence of introduction of the phage with some resistance rate (this is not the same as the mutation rate above, even though we’re using the same value). Then the number of resistant cells in a culture will be: $n_{acquired} \sim Binomial(2^{29}, \mu_{aquried})$. resistant <- unlist(lapply(cultures, function(x) max(x$n_resistant)))

acquired_resistant <- rbinom(n = 1000, size = 2^29, 2e-8)

resistant_combined <- rbind(transform(data.frame(resistant = acquired_resistant), model = "acquired"),
transform(data.frame(resistant = resistant), model = "mutation"))

resistant_histograms <- qplot(x = resistant, data = resistant_combined,bins = 10) +
facet_wrap(~ model, scale = "free_x")


Here are two histograms side by side to compare the cases. The important thing is the shape. If the acquired resistance hypothesis holds, the number of resistant bacteria in replicate cultures follows a Poisson distribution, because it arises when one counts the number of binomially distributed events that occur in a given number of trials. The interesting thing about the Poisson distribution in this case is that its mean is equal to the variance. However, under the mutation model (as we’ve already illustrated), there is a lot of variation between cultures. These fluctuations make the variance much larger than the mean, which is also what Luria and Delbrück found in their data. Therefore, the results are inconsistent with acquired mutation, and hence the experiment is called the Luria–Delbrück fluctuation test.

mean(resistant)
var(resistant)
mean(acquired_resistant)
var(acquired_resistant)


Literature

Luria, S. E., & Delbrück, M. (1943). Mutations of bacteria from virus sensitivity to virus resistance. Genetics, 28(6), 491.

Meneely, P. M. (2016). Pick Your Poisson: An Educational Primer for Luria and Delbrück’s Classic Paper. Genetics, 202(2), 371-375.

# R in genomics @ SciLifeLab, Solna

Dear diary,

I went to the Stockholm R useR group meetup on R in genomics at the Stockholm node of SciLifeLab. It was nice. If I had worked a bit closer I would attend meetups all the time. I even got to be pretentious with my notebook while waiting for the train.

The speakers were:

Jakub Orzechowski Westholm on R and genomics in general. He demonstrated genome browser-style tracks with Gviz, some GenomicRanges, and a couple of common plots of gene expression data. I have been on the fence about what package I should use for drawing genes and variants along the genome. I should play with Gviz.

Daniel Klevebring on clinical sequencing and how he uses R (not that much) in sequencing pipelines aimed at targeting the right therapy to patients based on the mutations in their cancer cells. He mentioned some getopt snippets for getting R to play nicely on the command line, which is something I should definitely try more!

Finally, Arvind Singh Mer on predictive modelling for clinical genomics (like the abovementioned ClinSeq data). He showed the caret package for machine learning, with an elastic net regression.

I don’t know the rest of the audience, so maybe the choice to gear talks towards the non-bio* person was spot on, but that made things a bit less interesting for me. For instance, in Jakub’s talk about gene expression, I would’ve preferred more about the messy stuff: how to make that nice gene-by-sample matrix in the first place, and if R can be of any help in that process; also, in the other end, what models one would use after that first pass of visualisation. But this isn’t a criticism of the presenters — time and complexity constraints apply. (If I was asked to present how I use R any demos would be toy analyses of clean datasets. That is the way these things go.)

We also heard repeated praise for and recommendations of the hadleyverse and data.table. I’m not a data.tabler myself, but I probably should be. And I completely agree about the value of dplyr — there’s this one analysis where a couple of lines with dplyr changed it from ”argh, do I have to rewrite this in C?” to being workable. I think we also saw all the three plotting systems: base graphics, ggplot2 and lattice in action.

# Finding the distance from ChIP signals to genes

I’ve had a couple of months off from blogging. Time for some computer-assisted biology! Robert Griffin asks on Stack Exchange about finding the distance between HP1 binding sites and genes in Drosophila melanogaster.  We can get a rough idea with some public chromatin immunoprecipitation data, R and the wonderful BEDTools.

### Finding some binding sites

There are indeed some ChIP-seq datasets on HP1 available. I looked up these ones from modENCODE: modENCODE_3391 and modENCODE_3392, using two different antibodies for Hp1b in 16-24 h old embroys. I’m not sure since the modENCODE site doesn’t seem to link datasets to publications, but I think this is the paper where the results are reported: A cis-regulatory map of the Drosophila genome (Nègre & al 2011).

What they’ve done, in short, is cross-linking with formaldehyde, sonicate DNA into fragments, capture fragments with either of the two antibodies and sequence those fragments. They aligned reads with Eland (Illumina’s old proprietary aligner) and called peaks (i.e. regions where there is a lot of reads, which should reflect regions bound by Hp1b) with MACS. We can download their peaks in general feature format.

I don’t know whether there is any way to make completely computation predictions of Hp1 binding sites but I doubt it.

### Some data cleaning

The files are available from ftp, and for the below analysis I’ve unzipped them and called them modENCODE_3391.gff3 and modENCODE_3392.gff3. GFF is one of all those tab separated text files that people use for genomic coordinates. If you do any bioinformatics type work you will have to convert back and forth between them and I suggest bookmarking the UCSC Genome Browser Format FAQ.

Even when we trust in their analysis, some processing of files is always required. In this case, MACS sometimes outputs peaks with negative start coordinates in the beginning of a chromosome. BEDTools will have none of that, because ”malformed GFF entry at line … Start was greater than end”. In this case, it happens only at a few lines, and I decided to set those start coordinates to 1 instead.

We need a small script to solve that. As I’ve written before, any language will do, but I like R and tend to do my utility scripting in R (and bash). If the files were incredibly huge and didn’t fit in memory, we’d have to work through the files line by line or chunk by chunk. But in this case we can just read everything at once and operate on it with vectorised R commands, and then write the table again.

modENCODE_3391 <- read.table("modENCODE_3391.gff3", stringsAsFactors=F, sep="\t")

fix.coord <- function(gff) {
gff$V4[which(gff$V4 < 1)] <- 1
gff
}

write.gff <- function(gff, file) {
write.table(gff, file=file, row.names=F, col.names=F,
quote=F, sep="\t")
}

write.gff(fix.coord(modENCODE_3391), file="cleaned_3391.gff3")
write.gff(fix.coord(modENCODE_3392), file="cleaned_3392.gff3")


### Flybase transcripts

To find the distance to genes, we need to know where the genes are. The best source is probably the annotation made by Flybase, which I downloaded from the Ensembl ftp in General transfer format (GTF, which is close enough to GFF that we don’t have to care about the differences right now).

This file contains a lot of different features. We extract the transcripts and find where the transcript model starts, taking into account whether the transcript is in the forward or reverse direction (this information is stored in columns 4, 5 and 7 of the GTF file). We store this in a new GTF file of transcript start positions, which is the one we will feed to BEDTools:

ensembl <- read.table("Drosophila_melanogaster.BDGP5.75.gtf",
stringsAsFactors=F, sep="\t")

transcript <- subset(ensembl, V3=="transcript")
transcript.start <- transcript
transcript.start$V3 <- "transcript_start" transcript.start$V4 <- transcript.start$V5 <- ifelse(transcript.start$V7 == "+",
transcript$V4, transcript$V5)

write.gff(transcript.start, file="ensembl_transcript_start.gtf")


### Finding distance with BEDTools

Time to find the closest feature to each transcript start! You could do this in R with GenomicRanges, but I like BEDTools. It’s a command line tool, and if you haven’t already you will need to download and compile it, which I recall being painless.

bedtools closest is the command that finds, for each feature in one file, the closest feature in the other file. The -a and -b flags tells BEDTools which files to operate on, and the -d flag that we also want it to output the distance. BEDTools writes output to standard out, so we use ”>” to capture it in a text file.

Here is the bash script. I put the above R code in clean_files.R and added it as an Rscript line at the beginning, so I could run it all with one file.

#!/bin/bash
Rscript clean_files.R

bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3391.gff3 \
> closest_element_3391.txt
bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3392.gff3 \
> closest_element_3392.txt


### Some results

With the resulting file we can go back to R and ggplot2 and draw cute graphs like this, which shows the distribution of distances from transcript to Hp1b peak for protein coding and noncoding transcripts separately. Note the different y-scales (there are way more protein coding genes in the annotation) and the 10-logarithm plus one transformation on the x-axis. The plus one is to show the zeroes; BEDTools returns a distance of 0 for transcripts that overlap a Hp1b site.

closest_3391 <- read.table("~/blogg/dmel_hp1/closest_element_3391.txt", header=F, sep="\t")

library(ggplot2)
qplot(x=log10(V19 + 1), data=subset(closest_3391, V2 %in% c("protein_coding", "ncRNA"))) +
facet_wrap(~V2, scale="free_y")


Or by merging the datasets from different antibodies, we can draw this strange beauty, which pretty much tells us that the antibodies do not give the same result in terms of the closest feature. To figure out how they differ, one would have to look more closely into the genomic distribution of the peaks.

closest_3392 <- read.table("~/blogg/dmel_hp1/closest_element_3392.txt", header=F, sep="\t")

combined <- merge(closest_3391, closest_3392,
by.x=c("V1", "V2","V4", "V5", "V9"),
by.y=c("V1", "V2","V4", "V5", "V9"))

qplot(x=log10(V19.x+1), y=log10(V19.y+1), data=combined)


(If you’re wondering about the points that end up below 0, those are transcripts where there are no peaks called on that chromosome in one of the datasets. BEDTools returns -1 for those that lack matching features on the same chromosome and R will helpfully transform them to -Inf.)

The question mentioned the DGRP. I don’t know that anyone has looked at ChIP in the DGRP lines, but wouldn’t that be fun? Quantitative genetics of DNA binding protein variation in DGRP and integration with eQTL … What one could do already, though, is take the interesting sites of Hp1 binding and overlap them with the genetic variants of the DGRP lines. I don’t know if that would tell you much — does anyone know what kind of variant would affect Hp1 binding?

Happy hacking!

# More fun with %.% and %>%

The %.% operator in dplyr allows one to put functions together without lots of nested parentheses. The flanking percent signs are R’s way of denoting infix operators; you might have used %in% which corresponds to the match function or %*% which is matrix multiplication. The %.% operator is also called chain, and what it does is rearrange the call to pass its left hand side on as a parameter to the right hand side function. As noted in the documentation this makes function calls read from left to right instead of inside and out. Yesterday we we took a simulated data frame, called data, and calculated some summary statistics. We could put the entire script together with %.%:

library(dplyr)
data %.%
melt(id.vars=c("treatment", "sex")) %.%
group_by(sex, treatment, variable) %.%
summarise(mean(value))


I haven’t figured out what would be the best indentation here, but I think this looks pretty okay. Of course it works for non-dplyr functions as well, but they need to take the input data as their first argument.

data %.% lm(formula=response1 ~ factor(sex)) %.% summary()


As mentioned, dplyr is not the only package that has something like this, and according to a comment from Hadley Wickham, future dplyr will use the magrittr package instead, a package that adds piping to R. So let’s look at magrittr! The magrittr %>% operator works much the same way, except it allows one to put ”.” where the data is supposed to go. This means that the data doesn’t have to be the first argument to the function. For example, we can do this, which would give an error with dplyr:

library(magrittr)
data %>% lm(response1 ~ factor(sex), .) %>% summary()


Moreover, Conrad Rudolph has used the operators %.%, %|>% and %|% in his own package for functional composition, chaining and piping. And I’m sure he is not the only one; there are several more packages that bring more new ways to define and combine functions into R. I hope I will revisit this topic when I’ve gotten used to it and decided what I like and don’t like. This might be confusing for a while with similar and rather cryptic operators that do slightly different things, but I’m sure it will turn out to be a useful development.

# Using R: quickly calculating summary statistics (with dplyr)

I know I’m on about Hadley Wickham‘s packages a lot. I’m not the president of his fanclub, but if there is one I’d certainly like to be a member. dplyr is going to be a new and improved ddply: a package that applies functions to, and does other things to, data frames. It is also faster and will work with other ways of storing data, such as R’s relational database connectors. I use plyr all the time, and obviously I want to start playing with dplyr, so I’m going to repeat yesterday’s little exercise with dplyr. Readers should be warned: this is really just me playing with dplyr, so the example will not be particularly profound. The post at the Rstudio blog that I just linked contains much more information.

So, here comes the code to do the thing we did yesterday but with dplyr:

## The code for the toy data is exactly the same
data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
treatment = rep(c(1, 2), 1000),
response1 = rnorm(2000, 0, 1),
response2 = rnorm(2000, 0, 1))

## reshape2 still does its thing:
library(reshape2)
melted <- melt(data, id.vars=c("sex", "treatment"))

## This part is new:
library(dplyr)
grouped <- group_by(melted, sex, treatment)
summarise(grouped, mean=mean(value), sd=sd(value))


When we used plyr yesterday all was done with one function call. Today it is two: dplyr has a separate function for splitting the data frame into groups. It is called group_by and returns the grouped data. Note that no quotation marks or concatenation were used when passing the column names. This is what it looks like if we print it:

Source: local data frame [4,000 x 4]
Groups: sex, treatment, variable

sex treatment  variable       value
1    1         1 response1 -0.15668214
2    1         2 response1 -0.40934759
3    1         1 response1  0.07103731
4    1         2 response1  0.15113270
5    1         1 response1  0.30836910
6    1         2 response1 -1.41891407
7    1         1 response1 -0.07390246
8    1         2 response1 -1.34509686
9    1         1 response1  1.97215697
10   1         2 response1 -0.08145883


The grouped data is still a data frame, but it contains a bunch of attributes that contain information about grouping.

The next function is a call to the summarise function. This is a new version of a summarise function similar to one in plyr. It will summarise the grouped data in columns given by the expressions you feed it. Here, we calculate mean and standard deviation of the values.

Source: local data frame [8 x 5]
Groups: sex, treatment

sex treatment  variable         mean        sd
1   1         1 response1  0.021856280 1.0124371
2   1         1 response2  0.045928150 1.0151670
3   1         2 response1 -0.065017971 0.9825428
4   1         2 response2  0.011512867 0.9463053
5   2         1 response1 -0.005374208 1.0095468
6   2         1 response2 -0.051699624 1.0154782
7   2         2 response1  0.046622111 0.9848043
8   2         2 response2 -0.055257295 1.0134786


Maybe the new syntax is slightly clearer. Of course, there are alternative ways of expressing it, one of which is pretty interesting. Here are two equivalent versions of the dplyr calls:

summarise(group_by(melted, sex, treatment, variable),
mean=mean(value), sd=sd(value))

melted %.% group_by(sex, treatment, variable) %.%
summarise(mean=mean(value), sd=sd(value))


The first one is nothing special: we’ve just put the group_by call into summarise. The second version, though, is a strange creature. dplyr uses the operator %.% to denote taking what is on the left and putting it into the function on the right. Reading from the beginning of the expression we take the data (melted), push it through group_by and pass it to summarise. The other arguments to the functions are given as usual. This may seem very alien if you’re used to R syntax, or you might recognize it from shell pipes. This is not the only attempt make R code less nested and full of parentheses. There doesn’t seem to be any consensus yet, but I’m looking forward to a future where we can write points-free R.

# Using R: quickly calculating summary statistics from a data frame

A colleague asked: I have a lot of data in a table and I’d like to pull out some summary statistics for different subgroups. Can R do this for me quickly?

Yes, there are several pretty convenient ways. I wrote about this in the recent post on the barplot, but as this is an important part of quickly getting something useful out of R, just like importing data, I’ll break it out into a post of its own. I will present a solution that uses the plyr and reshape2 packages. You can do the same with base R, and there’s nothing wrong with base R, but I find that plyr and reshape2 makes things convenient and easy to remember. The apply family of functions in base R does the same job as plyr, but with a slightly different interface. I strongly recommend beginners to begin with plyr or the apply functions, and not what I did initially, which was nested for loops and hard bracket indexing.

We’ll go through and see what the different parts do. First, simulate some data. Again, when you do this, you usually have a table already, and you can ignore the simulation code. Usually a well formed data frame will look something this: a table where each observation is a unit such as an individual, and each column gives the data about the individual. Here, we imagine two binary predictors (sex and treatment) and two continuous response variables.

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
treatment = rep(c(1, 2), 1000),
response1 = rnorm(2000, 0, 1),
response2 = rnorm(2000, 0, 1))

  sex treatment   response1   response2
1   1         1 -0.15668214 -0.13663012
2   1         2 -0.40934759 -0.07220426
3   1         1  0.07103731 -2.60549018
4   1         2  0.15113270  1.81803178
5   1         1  0.30836910  0.32596016
6   1         2 -1.41891407  1.12561812


Now, calculating a function of the response in some group is straightforward. Most R functions are vectorised by default and will accept a vector (that is, a column of a data frame). The subset function lets us pull out rows from the data frame based on a logical expression using the column names. Say that we want mean, standard deviation and a simple standard error of the mean. I will assume that we have no missing values. If you have, you can add na.rm=T to the function calls. And again, if you’ve got a more sophisticated model, these might not be the standard errors you want. Then pull them from the fitted model instead.

mean(subset(data, sex == 1 & treatment == 1)$response1) sd(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)\$response1)/
sqrt(nrow(subset(data, sex == 1 & treatment == 1)))


Okay, but doing this for each combination of the predictors and responses is no fun and requires a lot of copying and pasting. Also, the above function calls are pretty messy with lots of repetition. There is a better way, and that’s where plyr and reshape2 come in. We load the packages. The first time you’ll have to run install.packages, as usual.

library(plyr)
library(reshape2)


First out, the melt function from rehape2. Look at the table above. It’s reasonable in many situations, but right now, it would be better if we put both the response variables in the same column. If it doesn’t seem so useful, trust me and see below. Melt will take all the columns except the ones we single out as id variables and put them in the same column. It makes sense to label each row with the sex and treatment of the individual. If we had an actual unit id column, it would go here as well:

melted <- melt(data, id.vars=c("sex", "treatment"))


The resulting ”melted” table looks like this. Instead of the response variables separately we get a column of values and a column indicating which variable the value comes from.

  sex treatment  variable       value
1   1         1 response1 -0.15668214
2   1         2 response1 -0.40934759
3   1         1 response1  0.07103731
4   1         2 response1  0.15113270
5   1         1 response1  0.30836910
6   1         2 response1 -1.41891407


Now it’s time to calculate the summary statistics again. We will use the same functions as above to do the actual calculations, but we’ll use plyr to automatically apply them to all the subsets we’re interested in. This is sometimes called the split-apply-combine approach: plyr will split the data frame into subsets, apply the function of our choice, and then collect the results for us. The first thing to notice is the function name. All the main plyr functions are called something with -ply. The letters stand for the input and return data type: ddply works on a data frame and returns a data frame. It’s probably the most important member of the family.

The arguments to ddply are the data frame to work on (melted), a vector of the column names to split on, and a function. The arguments after the function name are passed on to the function. Here we want to split in subsets for each sex, treatment and response variable. The function we apply is summarise, which makes a new data frame with named columns based on formulas, allowing us to use the column names of the input data frame in formulas. In effect it does exactly what the name says, summarises a data frame. And in this instance, we want to calculate the mean, standard deviation and standard error of the mean, so we use the above function calls, using value as the input. Run the ddply call, and we’re done!

ddply(melted, c("sex", "treatment", "variable"), summarise,
mean = mean(value), sd = sd(value),
sem = sd(value)/sqrt(length(value)))

  sex treatment  variable         mean        sd        sem
1   1         1 response1  0.021856280 1.0124371 0.04527757
2   1         1 response2  0.045928150 1.0151670 0.04539965
3   1         2 response1 -0.065017971 0.9825428 0.04394065
4   1         2 response2  0.011512867 0.9463053 0.04232006
5   2         1 response1 -0.005374208 1.0095468 0.04514830
6   2         1 response2 -0.051699624 1.0154782 0.04541357
7   2         2 response1  0.046622111 0.9848043 0.04404179
8   2         2 response2 -0.055257295 1.0134786 0.04532414