Using R: When using do in dplyr, don’t forget the dot

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

library(plyr)
results <- ddply(some_data, "key", function(x) {
  ## do something; return data.frame()
})

Or maybe, if I felt serious and thought the function would ever be used again, I’d write:

calculate <- function(x) {
  ## do something; return data.frame()
}
result <- ddply(some_data, "key", calculate)

Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

library(dplyr)
grouped <- group_by(some_data, key)
result <- do(grouped, calculate(.))

Or once again with magrittr:

library(magrittr)
some_data %>%
  group_by(key) %>%
  do(calculate(.)) -> result

(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

grouped <- group_by(some_data, key)
## will not work: Error: Results are not data frames at positions ...
try(result <- do(grouped, calculate))

Don’t forget the dot!
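To put the pieces together, here is a small self-contained sketch of the whole pattern, dot included. The data frame and the calculate() function are made up for illustration:

library(dplyr)

some_data <- data.frame(key = rep(c("a", "b"), each = 5),
                        value = rnorm(10))

## Takes the data frame for one group; returns a one-row data frame
calculate <- function(x) {
  data.frame(n = nrow(x), mean_value = mean(x$value))
}

some_data %>%
  group_by(key) %>%
  do(calculate(.))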

Using R: a function that adds multiple ggplot2 layers

Another interesting thing that an R course participant identified: Sometimes one wants to make a function that returns multiple layers to be added to a ggplot2 plot. One could think that just adding them and returning would work, but it doesn’t. I think it has to do with how + is evaluated. There are a few workarounds that achieve similar results and may save typing.

First, some data to play with: this is a built-in dataset of chickens growing:

library(ggplot2)

data(ChickWeight)
diet1 <- subset(ChickWeight, Diet == 1)
diet2 <- subset(ChickWeight, Diet == 2)

This is just an example that shows the phenomenon. The first two functions will work, but combining them won’t.

add_line <- function(df) {
  geom_line(aes(x = Time, y = weight, group = Chick), data = df)
}

add_points <- function(df) {
  geom_point(aes(x = Time, y = weight), data = df)
}

add_line_points <- function(df) {
  add_line(df) + add_points(df)
}

## works
(plot1 <- ggplot() + add_line(diet1) + add_points(diet1))

## won't work: non-numeric argument to binary operator
try((plot2 <- ggplot() + add_line_points(diet1)))

Update: In the comments, Eric Pedersen gave a neat solution: stick the layers in a list and add the list. Like so:

(plot2.5 <- ggplot() + list(add_line(diet1), add_points(diet1)))

Nice! I did not know that one.
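The list trick also works inside a function, so the failing add_line_points() above can be rescued by returning a list of layers instead of trying to add them together. A small variation of my own, not from the original post:

## returns a list of layers, which can be added to a plot in one go
add_line_points_list <- function(df) {
  list(add_line(df), add_points(df))
}

(plot2.6 <- ggplot() + add_line_points_list(diet1))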

Also, you can get the same result by putting the mappings and the data in the ggplot() call itself. This only works if all the layers plot the same data, but that covers a fair number of cases:

## bypasses the issue by putting mappings in ggplot()
(plot3 <- ggplot(aes(x = Time, y = weight, group = Chick), data = diet1) +
    geom_line() + geom_point())

Another way is to write a function that takes the plot object as input and returns a modified version of it. If we use the pipe operator %>%, found in the magrittr package, it even gets a ggplot2-like feel:

## bypasses the issue and gives a similar feel with pipes

library(magrittr)

add_line_points2 <- function(plot, df, ...) {
  plot +
    geom_line(aes(x = Time, y = weight, group = Chick), ..., data = df) +
    geom_point(aes(x = Time, y = weight), ..., data = df)
}

(plot4 <- ggplot() %>% add_line_points2(diet1) %>%
   add_line_points2(diet2, colour = "red"))

Finally, in many cases, one can stick all the data in a combined data frame, and avoid building up the plot from different data frames altogether.

## plot the whole dataset at once
(plot5 <- ggplot(aes(x = Time, y = weight, group = Chick, colour = Diet),
                 data = ChickWeight) +
   geom_line() + geom_point())

Okay, maybe that plot is a bit too busy to be good. But note how the difference between plotting a single diet and all diets at the same time is just one more mapping in aes(). No looping or custom functions required.

I hope that was of some use.

Using R: Don’t save your workspace

To everyone learning R: Don’t save your workspace.

When you exit an R session, you’re faced with the question of whether or not to save your workspace. You should almost never answer yes. Saving your workspace creates an image of your current variables and functions, and saves them to a file called ”.RData”. When you re-open R from that working directory, the workspace will be loaded, and all these things will be available to you again. But you don’t want that, so don’t save your workspace.

Loading a saved workspace turns your R script from a program, where everything happens logically according to the plan that is the code, to something akin to a cardboard box taken down from the attic, full of assorted pages and notebooks that may or may not be what they seem to be. You end up having to put an inordinate trust in your old self. I don’t know about your old selves, dear reader, but if they are anything like mine, don’t save your workspace.

What should one do instead? One should source the script often, ideally from freshly minted R sessions, to make sure to always be working with a script that runs and does what it’s supposed to. Storing a data frame in the workspace can seem comforting, but what happens the day I overwrite it by mistake? Don’t save your workspace.

Yes, I’m exaggerating. When using any modern computer system, we rely on saved information and saved state all the time. And yes, every time a computation takes too much time to reproduce, one should write it to a file to load every time. But I think that should be a deliberate choice, worthy of its own save() and load() calls, and certainly not something one does with simple stuff that can be reproduced in the blink of an eye. Put more trust in your script than in your memory, and don’t save your workspace.
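What that deliberate choice might look like in practice: a minimal sketch, with a made-up object and file name, using the save() and load() calls mentioned above:

## Pretend this took a long time to compute
expensive_result <- replicate(1000, mean(rnorm(1000)))

## Save just this object, deliberately, to a file of its own
save(expensive_result, file = "expensive_result.RData")

## In a later session, load exactly that object back instead of recomputing it
load("expensive_result.RData")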

It seems dplyr is overtaking correlation heatmaps

(… on my blog, that is.)

For a long time, my correlation heatmap with ggplot2 was the most viewed post on this blog. It still leads the overall top list, but by far the most searched and visited post nowadays is this one about dplyr (followed by its sibling about plyr).

I fully support this, since data wrangling and reorganization logically comes before plotting (especially in the ggplot2 philosophy).

But it’s also kind of a shame, because it’s not a very good dplyr post, and the one about the correlation heatmap is not a very good ggplot2 post. Thankfully, there is a new edition of the ggplot2 book by Hadley Wickham, and a new book by him and Garrett Grolemund about data analysis with modern R packages. I’m looking forward to reading them.

Personally, I still haven’t made the switch from plyr and reshape2 to dplyr and tidyr. But here is the updated tidyverse-using version of how to quickly calculate summary statistics from a data frame:

library(tidyr)
library(dplyr)
library(magrittr)

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

gather(data, response1, response2, value = "value", key = "variable") %>%
  group_by(sex, treatment, variable) %>%
  summarise(mean = mean(value), sd = sd(value))

Row by row we:

1-3: Load the packages.

5-8: Simulate some nonsense data.

10: Transform the simulated dataset to long form. This means that the two variables response1 and response2 get collected into one column, which will be called ”value”. The key column, here named ”variable”, indicates which of the two variables each row belongs to. (gather is tidyr’s version of melt.)

11: Group the resulting data frame by sex, treatment and variable. (This is like the second argument to d*ply.)

12: Calculate the summary statistics.

Source: local data frame [8 x 5]
Groups: sex, treatment [?]

    sex treatment  variable        mean        sd
  (dbl)     (dbl)     (chr)       (dbl)     (dbl)
1     1         1 response1 -0.02806896 1.0400225
2     1         1 response2 -0.01822188 1.0350210
3     1         2 response1  0.06307962 1.0222481
4     1         2 response2 -0.01388931 0.9407992
5     2         1 response1 -0.06748091 0.9843697
6     2         1 response2  0.01269587 1.0189592
7     2         2 response1 -0.01399262 0.9696955
8     2         2 response2  0.10413442 0.9417059
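For comparison, since I mention above that I haven’t fully made the switch: here is a rough sketch of the same summary done with reshape2 and plyr. This is my own back-translation, not code from an earlier post, and it uses the same simulated ”data” as above:

## reshape2/plyr version of the same summary;
## plyr’s summarise is qualified explicitly to avoid clashing with dplyr’s
library(reshape2)
library(plyr)

molten <- melt(data, id.vars = c("sex", "treatment"),
               measure.vars = c("response1", "response2"))

ddply(molten, c("sex", "treatment", "variable"), plyr::summarise,
      mean = mean(value), sd = sd(value))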

Using R: tibbles and the t.test function

A participant in the R course I’m teaching showed me a case where a tbl_df (the new flavour of data frame provided by the tibble package; standard in new RStudio versions) interacts badly with the t.test function. I had not seen this happen before. The reason is this:

Interacting with legacy code
A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame (tibble announcement on RStudio blog)

Here is code that reproduces the situation (tibble version 1.2):

library(tibble)

data(chickwts)
chick_tibble <- as_tibble(chickwts)
casein <- subset(chickwts, feed == "casein")
sunflower <- subset(chick_tibble, feed == "sunflower")
t.test(sunflower$weight, casein$weight) ## this works
t.test(as.data.frame(sunflower[, 1]), as.data.frame(casein[, 1])) ## this works too
t.test(sunflower[, 1], casein[, 1]) ## this doesn't

Error: Unsupported use of matrix or array for column indexing

I did not know that. The solution, which they found themselves, is to use as.data.frame.
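Another way around it, which I’ll add here for completeness, is to extract the column with [[, since [[ returns a vector for both tibbles and data frames:

## extracting the columns as vectors also works
t.test(sunflower[["weight"]], casein[["weight"]])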

I can see why not dropping to a vector makes sense. I’m sure you’ve at some point expected a data frame and got an ”$ operator is invalid for atomic vectors”. But it’s an unfortunate fact that the number of weird little thingamajigs to remember is always strictly increasing as the language evolves. And it’s a bit annoying that the standard RStudio setup breaks code that uses an old stats function, even if it does so in a somewhat non-obvious way.

Balancing a centrifuge

I saw this cute little paper on arXiv about balancing a centrifuge: Peil & Hauryliuk (2010) A new spin on spinning your samples: balancing rotors in a non-trivial manner. Let us have a look at the maths of balancing a centrifuge.

The way I think most people (including myself) balance their samples is to put them opposite each other, just like Peil & Hauryliuk write. However, there are many more balanced configurations, some of which look really weird. The authors generate three balanced configurations of increasing oddity, show them to researchers, and ask whether they are balanced. About half, 30% and 15% of the researchers, respectively, identified the three configurations as balanced. Here are the configurations:

[Figure: the three configurations, drawn after their paper.]

Take a rotor in a usual bench top centrifuge. It’s a large, in itself balanced, piece of metal with holes to put microcentrifuge tubes in. We assume that all tubes have the same mass m and that the holes are equally spaced. The rotor will spin around its own axis, helping us separate samples and pellet precipitates etc. When the centrifuge is balanced, the centre of mass of the samples will be aligned with the axis of rotation. So, if we place a two-dimensional coordinate system on the axis of rotation like so,

[Figure: a two-dimensional coordinate system placed on the axis of rotation.]

the tubes are positioned on a circle around it:

x_i = r \cos {\theta_i}
y_i = r \sin {\theta_i}

The angle to each position in the rotor will be

\theta(i) = \dfrac{2\pi(i - 1)}{N}

where i is the position in question, starting at 1, and N the number of positions in the rotor. Let’s label each configuration by the numbers of the positions that are occupied. So we could talk about (1, 16)_{30} as the common balanced pair of tubes in a 30-position rotor. (Yeah, I know, counting from 1 is a lot more confusing than counting from zero. Let’s view it as a kind of practice for dealing with genomic coordinates.)

We express the position of each tube (treated as a point mass) as a vector. Since we put the origin on the axis of rotation, these vectors have to sum to zero for the centrifuge to be balanced.

\sum \limits_{i} {m\mathbf{r_i}} = \mathbf{0}

Since the masses are equal, they can be removed, as can the radius, which is constant, and we can consider the x and y coordinates separately.

\left(\begin{array}{c} \sum \limits_{i} {\cos {\theta(i)}} \\ \sum \limits_{i} {\sin {\theta(i)}} \end{array}\right) = \left(\begin{array}{c} 0 \\ 0 \end{array}\right)

For the (1, 16)_{30} configuration, the vectors are

\left(\begin{array}{c} \cos {\theta(1)} \\ \sin {\theta(1)} \end{array}\right) + \left(\begin{array}{c} \cos {\theta(16)} \\ \sin {\theta(16)} \end{array}\right) = \left(\begin{array}{c} \cos {0} \\ \sin {0} \end{array}\right) + \left(\begin{array}{c} \cos {\pi} \\ \sin {\pi} \end{array}\right) = \left(\begin{array}{c} 1 \\ 0 \end{array}\right) + \left(\begin{array}{c} -1 \\ 0 \end{array}\right)

So we haven’t been deluding ourselves. This configuration is balanced. That is about as much maths as I’m prepared to do in LaTeX in a WordPress blog editor. So let’s implement this in R code:

library(magrittr)
theta <- function(n, N) (n - 1) * 2 * pi / N
tube <- function(theta) c(cos(theta), sin(theta))

Now, we can look at Peil & Hauryliuk’s configurations, for instance the first one, (1, 11, 14, 15, 21, 29, 30)_{30}:

positions <- c(1, 11, 14, 15, 21, 29, 30)
tubes <- positions %>% lapply(theta, N = 30) %>% lapply(tube)
c(sum(unlist(lapply(tubes, function(x) x[1]))),
  sum(unlist(lapply(tubes, function(x) x[2]))))

The above code 1) defines the configuration; 2) turns positions into angles and then tube coordinates; and 3) sums the x and y coordinates separately. The result isn’t exactly zero (because of floating-point rounding), but close enough. Putting in their third configuration, (4, 8, 14, 15, 21, 27, 28)_{30}, we again get almost zero. Even this strange-looking configuration seems to be balanced.
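To avoid repeating the summing code for every configuration, we could wrap it in a little helper that checks whether the coordinate sums are near zero. This convenience function (and its tolerance) is my own addition, not from the post:

## Check whether a configuration of occupied positions is balanced
is_balanced <- function(positions, N, tolerance = 1e-12) {
  tubes <- positions %>% lapply(theta, N = N) %>% lapply(tube)
  sums <- c(sum(unlist(lapply(tubes, function(x) x[1]))),
            sum(unlist(lapply(tubes, function(x) x[2]))))
  all(abs(sums) < tolerance)
}

is_balanced(c(1, 16), N = 30)                     ## TRUE
is_balanced(c(4, 8, 14, 15, 21, 27, 28), N = 30)  ## TRUE
is_balanced(c(1, 2), N = 30)                      ## FALSE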

I’m biased because I read the text first, but if someone asked me, I would have to think about the first two configurations, and there is no way I would allow a student to run with the third if I saw it in the lab. That conservative attitude, though not completely scientific, might not be the worst thing. Centrifuge accidents are serious business, and as the authors note:

Finally, non-symmetric arrangement (Fig. 1C) was recognized as balanced by 17% of researchers. Some of these were actually calculating moment of inertia, i.e. were coming to solution knowingly, the rest where basically guessing. The latter should be banished from laboratory practice, since these people are ready to make dangerous decisions without actual understanding of the case, which renders them extremely dangerous in the laboratory settings.

(Plotting code for the first figure is on GitHub.)

Toying with models: The Game of Life with selection

Conway’s Game of life is probably the most famous cellular automaton, consisting of a grid of cells developing according to simple rules. Today, we’re going to add mutation and selection to the game, and let patterns evolve.

The fate of a cell depends on the number of cells that live in the neighbouring positions. A cell with fewer than two neighbours dies from starvation. A cell with more than three neighbours dies from overpopulation. If a position is empty and has exactly three neighbours, it will be filled by a cell. These rules lead to some interesting patterns, such as still lifes that never change, oscillators that alternate between states, patterns that eventually die out but take a long time to do so, patterns that keep generating new cells, and so forth.

[Animations: oscillators and a still life.]

When I played with the Game of life when I was a child, I liked one pattern called ”virus”, that looked a bit like this. On its own, a grid of four-by-four blocks is a still life, but add one cell (the virus), and the whole pattern breaks. This is a version on a 30 x 30 cell board. It unfolds rather slowly, but in the end, a glider collides with a block, and you are left with some oscillators.

[Animations: the blocks still life and the virus pattern on a 30 x 30 board.]

There are probably other interesting ways that evolution could be added to the game of life. We will take a hierarchical approach where the game is taken to describe development, and the unit of selection is the pattern. Each generation, we will create a variable population of patterns, allow them to develop and pick the fittest. So, here the term ”development” refers to what happens to a pattern when applying the rules of life, and the term ”evolution” refers to how the population of patterns changes over the generations. This differs slightly from Game of life terminology, where ”evolution” and ”generation” usually refer to the development of a pattern, but it is consistent with how biologists use the words: development takes place during the life of an organism, and evolution happens over the generations as organisms reproduce and pass on their genes to offspring. I don’t think there’s any deep analogy here, but we can think of the initial state of the board as the heritable material that is being passed on and occasionally mutated. We let the pattern develop, and at some point, we apply selection.

First, we need an implementation of the game of life in R. We will represent the board as a matrix of ones (live cells) and zeroes (empty positions). Here is a function that develops the board one tick in time. After dealing with the corners and edges, it’s very short, but also slow as molasses. The next function does this for a given number of ticks.

## Develop one tick. Return new board matrix.
develop <- function(board_matrix) {
  padded <- rbind(matrix(0, nrow = 1, ncol = ncol(board_matrix) + 2),
                  cbind(matrix(0, ncol = 1, nrow = nrow(board_matrix)), 
                        board_matrix,
                        matrix(0, ncol = 1, nrow = nrow(board_matrix))),
                  matrix(0, nrow = 1, ncol = ncol(board_matrix) + 2))
  new_board <- padded
  for (i in 2:(nrow(padded) - 1)) {
    for (j in 2:(ncol(padded) - 1)) {
      neighbours <- sum(padded[(i-1):(i+1), (j-1):(j+1)]) - padded[i, j]
      if (neighbours < 2 | neighbours > 3) {
        new_board[i, j] <- 0
      }
      if (neighbours == 3) {
        new_board[i, j] <- 1
      }
    }
  }
  new_board[2:(nrow(padded) - 1), 2:(ncol(padded) - 1)]
}

## Develop a board a given number of ticks.
tick <- function(board_matrix, ticks) {
  if (ticks > 0) {
    for (i in 1:ticks) {
      board_matrix <- develop(board_matrix) 
    }
  }
  board_matrix
}

We introduce random mutations to the board. We will use a mutation rate of 0.0011 per cell, which gives us a mean of about one mutation per 30 x 30 board.

## Mutation rate per cell (used as mu below)
mu <- 0.0011

## Mutate a board
mutate <- function(board_matrix, mutation_rate) {
  mutated <- as.vector(board_matrix)
  outcomes <- rbinom(n = length(mutated), size = 1, prob = mutation_rate)
  for (i in 1:length(outcomes)) {
    if (outcomes[i] == 1)
      mutated[i] <- ifelse(mutated[i] == 0, 1, 0)
  }
  matrix(mutated, ncol = ncol(board_matrix), nrow = nrow(board_matrix))
}

I was interested in the virus pattern, so I decided to apply a simple directional selection scheme on the number of cells at tick 80, which is a while after the virus pattern has stabilized itself into oscillators. We will count the number of cells at tick 80 and call that ”fitness”, even if it actually isn’t (it is a trait that affects fitness by virtue of the fact that we select on it). We will allow the top half of the population to produce two offspring each, thus keeping the population size constant at 100 individuals.

## get_fitness uses the pipe, so we need magrittr
library(magrittr)

## Calculates the fitness of an individual at a given time
get_fitness <- function(board_matrix, time) {
  board_matrix %>% tick(time) %>% sum
}

## Develop a generation and calculate fitness
grow <- function(generation) {
  generation$fitness <- sapply(generation$board, get_fitness, time = 80)
  generation
}

## Select a generation based on fitness, and create the next generation,
## adding mutation.
next_generation <- function(generation) {
  keep <- order(generation$fitness, decreasing = TRUE)[1:50]
  new_generation <- list(board = vector(mode = "list", length = 100),
                         fitness = numeric(100))
  ix <- rep(keep, each = 2)
  for (i in 1:100) new_generation$board[[i]] <- generation$board[[ix[i]]]
  new_generation$board <- lapply(new_generation$board, mutate, mutation_rate = mu)
  new_generation
}

## Evolve a board, with mutation and selection, for a number of generations.
evolve <- function(board, n_gen = 10) { 
  generations <- vector(mode = "list", length = n_gen)

  generations[[1]] <- list(board = vector(mode = "list", length = 100),
                           fitness = numeric(100))
  for (i in 1:100) generations[[1]]$board[[i]] <- board
  generations[[1]]$board <- lapply(generations[[1]]$board, mutate, mutation_rate = mu)

  for (i in 1:(n_gen - 1)) {
    generations[[i]] <- grow(generations[[i]])
    generations[[i + 1]] <- next_generation(generations[[i]])
  }
  generations[[n_gen]] <- grow(generations[[n_gen]])
  generations
}
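To actually run this, we need a starting board. Here is a sketch of how the blocks board and a run could be set up; the exact placement of the nine blocks is my own guess for illustration:

## Build a 30 x 30 board with nine 2 x 2 blocks (36 cells);
## the placement is an illustration, not necessarily the layout in the figures
blocks <- matrix(0, nrow = 30, ncol = 30)
for (i in c(5, 14, 23)) {
  for (j in c(5, 14, 23)) {
    blocks[i:(i + 1), j:(j + 1)] <- 1
  }
}
sum(blocks) ## 36

## Run ten generations of mutation and selection (slow; see the end of the post)
generations <- evolve(blocks, n_gen = 10)

## Mean number of cells at time 80 in each generation
sapply(generations, function(generation) mean(generation$fitness))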

Let me now tell you that I was almost completely wrong about what happens to this pattern once you apply selection. I thought that the initial pattern of nine stable blocks (36 cells) was pretty good, that it would be preserved for a long time, and that virus-like patterns (like the first animation above) would mostly have degenerated by tick 80. As this plot of the evolution of the number of cells in one replicate shows, I grossly underestimated this pattern. The y-axis is the number of cells at time 80, the x-axis shows individuals, and the vertical lines separate generations. Already by generation five, most individuals do better than 36 cells in this case:

[Figure: number of cells at time 80 for each individual, one replicate starting from the blocks pattern.]

As one example, here is the starting position and the state at time 80 for a couple of individuals from generation 10 of one of my replicates:

[Figures: starting positions and states at time 80 for two individuals from generation 10.]

Here is how the average cell number at time 80 evolves in five replicates. Clearly, things are still going on at generation 10, not only in the replicate shown above.

[Figure: mean number of cells at time 80 per generation, blocks pattern, five replicates.]

Here is the same plot for the virus pattern I showed above, i.e. the blocks but with one single added cell, fixed in the starting population. Prior genetic architecture matters. Even if the virus pattern has fewer cells than the blocks pattern at time 80, it is apparently a better starting point to quickly evolve more cells:

[Figure: mean number of cells at time 80 per generation, virus pattern, five replicates.]

And finally, out of curiosity, what happens if we start with an empty 30 x 30 board?

[Figure: mean number of cells at time 80 per generation, starting from an empty board, five replicates.]

Not much. The simple still life block evolves a lot. But in my replicate three, this creature emerged. ”Life, uh, finds a way.”

[Figure: a pattern that emerged from an empty board in replicate three.]

Unfortunately, many of the selected patterns extended to the edges of the board, making them play not precisely the game of life, but the game of life with edge effects. I’d like to use a much bigger board and see how far patterns extend. It would also be fun to follow them longer. To do that, I would need to implement a more efficient way to update the board (this is very possible, but I was lazy). It would also be fun to select for something more complex, with multiple fitness components, potentially in conflict, e.g. favouring patterns that grow large at a later time while being as small as possible at an earlier time.
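On the efficiency point, one way to speed up the board update, which I sketch here off the cuff rather than take from the post’s code, is to count neighbours by summing shifted copies of the padded board instead of looping over every cell:

## Mostly vectorised version of develop(); should give the same boards, faster
develop_fast <- function(board_matrix) {
  n <- nrow(board_matrix)
  m <- ncol(board_matrix)
  padded <- matrix(0, nrow = n + 2, ncol = m + 2)
  padded[2:(n + 1), 2:(m + 1)] <- board_matrix
  ## Sum the eight shifted copies of the board to get neighbour counts
  neighbours <- matrix(0, nrow = n, ncol = m)
  for (di in -1:1) {
    for (dj in -1:1) {
      if (di == 0 && dj == 0) next
      neighbours <- neighbours +
        padded[(2 + di):(n + 1 + di), (2 + dj):(m + 1 + dj)]
    }
  }
  ## Birth on exactly three neighbours, survival on two or three
  ifelse(neighbours == 3 | (board_matrix == 1 & neighbours == 2), 1, 0)
}

If it gives the same results as develop() on a few test boards, it could be dropped in wherever develop() is called, for example inside tick().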

Code is on GitHub, including functions to display and animate boards with the animation package and ImageMagick, and code for the plots. Again, the blocks_selection.R script is slow, so leave it running and go do something else.