# It seems dplyr is overtaking correlation heatmaps

(… on my blog, that is.)

For a long time, my correlation heatmap with ggplot2 was the most viewed post on this blog. It still leads the overall top list, but by far the most searched and visited post nowadays is this one about dplyr (followed by it’s sibling about plyr).

I fully support this, since data wrangling and reorganization logically comes before plotting (especially in the ggplot2 philosophy).

But it’s also kind of a shame, because it’s not a very good dplyr post, and the one about the correlation heatmap is not a very good ggplot2 post. Thankfully, there is a new edition of the ggplot2 book by Hadley Wickham, and a new book by him and Garrett Grolemund about data analysis with modern R packages. I’m looking forward to reading them.

Personally, I still haven’t made the switch from plyr and reshape2 to dplyr and tidyr. But here is the updated tidyverse-using version of how to quickly calculate summary statistics from a data frame:

library(tidyr)
library(dplyr)
library(magrittr)

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
treatment = rep(c(1, 2), 1000),
response1 = rnorm(2000, 0, 1),
response2 = rnorm(2000, 0, 1))

gather(data, response1, response2, value = "value", key = "variable") %>%
group_by(sex, treatment, variable) %>%
summarise(mean = mean(value), sd = sd(value))


Row by row we:

5-8: Simulate some nonsense data.

10: Transform the simulated dataset to long form. This means that the two variables response1 and response2 get collected to one column, which will be called ”value”. The column ”key” will indicate which variable each row belongs to. (gather is tidyr’s version of melt.)

11: Group the resulting dataframe by sex, treatment and variable. (This is like the second argument to d*ply.)

12: Calculate the summary statistics.

Source: local data frame [8 x 5]
Groups: sex, treatment [?]

sex treatment  variable        mean        sd
(dbl)     (dbl)     (chr)       (dbl)     (dbl)
1     1         1 response1 -0.02806896 1.0400225
2     1         1 response2 -0.01822188 1.0350210
3     1         2 response1  0.06307962 1.0222481
4     1         2 response2 -0.01388931 0.9407992
5     2         1 response1 -0.06748091 0.9843697
6     2         1 response2  0.01269587 1.0189592
7     2         2 response1 -0.01399262 0.9696955
8     2         2 response2  0.10413442 0.9417059


# Using R: tibbles and the t.test function

A participant in the R course I’m teaching showed me a case where a tbl_df (the new flavour of data frame provided by the tibble package; standard in new RStudio versions) interacts badly with the t.test function. I had not seen this happen before. The reason is this:

Interacting with legacy code
A handful of functions are don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame (tibble announcement on RStudio blog)

Here is code that reproduces the situation (tibble version 1.2):

data(chickwts)
chick_tibble <- as_tibble(chickwts)
casein <- subset(chickwts, feed == "casein")
sunflower <- subset(chick_tibble, feed == "sunflower")
t.test(sunflower$weight, casein$weight) ## this works
t.test(as.data.frame(sunflower[, 1]), as.data.frame(casein[, 1])) ## this works too
t.test(sunflower[, 1], casein[, 1]) ## this doesn't


Error: Unsupported use of matrix or array for column indexing

I did not know that. The solution, which they found themselves, is to use as.data.frame.

I can see why not dropping to a vector makes sense. I’m sure you’ve at some point expected a data frame and got an ”operator is invalid for atomic vectors”. But it’s an unfortunate fact that number of weird little thingamajigs to remember is always strictly increasing as the language evolves. And it’s a bit annoying that the standard RStudio setup breaks code that uses an old stats function, even if it’s in somewhat non-obvious way. # Balancing a centrifuge I saw this cute little paper on arxiv about balancing a centrifuge: Peil & Hauryliuk (2010) A new spin on spinning your samples: balancing rotors in a non-trivial manner. Let us have a look at the maths of balancing a centrifuge. The way I think most people (including myself) balance their samples is to put them opposite of each other, just like Peil & Hauryliuk write. However, there are many more balanced configurations, some of which look really weird. The authors generate three balanced configurations with increasing oddity, show them to researchers and ask them whether they are balanced. About half, 30% and 15% of them identified each configuration as balanced. Here are the configurations: (Drawn after their paper.) Take a rotor in a usual bench top centrifuge. It’s a large, in itself balanced, piece of metal with holes to put microcentrifuge tubes in. We assume that all tubes have the same mass m and that the holes are equally spaced. The rotor will spin around its own axis, helping us separate samples and pellet precipitates etc. When the centrifuge is balanced, the centre of mass of the samples will be aligned with the axis of rotation. So, if we place a two-dimensional coordinate system on the axis of rotation like so, the tubes are positioned on a circle around it: $x_i = r \cos {\theta_i}$ $y_i = r \sin {\theta_i}$ The angle to each position in the rotor will be $\theta(i) = \dfrac{2\pi(i - 1)}{N}$ where i is the position in question, starting at 1, and N the number of positions in the rotor. Let’s label each configuration by the numbers of the positions that are occupied. So we could talk about (1, 16)30 as the common balanced pair of tubes in a 30-position rotor. (Yeah, I know, counting from 1 is a lot more confusing than counting from zero. Let’s view it as a kind of practice for dealing with genomic coordinates.) We express the position of each tube (treated as a point mass) as a vector. Since we put the origin on the axis of rotation, these vectors have to sum to zero for the centrifuge to be balanced. $\sum \limits_{i} {m\mathbf{r_i}} = \mathbf{0}$ Since the masses are equal, they can be removed, as can the radius, which is constant, and we can consider the x and y coordinates separately. $\left(\begin{array}{c} \sum \limits_{i} {\cos {\theta(i)}} \\ \sum \limits_{i} {\sin {\theta(i)}} \end{array}\right) = \left(\begin{array}{c} 0 \\ 0 \end{array}\right)$ For the (1, 16)30 configuration, the vectors are $\left(\begin{array}{c} \cos {\theta(1)} \\ \sin {\theta(1)} \end{array}\right) + \left(\begin{array}{c} \cos {\theta(16)} \\ \sin {\theta(16)} \end{array}\right) = \left(\begin{array}{c} \cos {0} \\ \sin {0} \end{array}\right) + \left(\begin{array}{c} \cos {\pi} \\ \sin {\pi} \end{array}\right) = \left(\begin{array}{c} 1 \\ 0 \end{array}\right) + \left(\begin{array}{c} -1 \\ 0 \end{array}\right)$ So we haven’t been deluding ourselves. This configuration is balanced. That is about as much maths as I’m prepared to do in LaTex in a WordPress blog editor. So let’s implement this in R code: library(magrittr) theta <- function(n, N) (n - 1) * 2 * pi / N tube <- function(theta) c(cos(theta), sin(theta))  Now, we can look at Peil & Hauryliuk’s configurations, for instance the first (1, 11, 14, 15, 21, 29, 30)30 positions <- c(1, 11, 14, 15, 21, 29, 30) tubes <- positions %>% lapply(theta, N = 30) %>% lapply(tube) c(sum(unlist(lapply(tubes, function(x) x[1]))), sum(unlist(lapply(tubes, function(x) x[2]))))  The above code 1) defines the configuration; 2) turns positions into angles and then tube coordinates; and 3) sums the x and y coordinates separately. The result isn’t exactly zero (for computational reasons), but close enough. Putting in their third configuration, (4, 8, 14, 15, 21, 27, 28)30, we again get almost zero. Even this strange-looking configuration seems to be balanced. I’m biased because I read the text first, but if someone asked me, I would have to think about the first two configurations, and there is no way I would allow a student to run with the third if I saw it in the lab. That conservative attitude, though not completely scientific, might not be the worst thing. Centrifuge accidents are serious business, and as the authors note: Finally, non-symmetric arrangement (Fig. 1C) was recognized as balanced by 17% of researchers. Some of these were actually calculating moment of inertia, i.e. were coming to solution knowingly, the rest where basically guessing. The latter should be banished from laboratory practice, since these people are ready to make dangerous decisions without actual understanding of the case, which renders them extremely dangerous in the laboratory settings. (Plotting code for the first figure is on Github.) # R in genomics @ SciLifeLab, Solna Dear diary, I went to the Stockholm R useR group meetup on R in genomics at the Stockholm node of SciLifeLab. It was nice. If I had worked a bit closer I would attend meetups all the time. I even got to be pretentious with my notebook while waiting for the train. The speakers were: Jakub Orzechowski Westholm on R and genomics in general. He demonstrated genome browser-style tracks with Gviz, some GenomicRanges, and a couple of common plots of gene expression data. I have been on the fence about what package I should use for drawing genes and variants along the genome. I should play with Gviz. Daniel Klevebring on clinical sequencing and how he uses R (not that much) in sequencing pipelines aimed at targeting the right therapy to patients based on the mutations in their cancer cells. He mentioned some getopt snippets for getting R to play nicely on the command line, which is something I should definitely try more! Finally, Arvind Singh Mer on predictive modelling for clinical genomics (like the abovementioned ClinSeq data). He showed the caret package for machine learning, with an elastic net regression. I don’t know the rest of the audience, so maybe the choice to gear talks towards the non-bio* person was spot on, but that made things a bit less interesting for me. For instance, in Jakub’s talk about gene expression, I would’ve preferred more about the messy stuff: how to make that nice gene-by-sample matrix in the first place, and if R can be of any help in that process; also, in the other end, what models one would use after that first pass of visualisation. But this isn’t a criticism of the presenters — time and complexity constraints apply. (If I was asked to present how I use R any demos would be toy analyses of clean datasets. That is the way these things go.) We also heard repeated praise for and recommendations of the hadleyverse and data.table. I’m not a data.tabler myself, but I probably should be. And I completely agree about the value of dplyr — there’s this one analysis where a couple of lines with dplyr changed it from ”argh, do I have to rewrite this in C?” to being workable. I think we also saw all the three plotting systems: base graphics, ggplot2 and lattice in action. # Finding the distance from ChIP signals to genes I’ve had a couple of months off from blogging. Time for some computer-assisted biology! Robert Griffin asks on Stack Exchange about finding the distance between HP1 binding sites and genes in Drosophila melanogaster. We can get a rough idea with some public chromatin immunoprecipitation data, R and the wonderful BEDTools. ### Finding some binding sites There are indeed some ChIP-seq datasets on HP1 available. I looked up these ones from modENCODE: modENCODE_3391 and modENCODE_3392, using two different antibodies for Hp1b in 16-24 h old embroys. I’m not sure since the modENCODE site doesn’t seem to link datasets to publications, but I think this is the paper where the results are reported: A cis-regulatory map of the Drosophila genome (Nègre & al 2011). What they’ve done, in short, is cross-linking with formaldehyde, sonicate DNA into fragments, capture fragments with either of the two antibodies and sequence those fragments. They aligned reads with Eland (Illumina’s old proprietary aligner) and called peaks (i.e. regions where there is a lot of reads, which should reflect regions bound by Hp1b) with MACS. We can download their peaks in general feature format. I don’t know whether there is any way to make completely computation predictions of Hp1 binding sites but I doubt it. ### Some data cleaning The files are available from ftp, and for the below analysis I’ve unzipped them and called them modENCODE_3391.gff3 and modENCODE_3392.gff3. GFF is one of all those tab separated text files that people use for genomic coordinates. If you do any bioinformatics type work you will have to convert back and forth between them and I suggest bookmarking the UCSC Genome Browser Format FAQ. Even when we trust in their analysis, some processing of files is always required. In this case, MACS sometimes outputs peaks with negative start coordinates in the beginning of a chromosome. BEDTools will have none of that, because ”malformed GFF entry at line … Start was greater than end”. In this case, it happens only at a few lines, and I decided to set those start coordinates to 1 instead. We need a small script to solve that. As I’ve written before, any language will do, but I like R and tend to do my utility scripting in R (and bash). If the files were incredibly huge and didn’t fit in memory, we’d have to work through the files line by line or chunk by chunk. But in this case we can just read everything at once and operate on it with vectorised R commands, and then write the table again. modENCODE_3391 <- read.table("modENCODE_3391.gff3", stringsAsFactors=F, sep="\t") modENCODE_3392 <- read.table("modENCODE_3392.gff3", stringsAsFactors=F, sep="\t") fix.coord <- function(gff) { gffV4[which(gff$V4 < 1)] <- 1 gff } write.gff <- function(gff, file) { write.table(gff, file=file, row.names=F, col.names=F, quote=F, sep="\t") } write.gff(fix.coord(modENCODE_3391), file="cleaned_3391.gff3") write.gff(fix.coord(modENCODE_3392), file="cleaned_3392.gff3")  ### Flybase transcripts To find the distance to genes, we need to know where the genes are. The best source is probably the annotation made by Flybase, which I downloaded from the Ensembl ftp in General transfer format (GTF, which is close enough to GFF that we don’t have to care about the differences right now). This file contains a lot of different features. We extract the transcripts and find where the transcript model starts, taking into account whether the transcript is in the forward or reverse direction (this information is stored in columns 4, 5 and 7 of the GTF file). We store this in a new GTF file of transcript start positions, which is the one we will feed to BEDTools: ensembl <- read.table("Drosophila_melanogaster.BDGP5.75.gtf", stringsAsFactors=F, sep="\t") transcript <- subset(ensembl, V3=="transcript") transcript.start <- transcript transcript.start$V3 <- "transcript_start"
transcript.start$V4 <- transcript.start$V5 <- ifelse(transcript.start$V7 == "+", transcript$V4, transcript\$V5)

write.gff(transcript.start, file="ensembl_transcript_start.gtf")


### Finding distance with BEDTools

Time to find the closest feature to each transcript start! You could do this in R with GenomicRanges, but I like BEDTools. It’s a command line tool, and if you haven’t already you will need to download and compile it, which I recall being painless.

bedtools closest is the command that finds, for each feature in one file, the closest feature in the other file. The -a and -b flags tells BEDTools which files to operate on, and the -d flag that we also want it to output the distance. BEDTools writes output to standard out, so we use ”>” to capture it in a text file.

Here is the bash script. I put the above R code in clean_files.R and added it as an Rscript line at the beginning, so I could run it all with one file.

#!/bin/bash
Rscript clean_files.R

bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3391.gff3 \
> closest_element_3391.txt
bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3392.gff3 \
> closest_element_3392.txt


### Some results

With the resulting file we can go back to R and ggplot2 and draw cute graphs like this, which shows the distribution of distances from transcript to Hp1b peak for protein coding and noncoding transcripts separately. Note the different y-scales (there are way more protein coding genes in the annotation) and the 10-logarithm plus one transformation on the x-axis. The plus one is to show the zeroes; BEDTools returns a distance of 0 for transcripts that overlap a Hp1b site.

closest_3391 <- read.table("~/blogg/dmel_hp1/closest_element_3391.txt", header=F, sep="\t")

library(ggplot2)
qplot(x=log10(V19 + 1), data=subset(closest_3391, V2 %in% c("protein_coding", "ncRNA"))) +
facet_wrap(~V2, scale="free_y")


Or by merging the datasets from different antibodies, we can draw this strange beauty, which pretty much tells us that the antibodies do not give the same result in terms of the closest feature. To figure out how they differ, one would have to look more closely into the genomic distribution of the peaks.

closest_3392 <- read.table("~/blogg/dmel_hp1/closest_element_3392.txt", header=F, sep="\t")

combined <- merge(closest_3391, closest_3392,
by.x=c("V1", "V2","V4", "V5", "V9"),
by.y=c("V1", "V2","V4", "V5", "V9"))

qplot(x=log10(V19.x+1), y=log10(V19.y+1), data=combined)


(If you’re wondering about the points that end up below 0, those are transcripts where there are no peaks called on that chromosome in one of the datasets. BEDTools returns -1 for those that lack matching features on the same chromosome and R will helpfully transform them to -Inf.)

The question mentioned the DGRP. I don’t know that anyone has looked at ChIP in the DGRP lines, but wouldn’t that be fun? Quantitative genetics of DNA binding protein variation in DGRP and integration with eQTL … What one could do already, though, is take the interesting sites of Hp1 binding and overlap them with the genetic variants of the DGRP lines. I don’t know if that would tell you much — does anyone know what kind of variant would affect Hp1 binding?

Happy hacking!

# More fun with %.% and %>%

The %.% operator in dplyr allows one to put functions together without lots of nested parentheses. The flanking percent signs are R’s way of denoting infix operators; you might have used %in% which corresponds to the match function or %*% which is matrix multiplication. The %.% operator is also called chain, and what it does is rearrange the call to pass its left hand side on as a parameter to the right hand side function. As noted in the documentation this makes function calls read from left to right instead of inside and out. Yesterday we we took a simulated data frame, called data, and calculated some summary statistics. We could put the entire script together with %.%:

library(dplyr)
data %.%
melt(id.vars=c("treatment", "sex")) %.%
group_by(sex, treatment, variable) %.%
summarise(mean(value))


I haven’t figured out what would be the best indentation here, but I think this looks pretty okay. Of course it works for non-dplyr functions as well, but they need to take the input data as their first argument.

data %.% lm(formula=response1 ~ factor(sex)) %.% summary()


As mentioned, dplyr is not the only package that has something like this, and according to a comment from Hadley Wickham, future dplyr will use the magrittr package instead, a package that adds piping to R. So let’s look at magrittr! The magrittr %>% operator works much the same way, except it allows one to put ”.” where the data is supposed to go. This means that the data doesn’t have to be the first argument to the function. For example, we can do this, which would give an error with dplyr:

library(magrittr)
data %>% lm(response1 ~ factor(sex), .) %>% summary()


Moreover, Conrad Rudolph has used the operators %.%, %|>% and %|% in his own package for functional composition, chaining and piping. And I’m sure he is not the only one; there are several more packages that bring more new ways to define and combine functions into R. I hope I will revisit this topic when I’ve gotten used to it and decided what I like and don’t like. This might be confusing for a while with similar and rather cryptic operators that do slightly different things, but I’m sure it will turn out to be a useful development.

# Using R: quickly calculating summary statistics (with dplyr)

I know I’m on about Hadley Wickham‘s packages a lot. I’m not the president of his fanclub, but if there is one I’d certainly like to be a member. dplyr is going to be a new and improved ddply: a package that applies functions to, and does other things to, data frames. It is also faster and will work with other ways of storing data, such as R’s relational database connectors. I use plyr all the time, and obviously I want to start playing with dplyr, so I’m going to repeat yesterday’s little exercise with dplyr. Readers should be warned: this is really just me playing with dplyr, so the example will not be particularly profound. The post at the Rstudio blog that I just linked contains much more information.

So, here comes the code to do the thing we did yesterday but with dplyr:

## The code for the toy data is exactly the same
data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
treatment = rep(c(1, 2), 1000),
response1 = rnorm(2000, 0, 1),
response2 = rnorm(2000, 0, 1))

## reshape2 still does its thing:
library(reshape2)
melted <- melt(data, id.vars=c("sex", "treatment"))

## This part is new:
library(dplyr)
grouped <- group_by(melted, sex, treatment)
summarise(grouped, mean=mean(value), sd=sd(value))


When we used plyr yesterday all was done with one function call. Today it is two: dplyr has a separate function for splitting the data frame into groups. It is called group_by and returns the grouped data. Note that no quotation marks or concatenation were used when passing the column names. This is what it looks like if we print it:

Source: local data frame [4,000 x 4]
Groups: sex, treatment, variable

sex treatment  variable       value
1    1         1 response1 -0.15668214
2    1         2 response1 -0.40934759
3    1         1 response1  0.07103731
4    1         2 response1  0.15113270
5    1         1 response1  0.30836910
6    1         2 response1 -1.41891407
7    1         1 response1 -0.07390246
8    1         2 response1 -1.34509686
9    1         1 response1  1.97215697
10   1         2 response1 -0.08145883


The grouped data is still a data frame, but it contains a bunch of attributes that contain information about grouping.

The next function is a call to the summarise function. This is a new version of a summarise function similar to one in plyr. It will summarise the grouped data in columns given by the expressions you feed it. Here, we calculate mean and standard deviation of the values.

Source: local data frame [8 x 5]
Groups: sex, treatment

sex treatment  variable         mean        sd
1   1         1 response1  0.021856280 1.0124371
2   1         1 response2  0.045928150 1.0151670
3   1         2 response1 -0.065017971 0.9825428
4   1         2 response2  0.011512867 0.9463053
5   2         1 response1 -0.005374208 1.0095468
6   2         1 response2 -0.051699624 1.0154782
7   2         2 response1  0.046622111 0.9848043
8   2         2 response2 -0.055257295 1.0134786


Maybe the new syntax is slightly clearer. Of course, there are alternative ways of expressing it, one of which is pretty interesting. Here are two equivalent versions of the dplyr calls:

summarise(group_by(melted, sex, treatment, variable),
mean=mean(value), sd=sd(value))

melted %.% group_by(sex, treatment, variable) %.%
summarise(mean=mean(value), sd=sd(value))


The first one is nothing special: we’ve just put the group_by call into summarise. The second version, though, is a strange creature. dplyr uses the operator %.% to denote taking what is on the left and putting it into the function on the right. Reading from the beginning of the expression we take the data (melted), push it through group_by and pass it to summarise. The other arguments to the functions are given as usual. This may seem very alien if you’re used to R syntax, or you might recognize it from shell pipes. This is not the only attempt make R code less nested and full of parentheses. There doesn’t seem to be any consensus yet, but I’m looking forward to a future where we can write points-free R.