# Using R: tibbles and the t.test function

A participant in the R course I’m teaching showed me a case where a tbl_df (the new flavour of data frame provided by the tibble package; standard in new RStudio versions) interacts badly with the t.test function. I had not seen this happen before. The reason is this:

> Interacting with legacy code
> A handful of functions don't work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame. (tibble announcement on the RStudio blog)

Here is code that reproduces the situation (tibble version 1.2):

library(tibble)

data(chickwts)
chick_tibble <- as_tibble(chickwts)
casein <- subset(chickwts, feed == "casein")
sunflower <- subset(chick_tibble, feed == "sunflower")
t.test(sunflower$weight, casein$weight) ## this works
t.test(as.data.frame(sunflower[, 1]), as.data.frame(casein[, 1])) ## this works too
t.test(sunflower[, 1], casein[, 1]) ## this doesn't


Error: Unsupported use of matrix or array for column indexing

I did not know that. The solution, which they found themselves, is to use as.data.frame.
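Another way around it, added here for completeness (not part of the original solution), is to extract the column as a plain vector with double brackets; [[ returns a vector even from a tibble:

## Extract the weight columns as plain numeric vectors instead of one-column tibbles
t.test(sunflower[["weight"]], casein[["weight"]])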

transcript.start$V4 <- transcript.start$V5 <- ifelse(transcript.start$V7 == "+",
    transcript$V4, transcript$V5)
write.gff(transcript.start, file="ensembl_transcript_start.gtf")

### Finding distance with BEDTools

Time to find the closest feature to each transcript start! You could do this in R with GenomicRanges, but I like BEDTools. It's a command line tool, and if you haven't already, you will need to download and compile it, which I recall being painless. bedtools closest is the command that finds, for each feature in one file, the closest feature in the other file. The -a and -b flags tell BEDTools which files to operate on, and the -d flag says that we also want it to output the distance. BEDTools writes output to standard out, so we use ">" to capture it in a text file. Here is the bash script. I put the above R code in clean_files.R and added it as an Rscript line at the beginning, so I could run it all with one file.

#!/bin/bash
Rscript clean_files.R
bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3391.gff3 \
    > closest_element_3391.txt
bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3392.gff3 \
    > closest_element_3392.txt

### Some results

With the resulting file we can go back to R and ggplot2 and draw cute graphs like this, which shows the distribution of distances from transcript to Hp1b peak for protein coding and noncoding transcripts separately. Note the different y-scales (there are way more protein coding genes in the annotation) and the 10-logarithm plus one transformation on the x-axis. The plus one is to show the zeroes; BEDTools returns a distance of 0 for transcripts that overlap a Hp1b site.

closest_3391 <- read.table("~/blogg/dmel_hp1/closest_element_3391.txt", header=F, sep="\t")

library(ggplot2)
qplot(x=log10(V19 + 1), data=subset(closest_3391, V2 %in% c("protein_coding", "ncRNA"))) +
    facet_wrap(~V2, scale="free_y")

Or by merging the datasets from different antibodies, we can draw this strange beauty, which pretty much tells us that the antibodies do not give the same result in terms of the closest feature. To figure out how they differ, one would have to look more closely into the genomic distribution of the peaks.

closest_3392 <- read.table("~/blogg/dmel_hp1/closest_element_3392.txt", header=F, sep="\t")
combined <- merge(closest_3391, closest_3392,
                  by.x=c("V1", "V2", "V4", "V5", "V9"),
                  by.y=c("V1", "V2", "V4", "V5", "V9"))
qplot(x=log10(V19.x+1), y=log10(V19.y+1), data=combined)

(If you're wondering about the points that end up below 0, those are transcripts where there are no peaks called on that chromosome in one of the datasets. BEDTools returns -1 for those that lack matching features on the same chromosome, and R will helpfully transform them to -Inf.)

### About the DGRP

The question mentioned the DGRP. I don't know that anyone has looked at ChIP in the DGRP lines, but wouldn't that be fun? Quantitative genetics of DNA binding protein variation in DGRP and integration with eQTL … What one could do already, though, is take the interesting sites of Hp1 binding and overlap them with the genetic variants of the DGRP lines. I don't know if that would tell you much; does anyone know what kind of variant would affect Hp1 binding?

Happy hacking!

# More fun with %.% and %>%

The %.% operator in dplyr allows one to put functions together without lots of nested parentheses. The flanking percent signs are R's way of denoting infix operators; you might have used %in%, which corresponds to the match function, or %*%, which is matrix multiplication.
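As an aside (my addition, not from the original post), any function whose name is wrapped in percent signs can be used in infix position, which is all that these operators are. A minimal sketch:

## A toy infix operator: paste its two arguments together
`%+%` <- function(lhs, rhs) paste(lhs, rhs)
"infix" %+% "operators"   ## "infix operators"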
The %.% operator is also called chain, and what it does is rearrange the call to pass its left hand side on as a parameter to the right hand side function. As noted in the documentation, this makes function calls read from left to right instead of inside and out. Yesterday we took a simulated data frame, called data, and calculated some summary statistics. We could put the entire script together with %.%:

library(dplyr)

data %.%
    melt(id.vars=c("treatment", "sex")) %.%
    group_by(sex, treatment, variable) %.%
    summarise(mean(value))

I haven't figured out what would be the best indentation here, but I think this looks pretty okay. Of course it works for non-dplyr functions as well, but they need to take the input data as their first argument.

data %.% lm(formula=response1 ~ factor(sex)) %.% summary()

As mentioned, dplyr is not the only package that has something like this, and according to a comment from Hadley Wickham, future dplyr will use the magrittr package instead, a package that adds piping to R. So let's look at magrittr! The magrittr %>% operator works much the same way, except it allows one to put "." where the data is supposed to go. This means that the data doesn't have to be the first argument to the function. For example, we can do this, which would give an error with dplyr:

library(magrittr)

data %>% lm(response1 ~ factor(sex), .) %>% summary()

Moreover, Conrad Rudolph has used the operators %.%, %|>% and %|% in his own package for functional composition, chaining and piping. And I'm sure he is not the only one; there are several more packages that bring new ways to define and combine functions into R. I hope I will revisit this topic when I've gotten used to it and decided what I like and don't like. This might be confusing for a while, with similar and rather cryptic operators that do slightly different things, but I'm sure it will turn out to be a useful development.

# Using R: quickly calculating summary statistics (with dplyr)

I know I'm on about Hadley Wickham's packages a lot. I'm not the president of his fanclub, but if there is one I'd certainly like to be a member. dplyr is going to be a new and improved ddply: a package that applies functions to, and does other things to, data frames. It is also faster and will work with other ways of storing data, such as R's relational database connectors. I use plyr all the time, and obviously I want to start playing with dplyr, so I'm going to repeat yesterday's little exercise with dplyr. Readers should be warned: this is really just me playing with dplyr, so the example will not be particularly profound. The post at the RStudio blog that I just linked contains much more information. So, here comes the code to do the thing we did yesterday, but with dplyr:

## The code for the toy data is exactly the same
data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

## reshape2 still does its thing:
library(reshape2)
melted <- melt(data, id.vars=c("sex", "treatment"))

## This part is new:
library(dplyr)
grouped <- group_by(melted, sex, treatment, variable)
summarise(grouped, mean=mean(value), sd=sd(value))

When we used plyr yesterday, all was done with one function call. Today it is two: dplyr has a separate function for splitting the data frame into groups. It is called group_by and returns the grouped data. Note that no quotation marks or concatenation were used when passing the column names.
This is what it looks like if we print it:

Source: local data frame [4,000 x 4]
Groups: sex, treatment, variable

   sex treatment  variable       value
1    1         1 response1 -0.15668214
2    1         2 response1 -0.40934759
3    1         1 response1  0.07103731
4    1         2 response1  0.15113270
5    1         1 response1  0.30836910
6    1         2 response1 -1.41891407
7    1         1 response1 -0.07390246
8    1         2 response1 -1.34509686
9    1         1 response1  1.97215697
10   1         2 response1 -0.08145883

The grouped data is still a data frame, but it contains a bunch of attributes with information about the grouping. The next step is a call to the summarise function. This is a new version of a summarise function similar to the one in plyr. It will summarise the grouped data in columns given by the expressions you feed it. Here, we calculate the mean and standard deviation of the values.

Source: local data frame [8 x 5]
Groups: sex, treatment

  sex treatment  variable         mean        sd
1   1         1 response1  0.021856280 1.0124371
2   1         1 response2  0.045928150 1.0151670
3   1         2 response1 -0.065017971 0.9825428
4   1         2 response2  0.011512867 0.9463053
5   2         1 response1 -0.005374208 1.0095468
6   2         1 response2 -0.051699624 1.0154782
7   2         2 response1  0.046622111 0.9848043
8   2         2 response2 -0.055257295 1.0134786

Maybe the new syntax is slightly clearer. Of course, there are alternative ways of expressing it, one of which is pretty interesting. Here are two equivalent versions of the dplyr calls:

summarise(group_by(melted, sex, treatment, variable),
          mean=mean(value), sd=sd(value))

melted %.%
    group_by(sex, treatment, variable) %.%
    summarise(mean=mean(value), sd=sd(value))

The first one is nothing special: we've just put the group_by call into summarise. The second version, though, is a strange creature. dplyr uses the operator %.% to denote taking what is on the left and putting it into the function on the right. Reading from the beginning of the expression, we take the data (melted), push it through group_by and pass it to summarise. The other arguments to the functions are given as usual. This may seem very alien if you're used to R syntax, or you might recognize it from shell pipes. This is not the only attempt to make R code less nested and full of parentheses. There doesn't seem to be any consensus yet, but I'm looking forward to a future where we can write points-free R.

# Using R: quickly calculating summary statistics from a data frame

A colleague asked: I have a lot of data in a table and I'd like to pull out some summary statistics for different subgroups. Can R do this for me quickly?

Yes, there are several pretty convenient ways. I wrote about this in the recent post on the barplot, but as this is an important part of quickly getting something useful out of R, just like importing data, I'll break it out into a post of its own. I will present a solution that uses the plyr and reshape2 packages. You can do the same with base R, and there's nothing wrong with base R, but I find that plyr and reshape2 make things convenient and easy to remember. The apply family of functions in base R does the same job as plyr, but with a slightly different interface. I strongly recommend beginners to start with plyr or the apply functions, and not what I did initially, which was nested for loops and hard bracket indexing.

We'll go through and see what the different parts do. First, simulate some data. Again, when you do this, you usually have a table already, and you can ignore the simulation code.
Usually a well-formed data frame will look something like this: a table where each observation is a unit, such as an individual, and each column gives the data about the individual. Here, we imagine two binary predictors (sex and treatment) and two continuous response variables.

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))
head(data)

  sex treatment   response1   response2
1   1         1 -0.15668214 -0.13663012
2   1         2 -0.40934759 -0.07220426
3   1         1  0.07103731 -2.60549018
4   1         2  0.15113270  1.81803178
5   1         1  0.30836910  0.32596016
6   1         2 -1.41891407  1.12561812

Now, calculating a function of the response in some group is straightforward. Most R functions are vectorised by default and will accept a vector (that is, a column of a data frame). The subset function lets us pull out rows from the data frame based on a logical expression using the column names. Say that we want the mean, standard deviation and a simple standard error of the mean. I will assume that we have no missing values. If you have, you can add na.rm=T to the function calls. And again, if you've got a more sophisticated model, these might not be the standard errors you want. Then pull them from the fitted model instead.

mean(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)$response1)/
    sqrt(nrow(subset(data, sex == 1 & treatment == 1)))
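If you find yourself typing that last expression a lot, one option (my addition, not in the original post) is to wrap the standard error of the mean in a small helper function, assuming as above that there are no missing values:

## Hypothetical helper: standard error of the mean, assuming no missing values
sem <- function(x) sd(x)/sqrt(length(x))
sem(subset(data, sex == 1 & treatment == 1)$response1)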


Okay, but doing this for each combination of the predictors and responses is no fun and requires a lot of copying and pasting. Also, the above function calls are pretty messy with lots of repetition. There is a better way, and that’s where plyr and reshape2 come in. We load the packages. The first time you’ll have to run install.packages, as usual.
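For completeness (my addition), that one-time installation step would look like this:

install.packages(c("plyr", "reshape2"))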

library(plyr)
library(reshape2)


First out is the melt function from reshape2. Look at the table above. It's reasonable in many situations, but right now, it would be better if we put both the response variables in the same column. If it doesn't seem so useful, trust me and see below. Melt will take all the columns except the ones we single out as id variables and put them in the same column. It makes sense to label each row with the sex and treatment of the individual. If we had an actual unit id column, it would go here as well:

melted <- melt(data, id.vars=c("sex", "treatment"))


The resulting "melted" table looks like this. Instead of the response variables separately we get a column of values and a column indicating which variable the value comes from.

  sex treatment  variable       value
1   1         1 response1 -0.15668214
2   1         2 response1 -0.40934759
3   1         1 response1  0.07103731
4   1         2 response1  0.15113270
5   1         1 response1  0.30836910
6   1         2 response1 -1.41891407


Now it’s time to calculate the summary statistics again. We will use the same functions as above to do the actual calculations, but we’ll use plyr to automatically apply them to all the subsets we’re interested in. This is sometimes called the split-apply-combine approach: plyr will split the data frame into subsets, apply the function of our choice, and then collect the results for us. The first thing to notice is the function name. All the main plyr functions are called something with -ply. The letters stand for the input and return data type: ddply works on a data frame and returns a data frame. It’s probably the most important member of the family.
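To make the naming scheme concrete, here is a small illustration of my own (hypothetical, not from the original post): dlply takes a data frame and returns a list, and ldply goes the other way.

## Data frame in, list out: one linear model per response variable
models <- dlply(melted, "variable", function(d) lm(value ~ sex + treatment, data = d))

## List in, data frame out: collect the coefficients into a table
ldply(models, coef)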

The arguments to ddply are the data frame to work on (melted), a vector of the column names to split on, and a function. The arguments after the function name are passed on to the function. Here we want to split into subsets for each sex, treatment and response variable. The function we apply is summarise, which makes a new data frame with named columns based on the expressions you feed it, allowing us to use the column names of the input data frame in those expressions. In effect it does exactly what the name says: it summarises a data frame. And in this instance, we want to calculate the mean, standard deviation and standard error of the mean, so we use the above function calls, with value as the input. Run the ddply call, and we're done!

ddply(melted, c("sex", "treatment", "variable"), summarise,
      mean = mean(value), sd = sd(value),
      sem = sd(value)/sqrt(length(value)))

  sex treatment  variable         mean        sd        sem
1   1         1 response1  0.021856280 1.0124371 0.04527757
2   1         1 response2  0.045928150 1.0151670 0.04539965
3   1         2 response1 -0.065017971 0.9825428 0.04394065
4   1         2 response2  0.011512867 0.9463053 0.04232006
5   2         1 response1 -0.005374208 1.0095468 0.04514830
6   2         1 response2 -0.051699624 1.0154782 0.04541357
7   2         2 response1  0.046622111 0.9848043 0.04404179
8   2         2 response2 -0.055257295 1.0134786 0.04532414
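For comparison (my addition, not part of the original post), base R's aggregate can produce a similar grouped summary, although combining several statistics in one call is less convenient than with ddply:

## Base R alternative: one summary statistic per call is the simplest form
aggregate(value ~ sex + treatment + variable, data = melted, FUN = mean)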