Archive for the ‘english’ Category
Journal club of one: ”Maternal and additive gentic effects contribute to variation in offspring traits in a lizard”
The posts this week have been about epigenetics. However, let’s step back from the molecular mechanisms and what not to look at the bigger picture. This recent paper by Noble, McFarlane, Keogh and Whiting (2014) looks at maternal effects and additive genetic effects on fitness-related traits in a lizard. Now we are in quantitative genetics territory where one uses pedigrees and phenotypes to look at the determinants of a trait while abstracting away the mechanistic details. Nowadays, quantitative genetics is also equipped with Bayesian animal models and the ability to do parentage assignment with molecular methods.
The authors measured at size, body mass, and growth and as well as the speed and endurance when running. The fun part is that while only endurance had a substantial heritability (0.4), the other traits had maternal components in the 0.2-0.5 range. So for most of the traits there’s little heritability while a big chunk of the trait variance is explained by maternal effects.
I like the idea to include maternal traits to see look at what causes the maternal effect. Clutch size, maternal size and condition seem matter for some trait or another. In two cases the maternal effect is entirely explained away: the effect on growth by birth date and clutch size, and sprint speed by birth date.
The inferences come from an animal model that include a maternal effect. Something I’m curious about is how heritability would be overestimated if the maternal component was not accounted for. That is beside the point of the paper, though.
Another interesting point: I think everyone who deals with animals in some type of controlled environment wonder about how much our measurements differ from what would’ve been measured in a more natural environment. In this case, the authors measured offspring growth both in the test environment and in an enclosure. They find a maternal effect in the test environment, while the interval for the heritability goes from almost zero to 0.5. In the wilder environment they estimate very little genetic and maternal variance, as well as a larger residual variance. I don’t know if this is just because of increased noise, or because maternal effects actually interact with condition.
Also, I love figure 1 (the one figure). If more papers had caterpillar plots of most important estimated quantities, the world would be a better place.
Paper: ”Heritable genome-wide variation of gene expression and promoter methylation between wild and domesticated chickens”
Since I love author blog posts about papers, I thought I’d write a little about papers I’ve contributed too. So far, they’re not that many, but maybe it can be a habit.
”Heritable genome-wide variation of gene expression and promoter methylation between wild and domesticated chickens” was published in BMC Genomics in 2012. The title says it very well: the paper looks at differential expression and DNA methylation of a subset of genes in the hypothalamus of Red Junglefowl and domestic White Leghorn chickens. My contribution was during my MSc project in the group. Previously (Lindqvist & al 2007; Nätt & al 2009) Daniel Nätt, Pelle Jensen and others found a transgenerational effect of unpredictable light stress on domestic chickens. After that, and being interested in chicken domestication, a DNA methylation comparison of wild and domestic seems like a natural thing to do. And it turns out Red Junglefowl and White Leghorns differ in expression of a bunch of genes and in methylation of certain promoters (where promoter is operationally defined as a region around the start of the gene model). And when looking at two generations, the contrasts are correlated between parent and offspring. There is some heritable basis of the differences in gene expression and DNA methylation.
In Red Junglefowl, ancestor of domestic chickens, gene expression and methylation profiles in thalamus/hypothalamus differed substantially from that of a domesticated egg laying breed. Expression as well as methylation differences were largely maintained in the offspring, demonstrating reliable inheritance of epigenetic variation.
What I did was methylation sensitive high resolution melting. HRM is a typing method based on real time PCR. After PCR you often make a melting curve by ramping up the temperature, denaturing the PCR product. The melting characteristics depend on the sequence, so you can use melting to check that you get the expected PCR product, and it turns out that the difference can be big enough to type SNPs. And if you can type SNPs, you can analyse DNA methylation. So we treat the DNA with bisulfite, which deaminates cytosines to uracil unless they are protected by methylation, and get a converted sequence where an unmethylated C is like a C>T SNP. We set up standard curves with a mixture of whole-genome amplified and in vitro methylated DNA and measured the degree of methylation.
That is averaging over the population of DNA molecules in the sample; I’ve been wondering how HRM performs when the CpGs in the amplicon have heterogenous methylation differences. We’ve used HRM for genotyping as well, and it works, but we’ve switched to pyrosequencing, which gives cleaner results and where the assay design is much easier to get right the first time. I don’t know whether the same applies for methylation analysis with pyro.
My favourite part of the paper is figure 4b (licence: cc:by 2.0) which shows methylation analysis in the advanced intercross of Red Junglefowl and White Leghorns, which immediately leads to, as mentioned in the paper, the thought of DNA methylation QTL mapping.
Nätt, D., Rubin, C. J., Wright, D., Johnsson, M., Beltéky, J., Andersson, L., & Jensen, P. (2012). Heritable genome-wide variation of gene expression and promoter methylation between wild and domesticated chickens. BMC genomics, 13(1), 59.
Lindqvist C, Janczak AM, Nätt D, Baranowska I, Lindqvist N, et al. (2007) Transmission of Stress-Induced Learning Impairment and Associated Brain Gene Expression from Parents to Offspring in Chickens. PLoS ONE 2(4): e364. doi:10.1371/journal.pone.0000364
Nätt D, Lindqvist N, Stranneheim H, Lundeberg J, Torjesen PA, et al. (2009) Inheritance of Acquired Behaviour Adaptations and Brain Gene Expression in Chickens. PLoS ONE 4(7): e6405. doi:10.1371/journal.pone.0006405
What is is that is so scandalous about epigenetic inheritance? Not much, in my opinion. Some of the points on the spectrum clearly happen in the wild: stable and fluctuating epigenetic inheritance in plants, parental effects in animals and genomic imprinting in both. Widespread epigenetic inheritance in animals would change a lot of things, of course, but even if epigenetic inheritance turns out to be really important and common, genetics and evolution as we know them will not break. The tools to study and understand them are there.
Looking back at the post from yesterday, there are different flavours of epigenetic inheritance. At the most heritable end of the spectrum, epigenetic variants behave pretty much like genetic variants. Because quantitative genetics is agnostic to the molecular nature of the variants, as long as they behave like an inheritance system, most high-level genetic analysis will work the same. It’s just that on the molecular level, one would have to look to epigenetic marks, not to sequence changes, for the causal variant. Even if a substantial proportion of the genetic variance is caused by epigenetic variants rather than DNA sequence variants, this would not be a revolution that changes genetics or evolution into something incommensurable with previous thought.
The most revolutionary potential lies somewhere in the middle of the scale, in parental effects with really high fidelity of transmission that are potentially responsive to the environment, but in principle these things can still be dealt with by the same theoretical tools. Most people just didn’t think they were that important. How about soft inheritance? It seems dramatic, but all examples deal with specific programmed mechanisms: soft inheritance of the sensitivity to a particular odour or of the DNA methylation and expression state of a particular locus. No-one has yet suggested a generalised Lamarckian mechanism; that is still out of the question. DNA mutations are still unable to pass from somatic cells to gametes. Whatever tricks transgenerational mechanisms use to skip over the soma–germline distinction, they must be pretty exceptional. Discoveries of widespread soft inheritance in nature would be surprising, a cause for rethinking certain things and great fun. But conceptually, it is parental effects writ large. We can understand that. We have the technology.
Let us think aloud about the different possible meanings of epigenetic inheritance. I don’t want to contribute to unnecessary proliferation of terminology — people have already coined molar/molecular epigenetics (Crews 2009), intergenerational/transgenerational effects (Heard & Martienssen 2014), and probably several more dichotomies. But I thought it could be instructive to try to think about epigenetic inheritance in terms of the contribution it could make to variance components of a quantitative genetic model. After all, quantitative genetics is mostly agnostic about the molecular nature of the heritable variation.
At one end of the spectrum we find molecular epigenetic marks such as DNA methylation, as they feature in the normal development of the organism. Regardless of how faithfully they are transmitted through mitosis, or even if they pass through meiosis, they only contribute to individual variation if they are perturbed in different ways between individuals. If they do vary between individuals, though, in a fashion that is not passed on to the offspring, they will end up in the environmental variance component.
What about transmissible variation? There are multiple non-genetic ways for information to be passed a single generation: maternal or paternal effects need not be epigenetic in the molecular sense. They could be, like genomic imprinting, but they could also be caused by some biomolecule in the sperm, something that passes the blood–placenta barrier or something deposited by the mother into the egg. Transgenerational effects of this kind make related individuals more similar, they will affect the genetic variance component unless they are controlled. And in the best possible world of experimental design, parental effects can be controlled and modelled, and we can in principle separate out the maternal, paternal and genetic component. Think of effects like in Weaver & al (2004) that are perpetuated by maternal behaviour. If the behavioural transmission is strong enough they might form a pretty stable heritable effect that would appear in the genetic variance component if it’s not broken up by cross-fostering.
However, if the variation behaves like germ-line variation it will be irreversible by cross-fostering, inseparable from the genetic variance component, and it will have the potential to form a genuine parallel inheritance system. The question is: how stable will it be? Animals seem to be very good at resetting the epigenetic germline each generation. The most provocative suggestion is probably some type of variation that is both faithfully transmitted and sometimes responsive to the environment. Responsiveness means less fidelity of transmission, though, and it seems (Slatkin 2009) like epigenetic variants need to be stable for many generations to make any lasting impact on heritability. Then, at the heritable end of the spectrum, we find epigenetic variants that arise from some type of random mutation event and are transmitted faithfully through the germline. If they exist, they will behave just like any genetic variants and even have a genomic locus.
I’ve had this blog since 2010, but it was not until last year that I started writing anything else than popular/science in Swedish. There is lots of discussion on academic blogs about whether PhD students, or any academics, should write on blogs or not and also quite a bit of fear, uncertainty and doubt going around. This is what I think: I don’t think my blog is such a big deal. It’s just a small hobby project that makes me happy. And while I hope it doesn’t hurt my research or my chances to continue doing science, I don’t think it helps them much either.
Do I have a target audience? There was recently a small survey to find what academics blog about and why; they found that most blogs were directed at peers, not for outreach. I’m not surprised. As I’ve already mentioned, my posts in Swedish are more popular/science, less technical and sometimes deal with things published in Swedish media. I think the target audience is still geeks of some kind, but not necessarily genetics geeks. My posts in English are more directed at academical things, either related to my research and work as a PhD student or about the R language. So my posts are a mix of languages and themes. Is that a problem? From a popularity or readership perspective, probably yes. I can see little reason not to split the posts to two blogs, each concentrated on one theme, except that I don’t feel like running two blogs.
Does blogging hurt me because it hurts my work? I hardly think so. First, blogging is not part of my duties at the university, and I don’t do it instead of writing, working in the lab or analysing data. I do it in the evening after work, or in the case of some posts in the morning before. I’m not convinced blogging makes me in any way a better scientist, but it can hardly make me worse. Thinking about science or how to explain it for another hour now and then can’t hurt. And yes, the time spent blogging could theoretically be spent writing papers or something, but so could theoretically the time spent at the gym, with family or friends. If we grant that academics do other things, blogging could be one of those activities. My blog is not completely disconnected from my work, but I think it’s disconnected enough to be regarded as a fun pastime.
Does blogging hurt my reputation because people might read my blog and disapprove? I don’t think that many people read my blog; actually, I know that not many people do. Still, it is certainly possible that some of the readers might be important to my career and that they don’t like what they see. It will be found when people look me up with a search engine. Maybe someone thinks that I’m wasting my time, or maybe I’ve written something controversial — or more likely, something stupid. I think and say things that are mistaken all the time, and some of those mistakes might end up in a blog post. The point is, though, that expressing my opinion about things I care about is not something I do because I think it’ll further my career. I do it because I want to. If my writing is successful, the things on my blog will be the kinds of things I honestly know, think and believe about science.
The %.% operator in dplyr allows one to put functions together without lots of nested parentheses. The flanking percent signs are R’s way of denoting infix operators; you might have used %in% which corresponds to the match function or %*% which is matrix multiplication. The %.% operator is also called chain, and what it does is rearrange the call to pass its left hand side on as a parameter to the right hand side function. As noted in the documentation this makes function calls read from left to right instead of inside and out. Yesterday we we took a simulated data frame, called data, and calculated some summary statistics. We could put the entire script together with %.%:
library(dplyr) data %.% melt(id.vars=c("treatment", "sex")) %.% group_by(sex, treatment, variable) %.% summarise(mean(value))
I haven’t figured out what would be the best indentation here, but I think this looks pretty okay. Of course it works for non-dplyr functions as well, but they need to take the input data as their first argument.
data %.% lm(formula=response1 ~ factor(sex)) %.% summary()
As mentioned, dplyr is not the only package that has something like this, and according to a comment from Hadley Wickham, future dplyr will use the magrittr package instead, a package that adds piping to R. So let’s look at magrittr! The magrittr %>% operator works much the same way, except it allows one to put ”.” where the data is supposed to go. This means that the data doesn’t have to be the first argument to the function. For example, we can do this, which would give an error with dplyr:
library(magrittr) data %>% lm(response1 ~ factor(sex), .) %>% summary()
Moreover, Conrad Rudolph has used the operators %.%, %|>% and %|% in his own package for functional composition, chaining and piping. And I’m sure he is not the only one; there are several more packages that bring more new ways to define and combine functions into R. I hope I will revisit this topic when I’ve gotten used to it and decided what I like and don’t like. This might be confusing for a while with similar and rather cryptic operators that do slightly different things, but I’m sure it will turn out to be a useful development.
I know I’m on about Hadley Wickham‘s packages a lot. I’m not the president of his fanclub, but if there is one I’d certainly like to be a member. dplyr is going to be a new and improved ddply: a package that applies functions to, and does other things to, data frames. It is also faster and will work with other ways of storing data, such as R’s relational database connectors. I use plyr all the time, and obviously I want to start playing with dplyr, so I’m going to repeat yesterday’s little exercise with dplyr. Readers should be warned: this is really just me playing with dplyr, so the example will not be particularly profound. The post at the Rstudio blog that I just linked contains much more information.
So, here comes the code to do the thing we did yesterday but with dplyr:
## The code for the toy data is exactly the same data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)), treatment = rep(c(1, 2), 1000), response1 = rnorm(2000, 0, 1), response2 = rnorm(2000, 0, 1)) ## reshape2 still does its thing: library(reshape2) melted <- melt(data, id.vars=c("sex", "treatment")) ## This part is new: library(dplyr) grouped <- group_by(melted, sex, treatment) summarise(grouped, mean=mean(value), sd=sd(value))
When we used plyr yesterday all was done with one function call. Today it is two: dplyr has a separate function for splitting the data frame into groups. It is called group_by and returns the grouped data. Note that no quotation marks or concatenation were used when passing the column names. This is what it looks like if we print it:
Source: local data frame [4,000 x 4] Groups: sex, treatment, variable sex treatment variable value 1 1 1 response1 -0.15668214 2 1 2 response1 -0.40934759 3 1 1 response1 0.07103731 4 1 2 response1 0.15113270 5 1 1 response1 0.30836910 6 1 2 response1 -1.41891407 7 1 1 response1 -0.07390246 8 1 2 response1 -1.34509686 9 1 1 response1 1.97215697 10 1 2 response1 -0.08145883
The grouped data is still a data frame, but it contains a bunch of attributes that contain information about grouping.
The next function is a call to the summarise function. This is a new version of a summarise function similar to one in plyr. It will summarise the grouped data in columns given by the expressions you feed it. Here, we calculate mean and standard deviation of the values.
Source: local data frame [8 x 5] Groups: sex, treatment sex treatment variable mean sd 1 1 1 response1 0.021856280 1.0124371 2 1 1 response2 0.045928150 1.0151670 3 1 2 response1 -0.065017971 0.9825428 4 1 2 response2 0.011512867 0.9463053 5 2 1 response1 -0.005374208 1.0095468 6 2 1 response2 -0.051699624 1.0154782 7 2 2 response1 0.046622111 0.9848043 8 2 2 response2 -0.055257295 1.0134786
Maybe the new syntax is slightly clearer. Of course, there are alternative ways of expressing it, one of which is pretty interesting. Here are two equivalent versions of the dplyr calls:
summarise(group_by(melted, sex, treatment, variable), mean=mean(value), sd=sd(value)) melted %.% group_by(sex, treatment, variable) %.% summarise(mean=mean(value), sd=sd(value))
The first one is nothing special: we’ve just put the group_by call into summarise. The second version, though, is a strange creature. dplyr uses the operator %.% to denote taking what is on the left and putting it into the function on the right. Reading from the beginning of the expression we take the data (melted), push it through group_by and pass it to summarise. The other arguments to the functions are given as usual. This may seem very alien if you’re used to R syntax, or you might recognize it from shell pipes. This is not the only attempt make R code less nested and full of parentheses. There doesn’t seem to be any consensus yet, but I’m looking forward to a future where we can write points-free R.
A colleague asked: I have a lot of data in a table and I’d like to pull out some summary statistics for different subgroups. Can R do this for me quickly?
Yes, there are several pretty convenient ways. I wrote about this in the recent post on the barplot, but as this is an important part of quickly getting something useful out of R, just like importing data, I’ll break it out into a post of its own. I will present a solution that uses the plyr and reshape2 packages. You can do the same with base R, and there’s nothing wrong with base R, but I find that plyr and reshape2 makes things convenient and easy to remember. The apply family of functions in base R does the same job as plyr, but with a slightly different interface. I strongly recommend beginners to begin with plyr or the apply functions, and not what I did initially, which was nested for loops and hard bracket indexing.
We’ll go through and see what the different parts do. First, simulate some data. Again, when you do this, you usually have a table already, and you can ignore the simulation code. Usually a well formed data frame will look something this: a table where each observation is a unit such as an individual, and each column gives the data about the individual. Here, we imagine two binary predictors (sex and treatment) and two continuous response variables.
data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)), treatment = rep(c(1, 2), 1000), response1 = rnorm(2000, 0, 1), response2 = rnorm(2000, 0, 1)) head(data)
sex treatment response1 response2 1 1 1 -0.15668214 -0.13663012 2 1 2 -0.40934759 -0.07220426 3 1 1 0.07103731 -2.60549018 4 1 2 0.15113270 1.81803178 5 1 1 0.30836910 0.32596016 6 1 2 -1.41891407 1.12561812
Now, calculating a function of the response in some group is straightforward. Most R functions are vectorised by default and will accept a vector (that is, a column of a data frame). The subset function lets us pull out rows from the data frame based on a logical expression using the column names. Say that we want mean, standard deviation and a simple standard error of the mean. I will assume that we have no missing values. If you have, you can add na.rm=T to the function calls. And again, if you’ve got a more sophisticated model, these might not be the standard errors you want. Then pull them from the fitted model instead.
mean(subset(data, sex == 1 & treatment == 1)$response1) sd(subset(data, sex == 1 & treatment == 1)$response1) sd(subset(data, sex == 1 & treatment == 1)$response1)/ sqrt(nrow(subset(data, sex == 1 & treatment == 1)))
Okay, but doing this for each combination of the predictors and responses is no fun and requires a lot of copying and pasting. Also, the above function calls are pretty messy with lots of repetition. There is a better way, and that’s where plyr and reshape2 come in. We load the packages. The first time you’ll have to run install.packages, as usual.
First out, the melt function from rehape2. Look at the table above. It’s reasonable in many situations, but right now, it would be better if we put both the response variables in the same column. If it doesn’t seem so useful, trust me and see below. Melt will take all the columns except the ones we single out as id variables and put them in the same column. It makes sense to label each row with the sex and treatment of the individual. If we had an actual unit id column, it would go here as well:
melted <- melt(data, id.vars=c("sex", "treatment"))
The resulting ”melted” table looks like this. Instead of the response variables separately we get a column of values and a column indicating which variable the value comes from.
sex treatment variable value 1 1 1 response1 -0.15668214 2 1 2 response1 -0.40934759 3 1 1 response1 0.07103731 4 1 2 response1 0.15113270 5 1 1 response1 0.30836910 6 1 2 response1 -1.41891407
Now it’s time to calculate the summary statistics again. We will use the same functions as above to do the actual calculations, but we’ll use plyr to automatically apply them to all the subsets we’re interested in. This is sometimes called the split-apply-combine approach: plyr will split the data frame into subsets, apply the function of our choice, and then collect the results for us. The first thing to notice is the function name. All the main plyr functions are called something with -ply. The letters stand for the input and return data type: ddply works on a data frame and returns a data frame. It’s probably the most important member of the family.
The arguments to ddply are the data frame to work on (melted), a vector of the column names to split on, and a function. The arguments after the function name are passed on to the function. Here we want to split in subsets for each sex, treatment and response variable. The function we apply is summarise, which makes a new data frame with named columns based on formulas, allowing us to use the column names of the input data frame in formulas. In effect it does exactly what the name says, summarises a data frame. And in this instance, we want to calculate the mean, standard deviation and standard error of the mean, so we use the above function calls, using value as the input. Run the ddply call, and we’re done!
ddply(melted, c("sex", "treatment", "variable"), summarise, mean = mean(value), sd = sd(value), sem = sd(value)/sqrt(length(value)))
sex treatment variable mean sd sem 1 1 1 response1 0.021856280 1.0124371 0.04527757 2 1 1 response2 0.045928150 1.0151670 0.04539965 3 1 2 response1 -0.065017971 0.9825428 0.04394065 4 1 2 response2 0.011512867 0.9463053 0.04232006 5 2 1 response1 -0.005374208 1.0095468 0.04514830 6 2 1 response2 -0.051699624 1.0154782 0.04541357 7 2 2 response1 0.046622111 0.9848043 0.04404179 8 2 2 response2 -0.055257295 1.0134786 0.04532414
Ah, the barplot. Loved by some, hated by some, the first graph you’re likely to make in your favourite office spreadsheet software, but a rather tricky one to pull off in R. Or, that depends. If you just need a barplot that displays the value of each data point as a bar — which is one situation where I like a good barplot — the barplot( ) function does just that:
some.data <- rnorm(10, 4, 1.5) names(some.data) <- 1:10 barplot(some.data)
Done? Not really. The barplot (I know some people might not use the word plot for this type of diagram, but I will) one typically sees from a spreadsheet program has some gilding: it’s easy to get several variables (”series”) of data in the same plot, and often you’d like to see error bars. All this is very possible in R, either with base graphics, lattice or ggplot2, but it requires a little more work. As usual when it gets a bit more fancy, I prefer ggplot2 over the alternatives. Once upon a time when I started with ggplot2, I tried googling for this, and lots of people have answered this question. I was still confused, though. So, if you’re a new user and reading this, please bear with me and I’ll try to demonstrate what all the steps are good for. Whether it’s a good statistical graph or not, the barplot is actually a nice example of ggplot2 in action and will demonstrate some R principles.
Let us take an example: Say that we start with a pretty typical small dataset with two variables that we’ve measured in four groups. Now we’d like a barplot of the group means and error bars for the means.
0. Start a script
Making the plot will take more than a couple of lines, so it’s a good idea to put everything in a script. Below I will split the script into chunks, but the whole thing is on github. We make a new R file and load ggplot2, plyr and reshape2, the packages we will need:
library(ggplot2) library(plyr) library(reshape2)
1. Simulate some data
In the case of real barplot this is where you load your data. You will probably have it in a text file that you read with the read.table( ) family of functions or RStudios Import dataset button (which makes the read.table call for you; if you don’t feel like late nights hunched over the read.table manual page, I recommend it). Simulating data might look something like this:
n <- 10 group <- rep(1:4, n) mass.means <- c(10, 20, 15, 30) mass.sigma <- 4 score.means <- c(5, 5, 7, 4) score.sigma <- 3 mass <- as.vector(model.matrix(~0+factor(group)) %*% mass.means) + rnorm(n*4, 0, mass.sigma) score <- as.vector(model.matrix(~0+factor(group)) %*% score.means) + rnorm(n*4, 0, score.sigma) data <- data.frame(id = 1:(n*4), group, mass, score)
This code is not the tersest possible, but still a bit tricky to read. If you only care about the barplot, skip over this part. We define the number of individuals per group (10), create a predictor variable (group), set the true mean and standard deviation of each variable in each group and generate values from them. The values are drawn from a normal distribution with the given mean and standard deviation. The model.matrix( ) function returns a design matrix, what is usually called X in a linear model. The %*% operator is R’s way of denoting matrix multiplication — to match the correct mean with the predictor, we multiply the design matrix by the vector of means. Now that we’ve got a data frame, we pretend that we don’t know the actual values set above.
id group mass score 1 1 1 4.2367813 5.492707 2 2 2 16.4357254 1.019964 3 3 3 19.2491831 6.936894 4 4 4 23.4757636 3.845321 5 5 1 0.9533737 1.852927 6 6 2 19.9142350 5.567024
2. Calculate means
The secret to a good plot in ggplot2 is often to start by rearranging the data. Once the data is in the right format, mapping the columns of the data frame to the right element of the plot is the easy part. In this case, what we want to plot is not the actual data points, but a function of them — the group means. We could of course subset the data eight times (four groups times two variables), but thankfully, plyr can do that for us. Look at this piece of code:
melted <- melt(data, id.vars=c("id", "group")) means <- ddply(melted, c("group", "variable"), summarise, mean=mean(value))
First we use reshape2 to melt the data frame from tabular form to long form. The concept is best understood by comparing the output and input of melt( ). Compare the rows above to these rows, which are from the melted data frame:
id group variable value 1 1 1 mass 4.2367813 2 2 2 mass 16.4357254 3 3 3 mass 19.2491831 4 4 4 mass 23.4757636
We’ve gone from storing two values per row (mass and score) to storing one value (mass or score), keeping the identifying variables (id and group) in each row. This might seem tricky (or utterly obvious if you’ve studied database design), but you’ll soon get used to it. Trust me, if you do, it will prove useful!
The second row uses ddply (”apply from data frame to data frame”) to split up the melted data by all combinations of group and variable and calculate a function of the value, in this case the mean. The summarise function creates a new data frame from an old; the arguments are the new columns to be calculated. That is, it does exactly what it says, summarises a data frame. If you’re curious, try using it directly. It’s not very useful on its own, but very good in ddply calls.
3. Barplot of the means
Time to call on ggplot2! One has a choice between using qplot( ) or ggplot( ) to build up a plot, but qplot is the easier. We map the mean to y, the group indicator to x and the variable to the fill of the bar. The bar geometry defaults to counting values to make a histogram, so we need to tell use the y values provided. That’s what setting stat= to ”identity” is good for. To make the bars stand grouped next to each other instead of stacking, we tell set position=.
means.barplot <- qplot(x=group, y=mean, fill=variable, data=means, geom="bar", stat="identity", position="dodge")
4. Standard error of the mean
Some people can argue for hours about error bars. In some cases you will want other types of error bars. Maybe the inferences come from a hierarchical model where the standard errors are partially pooled. Maybe you’re dealing with some type of generalised linear model or a model made with transformed data. See my R tutorial for a simple example with anova. The point is that from the perspective of ggplot2 input to the error bars is data, just like anything else, and we can use the full arsenal of R tools to create them.
means.sem <- ddply(melted, c("group", "variable"), summarise, mean=mean(value), sem=sd(value)/sqrt(length(value))) means.sem <- transform(means.sem, lower=mean-sem, upper=mean+sem)
First, we add a standard error calculation to the ddply call. The transform function adds colums to a data frame; we use it to calculate the upper and lower limit to the error bars (+/- 1 SEM). Then back to ggplot2! We add a geom_errorbar layer with the addition operator. This reveals some of the underlying non-qplot syntax of ggplot2. The mappings are wrapped in the aes( ), aesthetics, function and the other settings to the layer are regular arguments. The data argument is the data frame with interval limits that we made above. The only part of this I don’t like is the position_dodge call. What it does is nudge the error bars to the side so that they line up with the bars. If you know a better way to get this behaviour without setting a constant, please write me a comment!
means.barplot + geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(0.9), data=means.sem)
Does this seem like a lot of code? If we look at the actual script and disregard the data simulation part, I don’t think it’s actually that much. And if you make this type of barplot often, you can package this up into a function.
Several people have asked: what scripting language should biologists learn if they are interested in doing a little larger-scale data analysis and have never programmed before? I’m not an expert, but these are the kinds of things I tend to say:
The language is not so important; the same principles apply everywhere. Use what your friends and colleagues use so you can get help from them. I believe most people would answer Python. I would answer R. Don’t believe people who tell you that R is not a serious language. You’re already familiar with analysing small datasets in a statistics program. You can do that in R too, and then the step to writing code and handling larger projects is actually very short. Your data will very likely come in tables, and R is very good at that. You’ll also want pretty graphs, and R is very good at that too. Regardless, have a look at the other common languages as well. Practice working from a terminal.