Posts Tagged ‘ggplot2’
Ah, the barplot. Loved by some, hated by some, the first graph you’re likely to make in your favourite office spreadsheet software, but a rather tricky one to pull off in R. Or, that depends. If you just need a barplot that displays the value of each data point as a bar — which is one situation where I like a good barplot — the barplot( ) function does just that:
some.data <- rnorm(10, 4, 1.5)
names(some.data) <- 1:10
barplot(some.data)
Done? Not really. The barplot (I know some people might not use the word plot for this type of diagram, but I will) one typically sees from a spreadsheet program has some gilding: it’s easy to get several variables (”series”) of data in the same plot, and often you’d like to see error bars. All this is very possible in R, either with base graphics, lattice or ggplot2, but it requires a little more work. As usual when it gets a bit more fancy, I prefer ggplot2 over the alternatives. Once upon a time when I started with ggplot2, I tried googling for this, and lots of people have answered this question. I was still confused, though. So, if you’re a new user and reading this, please bear with me and I’ll try to demonstrate what all the steps are good for. Whether it’s a good statistical graph or not, the barplot is actually a nice example of ggplot2 in action and will demonstrate some R principles.
Let us take an example: Say that we start with a pretty typical small dataset with two variables that we’ve measured in four groups. Now we’d like a barplot of the group means and error bars for the means.
0. Start a script
Making the plot will take more than a couple of lines, so it’s a good idea to put everything in a script. Below I will split the script into chunks, but the whole thing is on github. We make a new R file and load ggplot2, plyr and reshape2, the packages we will need:
library(ggplot2)
library(plyr)
library(reshape2)
1. Simulate some data
In the case of a real barplot this is where you load your data. You will probably have it in a text file that you read with the read.table( ) family of functions or RStudio’s Import dataset button (which makes the read.table call for you; if you don’t feel like late nights hunched over the read.table manual page, I recommend it). Simulating data might look something like this:
n <- 10
group <- rep(1:4, n)
mass.means <- c(10, 20, 15, 30)
mass.sigma <- 4
score.means <- c(5, 5, 7, 4)
score.sigma <- 3
mass <- as.vector(model.matrix(~0+factor(group)) %*% mass.means) +
  rnorm(n*4, 0, mass.sigma)
score <- as.vector(model.matrix(~0+factor(group)) %*% score.means) +
  rnorm(n*4, 0, score.sigma)
data <- data.frame(id = 1:(n*4), group, mass, score)
This code is not the tersest possible, but still a bit tricky to read. If you only care about the barplot, skip over this part. We define the number of individuals per group (10), create a predictor variable (group), set the true mean and standard deviation of each variable in each group and generate values from them. The values are drawn from a normal distribution with the given mean and standard deviation. The model.matrix( ) function returns a design matrix, what is usually called X in a linear model. The %*% operator is R’s way of denoting matrix multiplication — to match the correct mean with the predictor, we multiply the design matrix by the vector of means. Now that we’ve got a data frame, we pretend that we don’t know the actual values set above.
  id group       mass    score
1  1     1  4.2367813 5.492707
2  2     2 16.4357254 1.019964
3  3     3 19.2491831 6.936894
4  4     4 23.4757636 3.845321
5  5     1  0.9533737 1.852927
6  6     2 19.9142350 5.567024
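If the design matrix trick in the simulation feels opaque, a toy version may help. This is only a sketch with made-up means for three groups (the names group.means, X and expected are mine, not part of the script), showing that the matrix multiplication simply looks up the right mean for each row:

```r
# Toy version of the design-matrix trick: three groups, two individuals each.
group <- rep(1:3, 2)                  # 1 2 3 1 2 3
group.means <- c(10, 20, 15)          # made-up true means, one per group

# One indicator column per group; each row contains a single 1.
X <- model.matrix(~ 0 + factor(group))

# Matrix multiplication picks out the mean that belongs to each row's group.
expected <- as.vector(X %*% group.means)
expected

# The same lookup by plain indexing, for comparison.
all.equal(expected, group.means[group])
```

Plain indexing would have done the job here; the design matrix version is the one that generalises to models with several predictors.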
2. Calculate means
The secret to a good plot in ggplot2 is often to start by rearranging the data. Once the data is in the right format, mapping the columns of the data frame to the right element of the plot is the easy part. In this case, what we want to plot is not the actual data points, but a function of them — the group means. We could of course subset the data eight times (four groups times two variables), but thankfully, plyr can do that for us. Look at this piece of code:
melted <- melt(data, id.vars=c("id", "group"))
means <- ddply(melted, c("group", "variable"), summarise,
               mean=mean(value))
First we use reshape2 to melt the data frame from tabular form to long form. The concept is best understood by comparing the output and input of melt( ). Compare the rows above to these rows, which are from the melted data frame:
  id group variable      value
1  1     1     mass  4.2367813
2  2     2     mass 16.4357254
3  3     3     mass 19.2491831
4  4     4     mass 23.4757636
We’ve gone from storing two values per row (mass and score) to storing one value (mass or score), keeping the identifying variables (id and group) in each row. This might seem tricky (or utterly obvious if you’ve studied database design), but you’ll soon get used to it. Trust me, if you do, it will prove useful!
The second line uses ddply (“apply from data frame to data frame”) to split up the melted data by all combinations of group and variable and calculate a function of the value, in this case the mean. The summarise function creates a new data frame from an old one; the arguments are the new columns to be calculated. That is, it does exactly what it says: it summarises a data frame. If you’re curious, try using it directly. It’s not very useful on its own, but very good in ddply calls.
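To see the difference between summarise on its own and summarise inside ddply, here is a small sketch on a made-up data frame (toy and its values are invented for illustration; only the column layout matches the simulated data):

```r
library(plyr)
library(reshape2)

# Tiny made-up data frame in the same shape as the simulated one.
toy <- data.frame(id = 1:4,
                  group = c(1, 1, 2, 2),
                  mass = c(10, 12, 20, 22),
                  score = c(5, 7, 4, 6))

toy.melted <- melt(toy, id.vars = c("id", "group"))

# On its own, summarise collapses the whole data frame to a single row.
summarise(toy.melted, mean = mean(value))

# Inside ddply it is applied once per combination of group and variable.
toy.means <- ddply(toy.melted, c("group", "variable"), summarise,
                   mean = mean(value))
toy.means
```

The first call mixes mass and score into one meaningless grand mean, which is why summarise is rarely useful alone; the ddply call gives one row per group and variable.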
3. Barplot of the means
Time to call on ggplot2! One has a choice between using qplot( ) or ggplot( ) to build up a plot, but qplot is the easier. We map the mean to y, the group indicator to x and the variable to the fill of the bar. The bar geometry defaults to counting values to make a histogram, so we need to tell it to use the y values provided. That’s what setting stat= to “identity” is good for. To make the bars stand grouped next to each other instead of stacking, we set position= to “dodge”.
means.barplot <- qplot(x=group, y=mean, fill=variable,
                       data=means, geom="bar",
                       stat="identity", position="dodge")
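A note for readers on newer versions of ggplot2: as far as I can tell, the stat= and position= arguments to qplot have since been deprecated, so the call above may not behave as it did when this was written. A minimal sketch of the same plot built with ggplot( ) instead, using made-up means in place of the simulated ones:

```r
library(ggplot2)

# Made-up group means in the same layout as the means data frame.
means <- data.frame(group = rep(1:4, each = 2),
                    variable = rep(c("mass", "score"), 4),
                    mean = c(10, 5, 20, 5, 15, 7, 30, 4))

# stat = "identity" uses the y values as bar heights instead of counting rows;
# position = "dodge" puts the bars for mass and score side by side.
means.barplot <- ggplot(means, aes(x = group, y = mean, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge")
means.barplot
```

The mappings are the same as in the qplot call; they have just moved inside aes( ), and the settings for the bars have moved into the geom_bar layer.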
4. Standard error of the mean
Some people can argue for hours about error bars. In some cases you will want other types of error bars. Maybe the inferences come from a hierarchical model where the standard errors are partially pooled. Maybe you’re dealing with some type of generalised linear model or a model made with transformed data. See my R tutorial for a simple example with anova. The point is that from the perspective of ggplot2, the input to the error bars is data, just like anything else, and we can use the full arsenal of R tools to create them.
means.sem <- ddply(melted, c("group", "variable"), summarise,
                   mean=mean(value), sem=sd(value)/sqrt(length(value)))
means.sem <- transform(means.sem, lower=mean-sem, upper=mean+sem)
First, we add a standard error calculation to the ddply call. The transform function adds columns to a data frame; we use it to calculate the upper and lower limits of the error bars (+/- 1 SEM). Then back to ggplot2! We add a geom_errorbar layer with the addition operator. This reveals some of the underlying non-qplot syntax of ggplot2. The mappings are wrapped in the aes( ), aesthetics, function and the other settings to the layer are regular arguments. The data argument is the data frame with interval limits that we made above. The only part of this I don’t like is the position_dodge call. What it does is nudge the error bars to the side so that they line up with the bars. If you know a better way to get this behaviour without setting a constant, please write me a comment!
means.barplot + geom_errorbar(aes(ymax=upper, ymin=lower),
                              position=position_dodge(0.9),
                              data=means.sem)
Does this seem like a lot of code? If we look at the actual script and disregard the data simulation part, I don’t think it’s actually that much. And if you make this type of barplot often, you can package this up into a function.
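As for packaging it up, something like the following could work. It is a sketch, not the script from github: the function name barplot.with.sem is made up, it assumes the data frame has id and group columns plus the measured variables, and it builds the plot with ggplot( ) rather than qplot.

```r
library(ggplot2)
library(plyr)
library(reshape2)

# Hypothetical helper: barplot of group means with +/- 1 SEM error bars.
# Expects a data frame with columns id and group plus the measured variables.
barplot.with.sem <- function(data) {
  melted <- melt(data, id.vars = c("id", "group"))
  means.sem <- ddply(melted, c("group", "variable"), summarise,
                     mean = mean(value),
                     sem = sd(value) / sqrt(length(value)))
  means.sem <- transform(means.sem, lower = mean - sem, upper = mean + sem)
  ggplot(means.sem, aes(x = group, y = mean, fill = variable)) +
    geom_bar(stat = "identity", position = "dodge") +
    geom_errorbar(aes(ymax = upper, ymin = lower),
                  position = position_dodge(0.9))
}

# Usage with simulated data of the same shape as in step 1.
fake <- data.frame(id = 1:40, group = rep(1:4, 10),
                   mass = rnorm(40, 20, 4), score = rnorm(40, 5, 3))
sem.plot <- barplot.with.sem(fake)
sem.plot
```

Because the function returns the plot object, you can still add formatting to the result with the addition operator.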
Apparently, this turned out to be my most popular post ever. Of course there are lots of things to say about the heatmap (or quilt, tile, guilt plot etc), but what I wrote was literally just a quick celebratory post to commemorate that I’d finally grasped how to combine reshape2 and ggplot2 to quickly make this colourful picture of a correlation matrix.
However, I realised there is one more thing that is really needed, even if just for the first quick plot one makes for oneself: a better scale. The default scale is not the best for correlations, which range from -1 to 1, because it’s hard to tell where zero is. We use the airquality dataset for illustration as it actually has some negative correlations. In ggplot2, it’s very easy to get a scale that has a midpoint and a different colour in each direction. Since the tiles are coloured through the fill aesthetic, the one we need is called scale_fill_gradient2 (there is a matching scale_colour_gradient2 for the colour aesthetic), and we just need to add it. I also set the limits to -1 and 1, which doesn’t change the colour but fills out the legend for completeness. Done!
data <- airquality[,1:4]
library(ggplot2)
library(reshape2)
qplot(x=Var1, y=Var2, data=melt(cor(data, use="p")), fill=value, geom="tile") +
  scale_fill_gradient2(limits=c(-1, 1))
I recently got an email from a person at Packt publishing, who suggested I write a book for them about ggplot2. My answer, which is perfectly true, is that I have neither the time nor the expertise to do that. What I didn’t say is that 1) a quick web search suggests that Packt doesn’t have the best reputation and 2) there are already two books about ggplot2 that I think cover the entire field: the indispensable ggplot2 book, written by Hadley Wickham, the author of the package, and the R Graphics Cookbook by Winston Chang. There are too many decent but not great R books on the market already and there is no reason for me to spend time creating another one.
However, there are a few things I’d like to tell the novice, in general, about ggplot2:
1. ggplot2 is not necessarily superior to lattice or base graphics. It’s largely a matter of taste, you can make very nice plots with either plotting system and in some situations everybody will turn away from their favourite system and use another. For instance, a lot of built-in diagnostic plots come preprogrammed in base graphics. The great thing about ggplot2 is how it allows the user to think about plots as layers of mappings between variables and geometric objects. Base graphics and lattice are organised in different ways, which is not necessarily worse; it depends on the way you’re accustomed to thinking about statistical graphics.
2. There are two ways to start a plot in ggplot2: qplot( ) or ggplot( ). qplot is the quicker way and probably the one you should learn first. But don’t be afraid of the ggplot function! All it does is set up an empty plot and connect a data frame to it. After that you can add your layers and map your variables almost the same way you would do with qplot. After you’ve become comfortable with qplot, just try building plots with ggplot a few times, and you’ll see how similar it is.
3. The magic is not in plotting the data but in tidying and rearranging the data for plotting. Make sure to put all your labels, indicators and multiple series of data into the same data frame (most of the time: just cbind or merge them together), melt the data frame and pass it to ggplot2. If you want to layer predictions on top of your data, put output from your model in another data frame. It is perfectly possible, often advisable, to write functions that generate ggplot2 plots, but make sure to always create the data frame to be plotted first and then pass it on. I suggest not trying to create the mappings, that is x= and y= and the like, on the fly. There is always the risk of messing up the order of the vectors, and also, because ggplot2 uses metaprogramming techniques for the aesthetics, you might see unexpected behaviours when putting function calls into the mapping.
4. Worry about mapping variables and faceting first and then change the formatting. Because of how plots as well as settings in ggplot2 are objects that you can store and pass around, you can first create a raw version of the plot, and then just add (yes add, with the “+” operator) the formatting options you want. So taken together, the workflow for making any ggplot2 plot goes something like this: 1) put your data frame in order; 2) set up a basic plot with the qplot or ggplot function; 3) add one extra geom at a time (optional if you make a simple plot with qplot, since qplot sets up the first geometry for you); 4) add the settings needed to make the plot look good.
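The four steps of that workflow might look like this in practice. This is only an illustration using the built-in airquality data; the object names and the choice of geoms and theme are mine:

```r
library(ggplot2)

# 1) Put the data frame in order: keep what we need, make the factor explicit.
plot.data <- transform(airquality[, c("Wind", "Temp", "Month")],
                       Month = factor(Month))

# 2) Set up a basic plot. The qplot call in the comment describes the same
#    scatterplot as the ggplot call below it.
# quick <- qplot(x = Wind, y = Temp, data = plot.data)
basic <- ggplot(plot.data, aes(x = Wind, y = Temp)) + geom_point()

# 3) Add one extra geom at a time, and the facets.
with.line <- basic + geom_smooth(method = "lm") + facet_wrap(~ Month)

# 4) Add the formatting last.
finished <- with.line +
  xlab("Wind speed (mph)") +
  ylab("Temperature (degrees F)") +
  theme_bw()
finished
```

Each intermediate object is a complete plot, so you can print basic or with.line at any point to check that the mappings are right before fussing with the looks.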
Someone asked about this and I thought the solution with ggplot2 was pretty neat. Imagine that you have a scatterplot with some points in the exact same coordinates, and to reduce overplotting you want to have the size of the dot indicating the number of data points that fall on it. At the same time you want to colour the points according to some categorical variable.
The sizeplot function in the plotrix package makes this type of scatterplot. However, it doesn’t do the colouring easily. I’m sure it’s quite possible with a better knowledge of base graphics, but I tend to prefer ggplot2. To construct the same type of plot we need to count the data points. For this, I use table( ), and then melt the contingency table and remove the zeroes.
library(ggplot2)
library(reshape2)
data <- data.frame(x=c(0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4),
                   y=c(0, 0, 0, 3, 1, 1, 1, 2, 2, 1, 4, 4),
                   group=c(rep(1, 6), rep(2, 4), rep(3, 2)))
counts <- melt(table(data[1:2]))
colnames(counts) <- c(colnames(data)[1:2], "count")
counts <- subset(counts, count != 0)
sizeplot <- qplot(x=x, y=y, size=count, data=counts) +
  scale_size(range=c(5, 10))
This is the first sizeplot. (The original scale makes single points very tiny. Hence the custom scale for size. Play with the range values to taste!) To add colour, we merge the counts with the original data to get back the group information — and, in true ggplot2 fashion, map the group variable to colour.
counts.and.groups <- merge(counts, unique(data))
sizeplot.colour <- qplot(x=x, y=y, size=count, colour=factor(group),
                         data=counts.and.groups) +
  scale_size(range=c(5, 10))
One thing that this simple script does not handle well is if points that should have different colour happen to overlap. (As it stands, this code will actually plot two points both the size of the total number of overlapping points in different colours on top of each other. That must be wrong in several ways.) However, I don’t know what would be the best behaviour in this instance. Maybe to count the number of overlaps separately and plot both points while adding some transparency to the points?
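For what it’s worth, here is a sketch of that last idea: count within each combination of x, y and group, and draw the points semi-transparent. I have added one extra made-up data point so that two groups actually overlap, and the alpha value of 0.5 is an arbitrary choice. I’m not claiming this is the best behaviour, only that it is easy to try:

```r
library(ggplot2)
library(reshape2)

data <- data.frame(x = c(0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4),
                   y = c(0, 0, 0, 3, 1, 1, 1, 2, 2, 1, 4, 4),
                   group = c(rep(1, 6), rep(2, 4), rep(3, 2)))
# Add one point from group 2 on top of the group 1 points at (0, 0).
data <- rbind(data, data.frame(x = 0, y = 0, group = 2))

# Count within each combination of x, y and group, not just x and y.
counts.by.group <- melt(table(data))
colnames(counts.by.group) <- c(colnames(data), "count")
counts.by.group <- subset(counts.by.group, count != 0)

# Semi-transparent points let overlapping groups show through each other.
sizeplot.alpha <- ggplot(counts.by.group,
                         aes(x = x, y = y, size = count,
                             colour = factor(group))) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(5, 10))
sizeplot.alpha
```

At (0, 0) this draws a large point for group 1 and a small one for group 2, each sized by its own count rather than by the total.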
In the last episode (which was quite some time ago) we looked into comparisons of means with linear models. This time, let’s visualise some linear models with ggplot2, and practice another useful R skill, namely how to simulate data from known models. While doing this, we’ll learn some more about the layered structure of a ggplot2 plot, and some useful things about the lm function.
11. Using points, lines and error bars to show predictions from linear models
Return to the model of comb gnome mass at time zero. We’ve already plotted the coefficient estimates, but let us just look at them with the coef() function. Here the intercept term is the mean for green comb gnomes subjected to the control treatment. The ‘grouppink’ and ‘treatmentpixies’ coefficients are the mean differences of pink comb gnomes and comb gnomes exposed to pixies from this baseline condition. This way of assigning coefficients is called dummy coding and is the default in R.
model <- lm(mass0 ~ group + treatment, data)
coef(model)

    (Intercept)       grouppink treatmentpixies
      141.56771       -49.75414        23.52428
The estimate for a pink comb gnome with pixies is:
coef(model)[1] + coef(model)[2] + coef(model)[3]
There are alternative codings (”contrasts”) that you can use. A common one in Anova is to use the intercept as the grand mean and the coefficients as deviations from the mean. (So that the coefficients for different levels of the same factor sum to zero.) We can get this setting in R by changing the contrasts option, and then rerun the model. However, whether the coefficients are easily interpretable or not, they still lead to the same means, and we can always calculate the values of the combinations of levels that interest us.
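Here is a sketch of what changing the contrasts option looks like. Since the comb gnome data isn’t included here, it uses made-up data with the same structure (the fake data frame and the numbers behind it are invented), and it restores the old option afterwards so the rest of the session is unaffected:

```r
# Made-up data with the same structure as the comb gnome example.
set.seed(1)
fake <- expand.grid(group = c("green", "pink"),
                    treatment = c("control", "pixies"),
                    replicate = 1:10)
fake$mass0 <- 140 - 50 * (fake$group == "pink") +
  25 * (fake$treatment == "pixies") + rnorm(nrow(fake), 0, 30)

# Default dummy (treatment) coding.
model.dummy <- lm(mass0 ~ group + treatment, fake)

# Sum-to-zero coding for unordered factors; keep the old setting to restore it.
old.options <- options(contrasts = c("contr.sum", "contr.poly"))
model.sum <- lm(mass0 ~ group + treatment, fake)
options(old.options)

coef(model.dummy)
coef(model.sum)

# Different coefficients, same predicted means for every combination.
new.levels <- expand.grid(group = c("green", "pink"),
                          treatment = c("control", "pixies"))
all.equal(predict(model.dummy, new.levels), predict(model.sum, new.levels))
```

The two sets of coefficients look quite different, but the final line should confirm that the predicted means for the four combinations agree.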
Instead of typing in the formulas ourselves as above, we can get predictions from the model with the predict( ) function. We need a data frame of the new values to predict, which in this case means one row for each combination of the levels of group and treatment. Since we have two levels each there are only four of them, but in general we can use the expand.grid( ) function to generate all possible combinations of factor levels. We’ll then get the predictions and their confidence intervals, and bundle everything together into one handy data frame.
levels <- expand.grid(group=c("green", "pink"),
                      treatment=c("control", "pixies"))
predictions <- predict(model, levels, interval="confidence")
predicted.data <- cbind(levels, predictions)

  group treatment       fit       lwr      upr
1 green   control 141.56771 125.82527 157.3101
2  pink   control  91.81357  76.48329 107.1439
3 green    pixies 165.09199 149.34955 180.8344
4  pink    pixies 115.33785  98.93425 131.7414
Now that we have these intervals in a data frame we can plot them just like we would any other values. Back in part II, we put several categorical variables into the same plot by colouring the points. Now, let’s introduce a nice feature of ggplot2: making small multiples with faceting. qplot( ) takes a facets argument, which is a formula where the left hand side, before the tilde (‘~’), will be used to split the plot vertically, and the right hand side will split the plot horizontally. In this case, we split horizontally, each panel representing one level of the group variable. Also, we use a new geometry: pointrange, which draws a point with bars above and below it and is quite suitable for the intervals we’ve got.
qplot(x=treatment, facets=~group, y=fit, ymax=upr, ymin=lwr,
      geom="pointrange", data=predicted.data)
That’s good, but combining the predictions from the model and the actual data in the same plot would be nice. In ggplot2, every plot is an object that can be saved away to a variable. Then we can use the addition operator to add layers to the plot. Let’s make a jittered dotplot like the above and then add a layer with the pointrange geometry displaying confidence intervals. The scatter of the data points around the confidence intervals reminds us that there is quite a bit of residual variance. The coefficient of determination, as seen in the summary earlier, was about 0.25.
qplot(x=treatment, y=mass0, facets=~group, geom="jitter", data=data) +
  geom_pointrange(aes(y=fit, ymax=upr, ymin=lwr), colour="red",
                  data=predicted.data)
In the above, we make use of ggplot2’s more advanced syntax for specifying plots. The addition operator adds layers. The first layer can be set up with qplot(), but the following layers are made with their respective functions. Mappings from variables to features of the plot, called aesthetics, have to be put inside the aes() function. This might look a bit weird in the beginning, but it has its internal logic; all this is described in Hadley Wickham’s ggplot2 book.
We should probably try a regression line as well. The abline geometry allows us to plot a line with given intercept and slope, i.e. the coefficients of a simple regression. Let us simplify a little and look at the mass at time zero and the log-transformed mass at time 50 in only the green group. We make a linear model that uses the same slope for both treatments and a treatment-specific intercept. (Exercise for the reader: look at the coefficients with coef( ) and verify that I’ve pulled out the intercepts and slope correctly.) Finally, we plot the points with qplot and add the lines one layer at a time.
green.data <- subset(data, group=="green")
model.green <- lm(log(mass50) ~ mass0 + treatment, green.data)
intercept.control <- coef(model.green)[1]
intercept.pixies <- coef(model.green)[1] + coef(model.green)[3]
qplot(x=mass0, y=log(mass50), colour=treatment, data=green.data) +
  geom_abline(intercept=intercept.pixies, slope=coef(model.green)[2]) +
  geom_abline(intercept=intercept.control, slope=coef(model.green)[2])
12. Using pseudorandom numbers for sanity checking
There is a short step from playing with regression functions that we’ve fitted, like we did above, to making up hypothetical regression functions and simulating data from them. This type of fake-data simulation is very useful for testing how designs and estimation procedures behave, and for checking things like the control of the false positive rate and the power to accurately estimate a known model.
The model will be the simplest possible: a single categorical predictor with only two levels and normally distributed equal error variance, i.e. a t-test. There is a formula for the power of the t-test and an R function, power.t.test( ), that calculates it for us without the need for simulation. However, a nice thing about R is that we can pretty easily replace the t-test with more complex procedures. Any model fitting process that you can program in R can be bundled into a function and applied to pseudorandom simulated data. In the next episode we will go into how to make functions and apply them repeatedly.
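For reference, the analytic route with power.t.test( ) looks something like this. The standard deviation and effect size below are made-up stand-ins (assumed.sd and assumed.effect are not estimates from the gnome data), so don’t expect the numbers to match the simulations that follow:

```r
# Made-up numbers standing in for estimates from real data.
assumed.sd <- 40        # assumed standard deviation within each group
assumed.effect <- 25    # assumed true difference between the group means

# Analytic power of a two-sample t-test with 50 individuals per group.
analytic <- power.t.test(n = 50, delta = assumed.effect, sd = assumed.sd,
                         sig.level = 0.05)
analytic$power

# Leave out n and give the power instead to solve for the sample size
# per group that would be needed for 80% power.
needed <- power.t.test(delta = assumed.effect, sd = assumed.sd, power = 0.8)
needed$n
```

Exactly one of n, delta, sd, power and sig.level is left out of the call, and power.t.test solves for the missing one.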
Let us start out with a no effect model: 50 observations in two groups drawn from the same distribution. We use the mean and variance of the green control group. This first part just sets up the variables:
mu <- mean(subset(data, group=="green" & treatment=="control")$mass0)
sigma <- sd(subset(data, group=="green" & treatment=="control")$mass0)
treatment <- c(rep(1, 50), rep(0, 50))
The rnorm( ) function generates numbers from a normal distribution with specified mean and standard deviation. Apart from drawing numbers from it, R can of course pull out various table values, and it knows other distributions as well. Look at the documentation in ?distributions. Finally we perform a t-test. Most of the time, it should not show a significant effect, but sometimes it will.
sim.null <- rnorm(100, mu, sigma)
t.test(sim.null ~ treatment)$p.value
We can use the replicate( ) function to evaluate an expression multiple times. We put the simulation and t-test together into one expression, rinse and repeat. Finally, we check how many of the 1000 replicates gave a p-value below 0.05. Of course, it will be approximately 5% of them.
sim.p <- replicate(1000, t.test(rnorm(100, mu, sigma) ~ treatment)$p.value)
length(which(sim.p < 0.05))/1000
Let us add an effect! Say we’re interested in an effect that we expect to be approximately half the difference between the green and pink comb gnomes:
d <- mean(subset(data, group=="green" & treatment=="control")$mass0) -
  mean(subset(data, group=="pink" & treatment=="control")$mass0)
sim.p.effect <- replicate(1000, t.test(treatment * d/2 +
                                       rnorm(100, mu, sigma) ~ treatment)$p.value)
length(which(sim.p.effect < 0.05))/1000
We see that with 50 individuals in each group and this effect size we will detect a significant difference about 75% of the time. This is the power of the test. If you are able to find nice and trustworthy prior information about the kind of effect sizes and variances you expect to find in a study, design analysis allows you to calculate, for instance, how big a sample you need to have good power. Simulation can also give you an idea of how badly a statistical procedure will break if the assumptions don’t hold. We can try to simulate a situation where the variances of the two groups differ quite a bit.
sim.unequal <- replicate(1000, t.test(c(rnorm(50, mu, sigma),
                                        rnorm(50, mu, 2*sigma)) ~ treatment)$p.value)
length(which(sim.unequal < 0.05))/1000
sim.unequal.effect <- replicate(1000, t.test(c(rnorm(50, mu+d/2, sigma),
                                               rnorm(50, mu, 2*sigma)) ~ treatment)$p.value)
length(which(sim.unequal.effect < 0.05))/1000
In conclusion, the significance is still under control, but the power has dropped to about 40%. I hope that has given a small taste of how simulation can help with figuring out what is going on in our favourite statistical procedures. Have fun!
I am doing a second installment of the lunch seminars about data analysis with R for the members of the Wright lab. It’s pretty much the same material as before (data frames, linear models and some plots with ggplot2), but I’ve sprinkled in some more exercises during the seminars. I’ve tried emphasising scripting a bit more than last time, and made a lot of use of RStudio. Going through this first part has taken four hours, but that includes a quick review at each seminar of what we did last time and lots of questions. Next week we’ll get started on gene expression microarray data, and I’ll try introducing both limma and plyr.
(My previous introduction materials are posted here. Comments, suggestions and ideas about teaching R to biologists are always welcome!)
Occasionally I find myself wanting to draw several regression lines on the same plot, and of course ggplot2 has convenient facilities for this. As usual, don’t expect anything profound from this post, just a quick tip!
There are several reasons we might end up with a table of regression coefficients connecting two variables in different ways. For instance, see the previous post about ordinary and orthogonal regression lines, or as a commenter suggested: quantile regression. I’ve never used quantile regression myself, but another example might be plotting simulations from a regression or multiple regression lines for different combinations of predictors.
Let’s start with a couple of quantile regressions. Ordinary regression compares the mean difference in a response variable between different values of the predictors, while quantile regression models some chosen quantiles of the response variable. The rq function of Roger Koenker’s quantreg package does quantile regression. We extract the coefficient matrix and make a dataframe:
library(quantreg)
model.rq <- rq(Temp ~ Wind, airquality, tau=c(0.25, 0.5, 0.75))
quantile.regressions <- data.frame(t(coef(model.rq)))
colnames(quantile.regressions) <- c("intercept", "slope")
quantile.regressions$quantile <- rownames(quantile.regressions)
quantile.regressions

          intercept     slope  quantile
tau= 0.25  85.63636 -1.363636 tau= 0.25
tau= 0.50  93.03448 -1.379310 tau= 0.50
tau= 0.75  94.50000 -1.086957 tau= 0.75
The addition of the quantile column is optional if you don’t feel the need to colour the lines.
library(ggplot2)
scatterplot <- qplot(x=Wind, y=Temp, data=airquality)
scatterplot + geom_abline(aes(intercept=intercept, slope=slope,
                              colour=quantile),
                          data=quantile.regressions)
We use the fact that ggplot2 returns the plot as an object that we can play with and add the regression line layer, supplying not the raw data frame but the data frame of regression coefficients.