## Posts Tagged ‘**ggplot2**’

## Using R: correlation heatmap, take 2

Apparently, this turned out to be my most popular post ever. Of course there are lots of things to say about the heatmap (or quilt, tile, guilt plot etc), but what I wrote was literally just a quick celebratory post to commemorate that I’d finally grasped how to combine reshape2 and ggplot2 to quickly make this colourful picture of a correlation matrix.

However, I realised there is one more thing that is really needed, even if just for the first quick plot one makes for oneself: a better scale. The default scale is not the best for correlations, which range from -1 to 1, because it’s hard to tell where zero is. We use the airquality dataset for illustration as it actually has some negative correlations. In ggplot2, it’s very easy to get a scale that has a midpoint and a different colour in each direction. It’s called scale_colour_gradient2, and we just need to add it. I also set the limits to -1 and 1, which doesn’t change the colour but fills out the legend for completeness. Done!

data <- airquality[,1:4] library(ggplot2) library(reshape2) qplot(x=Var1, y=Var2, data=melt(cor(data, use="p")), fill=value, geom="tile") + scale_fill_gradient2(limits=c(-1, 1))

## Books and lessons about ggplot2

I recently got an email from a person at Packt publishing, who suggested I write a book for them about ggplot2. My answer, which is perfectly true, is that I don’t have the time, nor the expertise to do that. What I didn’t say is that 1) a quick web search suggests that Packt doesn’t have the best reputation and 2) there are already two books about ggplot2 that I think covers the entire field: the indispensable ggplot2 book, written by Hadley Wickham, the author of the package, and the R Graphics Cookbok by Wincent Chang. There are too many decent but not great R books on the market already and there is no reason for me to spend time to create another one.

However, there are a few things I’d like to tell the novice, in general, about ggplot2:

1. ggplot2 is not necessarily superior to lattice or base graphics. It’s largely a matter of taste, you can make very nice plots with either plotting system and in some situations everybody will turn away from their favourite system and use another. For instance, a lot of built-in diagnostic plots come preprogrammed in base graphics. The great thing about ggplot2 is how it allows the user to think about plots as layers of mappings between variables and geometric objects. Base graphics and lattice are organised in different ways, which is not necessarily worse; it depends on the way you’re accustomed to thinking about statistical graphics.

2. There are two ways to start a plot in ggplot2: qplot( ) or ggplot( ). qplot is the quicker way and probably the one you should learn first. But don’t be afraid of the ggplot function! All it does is set up an empty plot and connect a data frame to it. After that you can add your layers and map your variables almost the same way you would do with qplot. After you’ve become comfortable with qplot, just try building plots with ggplot a few times, and you’ll see how similar it is.

3. The magic is not in plotting the data but in tidying and rearranging the data for plotting. Make sure to put all your labels, indicators and multiple series of data into the same data frame (most of the time: just cbind or merge them together), melt the data frame and pass it to ggplot2. If you want to layer predictions on top of your data, put output from your model in another data frame. It is perfectly possible, often advisable, to write functions that generate ggplot2 plots, but make sure to always create the data frame to be plotted first and then pass it on. I suggest not trying to create the mappings, that is x= and y= and the like, on the fly. There is always the risk of messing up the order of the vectors, and also, because ggplot2 uses metaprogramming techniques for the aesthetics, you might see unexpected behaviours when putting function calls into the mapping.

4. Worry about mapping variables and facetting first and then change the formatting. Because of how plots as well as settings in ggplot2 are objects that you can store and pass around, you can first create a raw version of the plot, and then just add (yes add, with the ”+” operator) the formatting options you want. So taken together, the workflow for making any ggplot2 plot goes something like this: 1) put your data frame in order; 2) set up a basic plot with the qplot or ggplot function; 3) add one extra geom at at time (optional if you make a simple plot with qplot, since qplot sets up the first geometry for you); 4) add the settings needed to make the plot look good.

Happy ggplotting!

## Using R: Coloured sizeplot with ggplot2

Someone asked about this and I though the solution with ggplot2 was pretty neat. Imagine that you have a scatterplot with some points in the exact same coordinates, and to reduce overplotting you want to have the size of the dot indicating the number of data points that fall on it. At the same time you want to colour the points according to some categorical variable.

The sizeplot function in the plotrix package makes this type of scatterplot. However, it doesn’t do the colouring easily. I’m sure it’s quite possible with a better knowledge of base graphics, but I tend to prefer ggplot2. To construct the same type of plot we need to count the data points. For this, I use table( ), and then melt the contingency table and remove the zeroes.

library(ggplot2) library(reshape2) data <- data.frame(x=c(0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4), y=c(0, 0, 0, 3, 1, 1, 1, 2, 2, 1, 4, 4), group=c(rep(1, 6), rep(2, 4), rep(3, 2))) counts <- melt(table(data[1:2])) colnames(counts) <- c(colnames(data)[1:2], "count") counts <- subset(counts, count != 0) sizeplot <- qplot(x=x, y=y, size=count, data=counts) + scale_size(range=c(5, 10))

This is the first sizeplot. (The original scale makes single points very tiny. Hence the custom scale for size. Play with the range values to taste!) To add colour, we merge the counts with the original data to get back the group information — and, in true ggplot2 fashion, map the group variable to colour.

counts.and.groups <- merge(counts, unique(data)) sizeplot.colour <- qplot(x=x, y=y, size=count, colour=factor(group), data=counts.and.groups) + scale_size(range=c(5, 10))

One thing that this simple script does not handle well is if points that should have different colour happen to overlap. (As it stands, this code will actually plot two points both the size of the total number of overlapping points in different colours on top of each other. That must be wrong in several ways.) However, I don’t know what would be the best behaviour in this instance. Maybe to count the number of overlaps separately and plot both points while adding some transparency to the points?

## A slightly different introduction to R, part V: plotting and simulating linear models

In the last episode (which was quite some time ago) we looked into comparisons of means with linear models. This time, let’s visualise some linear models with ggplot2, and practice another useful R skill, namely how to simulate data from known models. While doing this, we’ll learn some more about the layered structure of a ggplot2 plot, and some useful thing about the lm function.

### 11. Using points, lines and error bars to show predictions from linear models

Return to the model of comb gnome mass at time zero. We’ve already plotted the coefficient estimates, but let us just look at them with the coef() function. Here the intercept term is the mean for green comb gnomes subjected to the control treatment. The ‘grouppink’ and ‘treatmentpixies’ coefficients are the mean differences of pink comb gnomes and comb gnomes exposed to pixies from this baseline condition. This way of assigning coefficients is called dummy coding and is the default in R.

model <- lm(mass0 ~ group + treatment, data) coef(model)[1]

(Intercept) grouppink treatmentpixies 141.56771 -49.75414 23.52428

The estimate for a pink comb gnome with pixies is:

coef(model)[1] + coef(model)[2] + coef(model)[3]

There are alternative codings (”contrasts”) that you can use. A common one in Anova is to use the intercept as the grand mean and the coefficients as deviations from the mean. (So that the coefficients for different levels of the same factor sum to zero.) We can get this setting in R by changing the contrasts option, and then rerun the model. However, whether the coefficients are easily interpretable or not, they still lead to the same means, and we can always calculate the values of the combinations of levels that interest us.

Instead of typing in the formulas ourself as above, we can get predictions from the model with the predict( ) function. We need a data frame of the new values to predict, which in this case means one row for each combination of the levels of group and treatment. Since we have too levels each there are only for of them, but in general we can use the expand.grid( ) function to generate all possible factor levels. We’ll then get the predictions and their confidence intervals, and bundle everything together to one handy data frame.

levels <- expand.grid(group=c("green", "pink"), treatment=c("control", "pixies")) predictions <- predict(model, levels, interval="confidence") predicted.data <- cbind(levels, predictions)

group treatment fit lwr upr 1 green control 141.56771 125.82527 157.3101 2 pink control 91.81357 76.48329 107.1439 3 green pixies 165.09199 149.34955 180.8344 4 pink pixies 115.33785 98.93425 131.7414

Now that we have these intervals in a data frame we can plot them just like we would any other values. Back in part II, we put several categorical variables into the same plot by colouring the points. Now, let’s introduce nice feature of ggplot2: making small multiples with faceting. qplot( ) takes facets argument which is a formula where the left hand side, before the tilde (‘~’), will be used to split the plot vertically, and the right hand side will split the plot horizontally. In this case, we split horizontally, each panel representing one level of the treatment variable. Also, we use a new geometry: pointrange, which draws a point with bars above and below it and is quite suitable for the intervals we’ve got.

qplot(x=treatment, facets=~group, y=fit, ymax=upr, ymin=lwr geom="pointrange", data=predicted.data)

That’s good, but combining the predictions from the model and the actual data in the same plot would be nice. In ggplot2, every plot is an object that can be saved away to a variable. Then we can use the addition operator to add layers to the plot. Let’s make a jittered dotplot like the above and then add a layer with the pointrange geometry displaying confidence intervals. The scatter of the data points around the confidence intervals reminds us that there is quite a bit of residual variance. The coefficient of determination, as seen in the summary earlier, was about 0.25.

qplot(x=treatment, y=mass0, facets=~group, geom="jitter", data=data) + geom_pointrange(aes(y=fit, ymax=upr, ymin=lwr), colour="red", data=predicted.data)

In the above, we make use of ggplot2’s more advanced syntax for specifying plots. The addition operator adds layers. The first layer can be set up with qplot(), but the following layers are made with their respective functions. Mapping from variables to features of the plot, called aesthetics, have to be put inside the aes() function. This might look a bit weird in the beginning, but it has its internal logic — all this is described in Hadley Wickham’s ggplot2 book.

We should probably try a regression line as well. The abline geometry allows us to plot a line with given intercept and slope, i.e. the coefficients of a simple regression. Let us simplify a little and look at the mass at time zero and the log-transformed mass at time 50 in only the green group. We make a linear model that uses the same slope for both treatments and a treatment-specific intercept. (Exercise for the reader: look at the coefficients with coef( ) and verify that I’ve pulled out the intercepts and slope correctly.) Finally, we plot the points with qplot and add the lines one layer at the time.

green.data <- subset(data, group=="green") model.green <- lm(log(mass50) ~ mass0 + treatment, green.data) intercept.control <- coef(model.green)[1] intercept.pixies <- coef(model.green)[1]+coef(model.green)[3] qplot(x=mass0, y=log(mass50), colour=treatment, data=green.data) + geom_abline(intercept=intercept.pixies, slope=coef(model.green)[2]) + geom_abline(intercept=intercept.control, slope=coef(model.green)[2])

### 12. Using pseudorandom numbers for sanity checking

There is a short step from playing with regression functions that we’ve fitted, like we did above, to making up hypothetical regression functions and simulating data from them. This type of fake-data simulation is very useful to for testing how designs and estimation procedures behave and check things like the control of false positive rate and the power to accurately estimate a known model.

The model will be the simplest possible: a single categorical predictor with only two levels and normally distributed equal error variance, i.e. a t-test. There is a formula for the power of the t-test and an R function, power.t.test( ), that calculates it for us without the need for simulation. However, a nice thing about R is that we can pretty easily replace the t-test with more complex procedures. Any model fitting process that you can program in R can be bundled into a function and applied to pseudorandom simulated data. In the next episode we will go into how to make functions and apply them repeatedly.

Let us start out with a no effect model: 50 observations in two groups drawn from the same distribution. We use the mean and variance of the green control group. This first part just sets up the variables:

mu <- mean(subset(data, group=="green" & treatment=="control")$mass0) sigma <- sd(subset(data, group=="green" & treatment=="control")$mass0) treatment <- c(rep(1, 50), rep(0, 50))

The rnorm( ) function generates numbers from a normal distribution with specified mean and standard deviation. Apart from drawing numbers from it, R can of course pull out various table values, and it knows other distributions as well. Look at the documentation in ?distributions. Finally we perform a t-test. Most of the time, it should not show a significant effect, but sometimes it will.

sim.null <- rnorm(100, mu, sigma) t.test(sim.null ~ treatment)$p.value

We can use the replicate( ) function to evaluate an expression multiple times. We put the simulation and t-test together into one expression, rinse and repeat. Finally, we check how many of the 1000 replicates gave a p-value below 0.05. Of course, it will be approximately 5% of them.

sim.p <- replicate(1000, t.test(rnorm(100, mu, sigma) ~ treatment)$p.value) length(which(sim.p < 0.05))/1000

[1] 0.047

Let us add an effect! Say we’re interested in an effect that we expect to be approximately half the difference between the green and pink comb gnomes:

d <- mean(subset(data, group=="green" & treatment=="control")$mass0) - mean(subset(data, group=="pink" & treatment=="control")$mass0) sim.p.effect <- replicate(1000, t.test(treatment * d/2 + rnorm(100, mu, sigma) ~ treatment)$p.value) length(which(sim.p.effect < 0.05))/1000

[1] 0.737

We see that with 50 individuals in each group and this effect size we will detect a significant difference about 75% of the time. This is the power of the test. If you are able to find nice and trustworthy prior information about the kind of effect sizes and variances you expect to find in a study, design analysis allows you to calculate for instance how big a sample you need to have good power. Simulation can also give you an idea of how badly a statistical procedure will break if the assumptions don’t hold. We can try to simulate a situation where the variances of the two groups differs quite a bit.

sim.unequal <- replicate(1000, t.test(c(rnorm(50, mu, sigma), rnorm(50, mu, 2*sigma)) ~ treatment)$p.value) length(which(sim.unequal < 0.05))/1000

[1] 0.043

sim.unequal.effect <- replicate(1000, t.test(c(rnorm(50, mu+d/2, sigma), rnorm(50, mu, 2*sigma)) ~ treatment)$p.value) length(which(sim.unequal.effect < 0.05))/1000

[1] 0.373

In conclusion, the significance is still under control, but the power has dropped to about 40%. I hope that has given a small taste of how simulation can help with figuring out what is going on in our favourite statistical procedures. Have fun!

## R intro seminars, take 2: some slides about data frames, linear models and statistical graphics

I am doing a second installment of the lunch seminars about data analysis with R for the members of the Wright lab. It’s pretty much the same material as before — data frames, linear models and some plots with ggplot2 — but I’ve sprinkled in some more exercises during the seminars. I’ve tried emphasising scripting a bit more than last time, and made a lot of use of RStudio. Going through this first part has taken four hours, but that includes each seminar a quick review of what we did last time and lots of questions. Next week we’ll get started on gene expression microarray data, and I’ll try introducing both *limma* and *plyr*.

(My previous introduction materials are posted here. Comments, suggestions and ideas about teaching R to biologists are always welcome!)

## Using R: drawing several regression lines with ggplot2

Occasionally I find myself wanting to draw several regression lines on the same plot, and of course ggplot2 has convenient facilities for this. As usual, don’t expect anything profound from this post, just a quick tip!

There are several reasons we might end up with a table of regression coefficients connecting two variables in different ways. For instance, see the previous post about ordinary and orthogonal regression lines, or as a commenter suggested: quantile regression. I’ve never used quantile regression myself, but another example might be plotting simulations from a regression or multiple regression lines for different combinations of predictors.

Let’s start with a couple of quantile regressions. Ordinary regression compares the mean difference in a response variable between different values of the predictors, while quantile regression models some chosen quantiles of the response variable. The rq function of Roger Koenker’s quantreg package does quantile regression. We extract the coefficient matrix and make a dataframe:

library(quantreg) model.rq <- rq(Temp ~ Wind, airquality, tau=c(0.25, 0.5, 0.75)) quantile.regressions <- data.frame(t(coef(model.rq))) colnames(quantile.regressions) <- c("intercept", "slope") quantile.regressions$quantile <- rownames(quantile.regressions) quantile.regressions

intercept slope quantile tau= 0.25 85.63636 -1.363636 tau= 0.25 tau= 0.50 93.03448 -1.379310 tau= 0.50 tau= 0.75 94.50000 -1.086957 tau= 0.75

The addition of the quantile column is optional if you don’t feel the need to colour the lines.

library(ggplot2) scatterplot <- qplot(x=Wind, y=Temp, data=airquality) scatterplot + geom_abline(aes(intercept=intercept, slope=slope, colour=quantile), data=quantile.regressions)

We use the fact that ggplot2 returns the plot as an object that we can play with and add the regression line layer, supplying not the raw data frame but the data frame of regression coefficients.

## ”How to draw the line” with ggplot2

In a recent tutorial in the eLife journal, Huang, Rattner, Liu & Nathans suggested that researchers who draw scatterplots should start providing not one but three regression lines. I quote,

Plotting both regression lines gives a fuller picture of the data, and comparing their slopes provides a simple graphical assessment of the correlation coefficient. Plotting the orthogonal regression line (red) provides additional information because it makes no assumptions about the dependence or independence of the variables; as such, it appears to more accurately describe the trend in the data compared to either of the ordinary least squares regression lines.

Not that new, but I do love a good scatterplot, so I decided to try drawing some lines. I use the temperature and wind variables in the air quality data set (NA values removed). We will need Hadley Wickham’s ggplot2 and Bendix Carstensen’s, Lyle Gurrin’s and Claus Ekstrom’s MethComp.

library(ggplot2) library(MethComp) data(airquality) data <- na.exclude(airquality)

Let’s first make the regular old scatterplot with a regression (temperature as response; wind as predictor):

plot.y <- qplot(y=Temp, x=Wind, data=data) model.y <- lm(Temp ~ Wind, data) coef.y <- coef(model.y) plot.y + geom_abline(intercept=coef.y[1], slope=coef.y[2])

And then, a regular old scatterplot of the other regression:

plot.x <- qplot(y=Wind, x=Temp, data=data) model.x <- lm(Wind ~ Temp, data) coef(model.x) plot.x + geom_abline(intercept=coef(model.x)[1], slope=coef(model.x)[2])

So far, everything is normal. To put both lines in the same plot, we’ll need to rearrange the coefficients a little. From the above regression we get the equation *x = a + by*, and we rearrange it to *y = – a / b + (1 / b) x*.

rearrange.coef <- function(coef) { alpha <- coef[1] beta <- coef[2] new.coef <- c(-alpha/beta, 1/beta) names(new.coef) <- c("intercept", "slope") return(new.coef) } coef.x <- rearrange.coef(coef(model.x))

The third regression line is different: orthogonal, total least squares or Deming regression. There is a function for that in the MethComp package.

deming <- Deming(y=airquality$Temp, x=airquality$Wind) deming

We can even use the rearrange.coef function above to see that the coefficients of Deming regression does not depend on which variable is taken as the response or predictor:

rearrange.coef(deming)

intercept slope 24.8083259 -0.1906826

Deming(y=airquality$Wind, x=airquality$Temp)[1:2]

Intercept Slope 24.8083259 -0.1906826

So, here is the final plot with all three lines:

plot.y + geom_abline(intercept=coef.y[1], slope=coef.y[2], colour="red") + geom_abline(intercept=coef.x[1], slope=coef.x[2], colour="blue") + geom_abline(intercept=deming[1], slope=deming[2], colour="purple")

Now for the dénouement of this post. Of course, there’s already an R function for doing this. It’s even in the same package as I used for the Deming regression; I just didn’t immediately rtfm. It uses base R graphics, though, and I tend to prefer ggplot2:

plot(x=data$Wind, y=data$Temp) bothlines(x=data$Wind, y=data$Temp, Dem=T, col=c("red", "blue", "purple"))

This is the resulting plot (and one can even check out the source code of bothlines and see that I did my algebra correctly).

In this case, as well as in a few others that I tried, it seems like one of the ordinary regression lines (in this case the one with wind as response and temperature as predictor) is much closer to the Deming regression line than the other. I wonder under what circumstances that is the case, and if it tells anything useful about the variables. I welcome any thoughts on this matter from you, dear reader.

**Literature**

Huang L, Rattner A, Liu H, Nathans J. (2013) Tutorial: How to draw the line in biomedical research. eLife e00638 doi:10.7554/eLife.00638

## Slides and exercise from my second R intro seminar

This week I held the second introductory seminar on R, and I think it went pretty well — though I guess you really should ask my colleagues if you want to know. The first seminar was a lecture, and this seminar was a tutorial where we made some plots and calculated a few of the usual statistics. Of course the only real way to learn R is to play with it, but I hope this couple of hours provided a decent opening to getting started with R.

I actually think RStudio made it quite a bit easier. One important thing that I certainly should’ve stressed more, though, is organising code into scripts. I mentioned it in the first seminar, but I should have included it into the exercise. Maybe the first section should be something like ”Start a new script file for your analysis: Select File > New > R script to open the editor area. Save the script as unicorn_analysis.R, and for the rest of the tutorial, write your code in that window.”

Files from the seminar:

Slides about common statistical functions in R (again, ugly walls of text meant more as notes for future reference than actual slides)

Exercises for the tutorial and the associated dataset

My suggested solutions to the exercises

## Using R: Correlation heatmap with ggplot2

Just a short post to celebrate that I learned today how incredibly easy it is to make a heatmap of correlations with ggplot2 (and reshape2, of course).

data(attitude) library(ggplot2) library(reshape2) qplot(x=Var1, y=Var2, data=melt(cor(attitude)), fill=value, geom="tile")

So, what is going on in that short passage? cor makes a correlation matrix with all the pairwise correlations between variables (twice; plus a diagonal of ones). melt takes the matrix and creates a data frame in long form, each row consisting of id variables Var1 and Var2 and a single value. We then plot with the tile geometry, mapping the indicator variables to rows and columns, and value (i.e. correlations) to the fill colour.