Inlägg taggade ‘ggplot2’
This week I held the second introductory seminar on R, and I think it went pretty well — though I guess you really should ask my colleagues if you want to know. The first seminar was a lecture, and this seminar was a tutorial where we made some plots and calculated a few of the usual statistics. Of course the only real way to learn R is to play with it, but I hope this couple of hours provided a decent opening to getting started with R.
I actually think RStudio made it quite a bit easier. One important thing that I certainly should’ve stressed more, though, is organising code into scripts. I mentioned it in the first seminar, but I should have included it into the exercise. Maybe the first section should be something like ”Start a new script file for your analysis: Select File > New > R script to open the editor area. Save the script as unicorn_analysis.R, and for the rest of the tutorial, write your code in that window.”
Files from the seminar:
Slides about common statistical functions in R (again, ugly walls of text meant more as notes for future reference than actual slides)
My suggested solutions to the exercises
Just a short post to celebrate that I learned today how incredibly easy it is to make a heatmap of correlations with ggplot2 (and reshape2, of course).
data(attitude) library(ggplot2) library(reshape2) qplot(x=Var1, y=Var2, data=melt(cor(attitude)), fill=value, geom="tile")
So, what is going on in that short passage? cor makes a correlation matrix with all the pairwise correlations between variables (twice; plus a diagonal of ones). melt takes the matrix and creates a data frame in long form, each row consisting of id variables Var1 and Var2 and a single value. We then plot with the tile geometry, mapping the indicator variables to rows and columns, and value (i.e. correlations) to the fill colour.
Now, after reading in data, making plots and organising commands with scripts and Sweave, we’re ready to do some numerical data analysis. If you’re following this introduction, you’ve probably been waiting for this moment, but I really think it’s a good idea to start with graphics and scripting before statistical calculations.
We’ll use the silly comb gnome dataset again. If you saved an Rdata file in part II, you can load it with
If not, you can run this. Remind yourself what the changes to the melted data mean:
data <- read.csv("comb_gnome_data.csv") library(reshape2) melted <- melt(data, id.vars=c("id", "group", "treatment")) melted$time <- 0 melted$time[which(melted$variable=="mass10"] <- 10 melted$time[which(melted$variable=="mass25")] <- 25 melted$time[which(melted$variable=="mass50")] <- 50 melted$id <- factor(melted$id)
We’ve already looked at some plots and figured out that there looks to be substantial differences in mass between the green and pink groups, and the control versus treatment. Let’s try to substantiate that with some statistics.
9. Mean and covariance
Just like anything in R, statistical tools are functions. Some of them come in special packages, but base R can do a lot of stuff out of the box.
Comparsion of two means: We’ve already gotten means from the mean() function and from summary(). Variance and standard deviation are calculated with var() and sd() respectively. Comparing the means betweeen two groups with a t-test or a Wilcoxon-Mann-Whitney test is done with t.test() and wilcox.test(). The functions have the word test in their names, but t-test gives not only the test statistics and p-values, but also estimates and confidence intervals. The parameters are two vectors of values of each group (i.e. a column from the subset of a data frame), and some options.
Looking back at this plot, I guess no-one is surprised by a difference in birthweigh between pink and green comb gnomes:
t.test(subset(data, group=="pink")$mass0, subset(data, group=="green")$mass0)
Welch Two Sample t-test data: subset(data, group == "pink")$mass0 and subset(data, group == "green")$mass0 t = -5.397, df = 96.821, p-value = 4.814e-07 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -69.69294 -32.21577 sample estimates: mean of x mean of y 102.3755 153.3298
That is, we feed in two pieces of data (two vectors, really, which is what you get pulling out a column from a data frame). The above is the typical situation when you have all data points in one column and a group indicator in another. Hence you begin by subsetting the data frame to get the right rows, and pull out the right columns with the $. t.test also does paired tests, with the additional parameter paired=T.
wilcox.test(subset(data, group=="pink")$mass50, subset(data, group=="green")$mass50)
Wilcoxon rank sum test with continuity correction data: subset(data, group == "pink")$mass50 and subset(data, group == "green")$mass50 W = 605, p-value = 1.454e-05 alternative hypothesis: true location shift is not equal to 0
Recalling histograms for the comb gnome weights, the use of the Wilcoxon-Mann-Whitney for masses at tim 50 and a t-test for the masses at birth (t=0) probably makes sense. However, we probably want to make use of all the time points together rather than doing a test for each time point, and we also want to deal with both the colour and the treatment at the same time.
Before we get there, let’s look at correlation:
cor(data$mass10, data$mass25) cor(data$mass0, data$mass50, method="spearman") cor.test(data$mass10, data$mass25)
The cor() function gives you correlation coefficients, both Pearson, Spearman and Kendall. If you want the covariance, cov() is the function for that. cor.test() does associated tests and confidence intervals. One thing to keep in mind is missing values. This data set is complete, but try this:
some.missing <- data$mass0 some.missing[c(10, 20, 30, 40:50)] <- NA cor(some.missing, data$mass25) cor(some.missing, data$mass10, use="pairwise")
The use parameter decides what values R should include. The default is all, but we can choose pairwise complete observations instead.
If you have a big table of variables that you’d like to correlate with each other, the cor() function works for them as well. (Not cor.test(), though. However, the function can be applied across the rows of a data frame. We’ll return to that.)
10. A couple of simple linear models
Honestly, most of the statistics in biology is simply linear models fit with least squares and tested with a normal error model. A linear model looks like this
yi = b0 + b1x1i + b2x2i + … bnxni + ei
where y is the response variable, the xs are predictors, i is an index over the data points, and ei are the errors. The error is the only part of the equations that is a random variable. b0, …, bn are the coefficients — your main result, showing how the mean difference in the response variable between data points with different values of the predictors. The coefficients are fit by least squares, and by estimating the variance of the error term, we can get some idea of the uncertainty in the coefficients.
Regression coefficients can be interpreted as predictions about future values or sometimes even as causal claims (depending on other assumptions), but basically, they describe differences in mean values.
This is not a text on linear regression — there are many of those; may I suggest the books by Faraway or Gelman and Hill — suffice to say that as long as the errors are independent and have equal variance, least squares is the best unbiased estimate. If we also assume that the errors are normally distributed, the least squares is also the maximum likelihood estimate. (And it’s essentially the same as a Bayesian version of the linear model with vague priors, just for the record.)
In R, the lm() function handles linear models. The model is entered as a formula of the type response ~ predictor + predictor * interacting predictors. The error is implicit, and assumed to be normally distributed.
model <- lm(mass0 ~ group + treatment, data=data) summary(model)
Call: lm(formula = mass0 ~ group + treatment, data = data) Residuals: Min 1Q Median 3Q Max -86.220 -32.366 -2.847 35.445 98.417 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 141.568 7.931 17.850 < 2e-16 *** grouppink -49.754 9.193 -5.412 4.57e-07 *** treatmentpixies 23.524 9.204 2.556 0.0122 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 45.67 on 96 degrees of freedom Multiple R-squared: 0.28, Adjusted R-squared: 0.265 F-statistic: 18.67 on 2 and 96 DF, p-value: 1.418e-07
The summary gives the coefficients, their standard errors, the p-value of a t-test of the regression coefficient, and R squared for the model. Factors are encoded as dummy variables, and R has picked the green group and the control treatment as baseline so the coefficient ”grouppink” describes how the mean of the pink group differs from the green. Here are the corresponding confidence intervals:
2.5 % 97.5 % (Intercept) 125.825271 157.31015 grouppink -68.001759 -31.50652 treatmentpixies 5.254271 41.79428
(These confidence intervals are not adjusted to control the family-wise error rate, though.) With only two factors, the above table is not that hard to read, but let’s show a graphical summary. Jared Lander’s coefplot gives us a summary of the coefficients:
install.packages("coefplot") ##only the first time library(coefplot) coefplot(model)
The bars are 2 standard deviations. This kind of plot gives us a quick look at the coefficients, and whether they are far from zero (and therefore statistically significant). It is probably more useful for models with many coefficients.
There is a bunch of diagnostic plots that you can make to check for gross violations of the above assumptions of the linear model. Two useful ones are the normal quantile-quantile plot of residuals, and the residuals versus fitted scatterplot:
library(ggplot2) qplot(sample=residuals(model), stat="qq")
The quantile plot compares the distribution of the residual to the quantiles of a normal distribution — if the residuals are normally distributed it will be a straight line.
Variance should be roughly equal fitted values, and there should not be obvious patterns in the data.
If these plots look terrible, a common approach is to try to find a transformation of the data that allows the linear model to be used anyway. For instance, it often helps to take the logarithm of the response variable. Why is that so useful? Well, with some algebraic magic:
log(yi) = b0 + b1x1i + b2x2i + … + bnxni + ei, and as long as no y:s are zero,
yi = exp(b0) * exp(b1x1i) * exp(b2x2i) * .. * exp(bnxni) * exp(ei)
We have gone from a linear model to a model where the b:s and x:es multiplied together. For some types of data, this will stabilise the variance of the errors, and make the distribution closer to a normal distribution. It’s by no means a panacea, but in the comb gnome case, I hope the plots we made in part II have already convinced you that an exponential function might be involved.
Let’s look at a model where these plots look truly terrible: the weight at time 50.
model.50 <- lm(mass50 ~ group + treatment, data=data) qplot(sample=residuals(model.50), stat="qq") qplot(fitted(model.50), residuals(model.50))
Let’s try the log transform:
model.log.50 <- lm(log(mass50) ~ group + treatment, data=data) qplot(sample=residuals(model.log.50), stat="qq") qplot(fitted(model.log.50), residuals(model.log.50)) coefplot(model.log.50)
In both the above models both predictors are categorical. When dealing with categorical predictors, you might prefer the analysis of variance formalism. Anova is the same kind of linear model as regression (but usually parameterised slightly differently), followed by F-tests to check whether each predictor explains a significant amount of the variance in the response variable. To see the Anova table for a linear model in R, do this:
comb.gnome.anova <- aov(log(mass50) ~ group + treatment, data=data) summary(comb.gnome.anova)
Df Sum Sq Mean Sq F value Pr(>F) group 1 35.87 35.87 73.26 1.84e-13 *** treatment 1 90.76 90.76 185.36 < 2e-16 *** Residuals 96 47.00 0.49 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In part I, we looked at importing data into R and simple ways to manipulate data frames. Once we’ve gotten our data safely into R, the first thing we want to do is probably to make some plots.
Below, we’ll make some simple plots of the made-up comb gnome data. If you want to play along, load the same file we used for part I.
data <- read.csv("comb_gnome_data.csv")
This table contains masses of 99 comb gnomes at four time points. Gnomes come in two varieties, and are divided into two groups with exposed to different treatment. Of course, I already know what kinds of effects and features are present in the data, since I generated them. However, please bear with me while we explore a little. The point is to show some useful things R can do for you.
4. Making simple plots with ggplot2
R is often used for graphics, with R plots gracing poster sessions all over the world with their presence. Base R can make pretty nice plots — mainly with the plot() function — but there are also several additional graphics packages. For this introduction, I will use ggplot2 by Hadley Wickham, because it’s my favourite and if you start learning one of them it might as well be ggplot2.
ggplot2 has a rather special syntax that allows a lot of fancy things, but for now we will stick to using the quick plot function, qplot(), which works pretty similar to base R graphics.
Before we can use the package, though, we need to install it. To install a package from CRAN (the comprehensive R archive network), simply give an install.packages() command:
A window opens (by the way, it is beyond me why R feels the need to open a window for this task, since it almost never does, but sure why not) that asks you to select a mirror. Choose the one closest to you. The package downloads, installs, and also installs some other packages that ggplot2 needs.
CRAN, by the way, contains binaries and source for lots and lots of packages. Bioinformatics packages, though, are usually installed through Bioconductor, which has its own system. Some packages live on R Forge, a community platform. You can also install packages manually. Remember though, R packages are software, so don’t just install unknown stuff.
We need to load a package to use it. The next time you use ggplot2, you only need to do this step:
Let’s get on with it. This is how to quickly make a scatterplot, a boxplot and a histogram. The graphs will open in a new window, or in your plot if you’re using Rstudio.
qplot(x=mass0, y=mass50, data=data)
qplot(x=group, y=mass0, geom="boxplot", data=data)
The x and y parameters of course refer to the x and y axis of the graph. geom is the parameter for selecting a particular type of graph. See the ggplot2 documentation for all available geoms. If you don’t specify a geom, ggplot2 will guess it based on what types of variables you supply.
The width of bins of a histogram can be problematic — the histogram may look rather different with different bins. Putting a binwidth parameter in qplot() allows you to change the bins.
The above plots suggest that pink and green comb gnomes differ in weight at time 0. A nice alternative to the boxplot is the dotplot (jittered to reduce overplotting). We make one with mass at time 50 by treatment:
qplot(x=treatment, y=mass50, geom="jitter", data=data)
An alternative to the histogram is a density plot:
qplot(x=mass50, geom="density", data=data)
Both the dotplot and the density plot show that the distribution of comb gnome masses at time 50 is skewed to the left, with many individuals with low mass, and a few very heavy ones.
Some of the more useful parameters to qplot() are xlab, ylab, xlim, ylim, and main. They allow you to set the title of the graph (main), the x and y axis labels (xlab and ylab), and adjust the scales (xlim and ylim). In this example changing the scales make little sense, but it is sometimes useful. (These are the same parameters as you would use in the base R plot() fuction, by the way.)
qplot(x=mass0, y=mass50, data=data, xlim=c(0,350), ylim=c(0,10000), xlab="mass at birth", ylab="mass at age 50", main="Comb gnome mass")
Some style options are very easy to change, and can help visualising structure of data. Here, we use the colour of the dots for treatment, and the shape for group.
qplot(x=mass0, y=mass50, colour=treatment, shape=group, data=data)
This plot, as well as the above jittered dotplot makes it perfectly clear that there is a difference in growth between comb gnomes that are exposed to pixietails and those who are not, but also huge variation within the group of exposed comb gnomes. We can also discern the difference in initial mass between green and pink comb gnomes.
5. Reshaping data, and some more plots
Now, let’s take the opportunity to introduce another wonderful package by Hadley Wickham: reshape2. It is best described with an example. reshape2 should have been installed with ggplot2. Load it like so:
The reshape2 package centers around the concept of ‘melting’ and ‘casting’ a data frame (or array). First, recall the form our data frame is in:
group treatment mass0 mass10 mass25 mass50 id 1 green control 180.9564 217.1795 285.5527 450.6075 1 2 green pixies 140.1157 279.0560 784.3293 4390.4606 2 3 green control 119.0070 125.7097 136.4782 156.5143 3 4 green pixies 220.7838 366.5834 784.2983 2786.0915 4 5 green pixies 210.7911 430.9156 1259.5120 7525.7961 5 6 green control 200.7478 249.3120 345.0496 593.0786 6
Then, look at this type of table:
group treatment variable value 1 green control mass0 180.9564 2 green pixies mass0 140.1157 3 green control mass0 119.0070 4 green pixies mass0 220.7838 5 green pixies mass0 210.7911 6 green control mass0 200.7478
The second format is the what reshape2 and ggplot2 calls ‘melted’ — each row is one value with a few indicator variables (id variables) that tell you which groups a data point belongs to. The ‘variable’ column contains the name of the columns of data — in this case masses at time 0, 10, 25, and 50, as well as id, which reshape2 guessed is one of the data columns; we’ll have to fix that below — and ‘value’ their values.
Imagine making a multiple regression — this is a bit like the format you would want then. The value would be the response, and the id variables would be predictors. This is very useful not only for regression modelling, but also for plots. Look at this one for instance, that comes quite naturally once the data have been melted.
melted <- melt(data, id.vars=c("id", "group", "treatment"))
qplot(x=variable, y=value, color=treatment, geom="jitter", data=melted)
We could of course show them as boxplots instead, with geom=”boxplot”. If time was really a categorical variable with four levels, that would be great. I don’t know about you, however, but I would like to see individual growth trajectories of the comb gnomes. To accomplish this, we make a new column in the melted data frame.
melted$time <- 0
It’s about time to introduce the which() function. which() gives you the indices of the elements of a vector that satisfy some logical expression. We get the indices of elements that belong to mass10, mass25 and mass50 and use them to assign the proper number to those elements of the time column.
(Note the ”==” here, used to denote equality in logical expressions. The logical equals sign must be double, or bad things will happen — i.e. R will understand the expression as an assignment rather than a comparision.)
melted$time[which(melted$variable=="mass10")] <- 10 melted$time[which(melted$variable=="mass25")] <- 25 melted$time[which(melted$variable=="mass50")] <- 50
Since the comb gnome ids are numbers, R imported them as a numeric column. We will need the id column shortly, so let’s make sure it is treated as categorical:
melted$id <- factor(melted$id)
We’re almost there. Looking at previous plots, comb gnome growth looks a lot like an exponential function. So let’s try taking the natural logarithm. If you don’t want to change your data frame, functions can be applied in the qplot() call:
> qplot(x=time, y=log(value), geom=”line”, color=id, data=melted)
This is a busy plot, and the legend doesn’t help. You can remove it with the following line. (This is a taste of the finer details of ggplot2. Actually, the output of qplot() is an object that can be manipulated with the addition operator.)
qplot(x=time, y=log(value), geom="line", color=id, data=melted) + theme(legend.position="none")
Anyway, there seems to be method to the madness — the growth of each comb gnome appears roughly linear on a log scale.
We’re done with that for now. I hope this gave a taste of the usefulness of melting data. Even better, once you have melted data, you can ‘cast’ it into some other tabular format. This is particularly useful for data that are provided in a ‘melted’ format, when you really want a cross-table.
We do this with dcast(). The ‘d’ stands for data frame. (There is also an acast() for arrays.) dcast() needs a melted data frame and a formula, consisting of the variables you want in rows, a tilde ”~”, and the variables you want in columns. Try these:
head(dcast(melted, variable ~ id)) head(dcast(melted, id ~ variable)) head(dcast(melted, id + group + treatment ~ variable))
In cases where the variables on one side do not uniquely identify data points, you can use dcast() for data summaries, by choosing some appropriate aggregation function:
dcast(melted, group + treatment ~ variable, fun.aggregate=mean) dcast(melted, group + treatment ~ variable, fun.aggregate=sum)
Overall, reshape2 can save you a lot of time and headache, even though it’s not always completely intuitive. It’s not something you will use every day, but keep it in mind when the problem arises.
6. Saving data
Before we move any further, let’s talk about saving data:
1. Saving graphics. Depending on what platform you work on, there will be different interface features to save the plot you’re looking at. In the windows R interface, you can right click to get a menu that allows you to copy the graph. On mac OS X, you can save the graph with an option in the menu bar. In RStudio, you can press the export button above the plot area. On windows and unix, you can write
with several image types. See help file for details. On any platform, you can redirect the graphics from the screen to a file, run your plot commands and then send control back to the screen:
pdf("some_filename.pdf") qplot(x=mass10, y=mass25, data=data) dev.off()
2. Saving tables. If you have modified a data frame and want to save it to a file (for instance, you might like to use R to melt or cast a table for some other software that needs a particular format), you can use write.table() or write.csv().
write.csv(melted, file="melted_data.csv", row.names=F, quote=F)
3. Saving R objects. This is useful when you want to switch between scripts and sessions. Just write
save(data, melted, file="comb.gnome.data.Rdata")
to save the content of variables ”data” and ”melted”. You can list several variables or just one. If you want to continue with the next part, save the melted data to a file like that. When you reopen R, you can load them with:
4. Saving the entire workspace. You can also save the entire contents of your workspace: all the variables you currently have assigned and the history of commands that you’ve given. But frankly, I don’t recommend it. Saving your workspace means that you can pick up exactly where you left off, but the workspace quickly turns into a mess of forgotten variables, and it’s very difficult to retrace the train of thought in an unannotated command history.
I much prefer saving selected variables on files, and making each analysis a self-contained script. In the next part, I’ll try to show you how. Now that we’ve covered a few useful thing one can do with R, it is time to have a look at how to organise R code into simple scripts, and introduce Sweave.