# There is grandeur in this view of life

martins bioblogg

## Using R: Coloured sizeplot with ggplot2

Someone asked about this and I thought the solution with ggplot2 was pretty neat. Imagine that you have a scatterplot where some points fall on exactly the same coordinates, and to reduce overplotting you want the size of each dot to indicate the number of data points that fall on it. At the same time, you want to colour the points according to some categorical variable.

The sizeplot function in the plotrix package makes this type of scatterplot. However, it doesn’t do the colouring easily. I’m sure it’s quite possible with a better knowledge of base graphics, but I tend to prefer ggplot2. To construct the same type of plot we need to count the data points. For this, I use table( ), and then melt the contingency table and remove the zeroes.

```
library(ggplot2)
library(reshape2)
data <- data.frame(x=c(0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4),
                   y=c(0, 0, 0, 3, 1, 1, 1, 2, 2, 1, 4, 4),
                   group=c(rep(1, 6), rep(2, 4), rep(3, 2)))
counts <- melt(table(data[1:2]))
colnames(counts) <- c(colnames(data)[1:2], "count")
counts <- subset(counts, count != 0)
sizeplot <- qplot(x=x, y=y, size=count, data=counts) + scale_size(range=c(5, 10))
```

This is the first sizeplot. (The original scale makes single points very tiny. Hence the custom scale for size. Play with the range values to taste!) To add colour, we merge the counts with the original data to get back the group information — and, in true ggplot2 fashion, map the group variable to colour.

```
counts.and.groups <- merge(counts, unique(data))
sizeplot.colour <- qplot(x=x, y=y, size=count,
                         colour=factor(group), data=counts.and.groups) +
  scale_size(range=c(5, 10))
```

One thing that this simple script does not handle well is points that should have different colours happening to overlap. (As it stands, the code will actually plot two points in different colours on top of each other, both sized by the total number of overlapping points. That must be wrong in several ways.) However, I don’t know what the best behaviour would be in this instance. Maybe to count the number of overlaps separately for each group, and plot both points while adding some transparency?
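
Here is a sketch of that idea: tabulate counts per group as well, so that overlapping points from different groups each get their own count, and set some fixed transparency (the alpha value is just something to play with):

```
## Count per (x, y, group) combination instead of per (x, y),
## so points of different groups keep separate counts.
counts.by.group <- melt(table(data))
colnames(counts.by.group) <- c("x", "y", "group", "count")
counts.by.group <- subset(counts.by.group, count != 0)
## alpha=I(0.5) sets a constant transparency rather than mapping a variable.
qplot(x=x, y=y, size=count, colour=factor(group), alpha=I(0.5),
      data=counts.by.group) +
  scale_size(range=c(5, 10))
```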

## A slightly different introduction to R, part V: plotting and simulating linear models

In the last episode (which was quite some time ago) we looked into comparisons of means with linear models. This time, let’s visualise some linear models with ggplot2, and practice another useful R skill, namely how to simulate data from known models. While doing this, we’ll learn some more about the layered structure of a ggplot2 plot, and some useful things about the lm() function.

### 11. Using points, lines and error bars to show predictions from linear models

Return to the model of comb gnome mass at time zero. We’ve already plotted the coefficient estimates, but let us just look at them with the coef() function. Here the intercept term is the mean for green comb gnomes subjected to the control treatment. The ‘grouppink’ and ‘treatmentpixies’ coefficients are the mean differences of pink comb gnomes and comb gnomes exposed to pixies from this baseline condition. This way of assigning coefficients is called dummy coding and is the default in R.

```
model <- lm(mass0 ~ group + treatment, data)
coef(model)
```
```
    (Intercept)       grouppink treatmentpixies
      141.56771       -49.75414        23.52428
```

The estimate for a pink comb gnome with pixies is:

`coef(model)[1] + coef(model)[2] + coef(model)[3]`

There are alternative codings (”contrasts”) that you can use. A common one in Anova is to use the intercept as the grand mean and the coefficients as deviations from the mean. (So that the coefficients for different levels of the same factor sum to zero.) We can get this setting in R by changing the contrasts option, and then rerun the model. However, whether the coefficients are easily interpretable or not, they still lead to the same means, and we can always calculate the values of the combinations of levels that interest us.
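
For instance, here is a sketch of what switching contrasts might look like (contr.sum is the sum-to-zero coding for unordered factors; the last line switches back to the default):

```
## Refit with sum-to-zero contrasts: the coefficients change,
## but the implied group means stay the same.
options(contrasts=c("contr.sum", "contr.poly"))
model.sum <- lm(mass0 ~ group + treatment, data)
coef(model.sum)
## Back to the default dummy coding.
options(contrasts=c("contr.treatment", "contr.poly"))
```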

Instead of typing in the formulas ourselves as above, we can get predictions from the model with the predict( ) function. We need a data frame of the new values to predict; in this case, that means one row for each combination of the levels of group and treatment. Since we have two levels each, there are only four of them, but in general we can use the expand.grid( ) function to generate all possible combinations of factor levels. We then get the predictions and their confidence intervals, and bundle everything together into one handy data frame.

```
levels <- expand.grid(group=c("green", "pink"), treatment=c("control", "pixies"))
predictions <- predict(model, levels, interval="confidence")
predicted.data <- cbind(levels, predictions)
```
```
  group treatment       fit       lwr      upr
1 green   control 141.56771 125.82527 157.3101
2  pink   control  91.81357  76.48329 107.1439
3 green    pixies 165.09199 149.34955 180.8344
4  pink    pixies 115.33785  98.93425 131.7414
```

Now that we have these intervals in a data frame, we can plot them just like we would any other values. Back in part II, we put several categorical variables into the same plot by colouring the points. Now, let’s introduce a nice feature of ggplot2: making small multiples with faceting. qplot( ) takes a facets argument, which is a formula where the left hand side, before the tilde (‘~’), will be used to split the plot vertically, and the right hand side will split the plot horizontally. In this case, we split horizontally, with each panel representing one level of the group variable. Also, we use a new geometry: pointrange, which draws a point with bars above and below it, and is quite suitable for the intervals we’ve got.

```
qplot(x=treatment, facets=~group,
      y=fit, ymax=upr, ymin=lwr,
      geom="pointrange", data=predicted.data)
```

That’s good, but combining the predictions from the model and the actual data in the same plot would be nice. In ggplot2, every plot is an object that can be saved away to a variable. Then we can use the addition operator to add layers to the plot. Let’s make a jittered dotplot like the above and then add a layer with the pointrange geometry displaying confidence intervals. The scatter of the data points around the confidence intervals reminds us that there is quite a bit of residual variance. The coefficient of determination, as seen in the summary earlier, was about 0.25.

```
qplot(x=treatment, y=mass0, facets=~group, geom="jitter", data=data) +
  geom_pointrange(aes(y=fit, ymax=upr, ymin=lwr), colour="red", data=predicted.data)
```

In the above, we make use of ggplot2’s more advanced syntax for specifying plots. The addition operator adds layers. The first layer can be set up with qplot(), but the following layers are made with their respective functions. Mappings from variables to features of the plot, called aesthetics, have to be put inside the aes() function. This might look a bit weird in the beginning, but it has its internal logic — all of this is described in Hadley Wickham’s ggplot2 book.
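
For comparison, here is a sketch of the same plot written entirely in the layered syntax, with ggplot() instead of qplot():

```
## Equivalent layered version: the base aesthetics come from the data,
## and the pointrange layer brings its own data frame and y mappings.
ggplot(data, aes(x=treatment, y=mass0)) +
  geom_jitter() +
  facet_wrap(~ group) +
  geom_pointrange(aes(y=fit, ymax=upr, ymin=lwr),
                  colour="red", data=predicted.data)
```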

We should probably try a regression line as well. The abline geometry allows us to plot a line with given intercept and slope, i.e. the coefficients of a simple regression. Let us simplify a little and look at the mass at time zero and the log-transformed mass at time 50 in only the green group. We make a linear model that uses the same slope for both treatments and a treatment-specific intercept. (Exercise for the reader: look at the coefficients with coef( ) and verify that I’ve pulled out the intercepts and slope correctly.) Finally, we plot the points with qplot and add the lines one layer at a time.

```
green.data <- subset(data, group=="green")
model.green <- lm(log(mass50) ~ mass0 + treatment, green.data)
intercept.control <- coef(model.green)[1]
intercept.pixies <- coef(model.green)[1] + coef(model.green)[3]
qplot(x=mass0, y=log(mass50), colour=treatment, data=green.data) +
  geom_abline(intercept=intercept.pixies, slope=coef(model.green)[2]) +
  geom_abline(intercept=intercept.control, slope=coef(model.green)[2])
```

### 12. Using pseudorandom numbers for sanity checking

There is a short step from playing with regression functions that we’ve fitted, like we did above, to making up hypothetical regression functions and simulating data from them. This type of fake-data simulation is very useful for testing how designs and estimation procedures behave, and for checking things like the control of the false positive rate and the power to estimate a known model accurately.

The model will be the simplest possible: a single categorical predictor with only two levels and normally distributed equal error variance, i.e. a t-test. There is a formula for the power of the t-test and an R function, power.t.test( ), that calculates it for us without the need for simulation. However, a nice thing about R is that we can pretty easily replace the t-test with more complex procedures. Any model fitting process that you can program in R can be bundled into a function and applied to pseudorandom simulated data. In the next episode we will go into how to make functions and apply them repeatedly.
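
For reference, the analytical calculation looks like this (a sketch: the effect size delta and the standard deviation are made-up values for illustration):

```
## Analytical power of a two-sample t-test with 50 per group.
## delta and sd here are illustrative placeholders, not values from the data.
power.t.test(n=50, delta=25, sd=45)
```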

Let us start out with a no effect model: 50 observations in two groups drawn from the same distribution. We use the mean and variance of the green control group. This first part just sets up the variables:

```
mu <- mean(subset(data, group=="green" & treatment=="control")$mass0)
sigma <- sd(subset(data, group=="green" & treatment=="control")$mass0)
treatment <- c(rep(1, 50), rep(0, 50))
```

The rnorm( ) function generates numbers from a normal distribution with specified mean and standard deviation. Apart from drawing numbers from it, R can of course pull out various table values, and it knows other distributions as well. Look at the documentation in ?distributions. Finally we perform a t-test. Most of the time, it should not show a significant effect, but sometimes it will.

```
sim.null <- rnorm(100, mu, sigma)
t.test(sim.null ~ treatment)$p.value
```
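
As an aside, rnorm( ) has siblings for the density, the cumulative probability and the quantiles of the normal distribution. They are standard R, shown here just for reference:

```
dnorm(0)                 # density of the standard normal at zero
pnorm(1.96)              # probability below 1.96
qnorm(0.975)             # value with 97.5% of the probability below it
rnorm(5, mean=10, sd=2)  # five random draws
```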

We can use the replicate( ) function to evaluate an expression multiple times. We put the simulation and t-test together into one expression, rinse and repeat. Finally, we check how many of the 1000 replicates gave a p-value below 0.05. Of course, it will be approximately 5% of them.

```
sim.p <- replicate(1000, t.test(rnorm(100, mu, sigma) ~ treatment)$p.value)
length(which(sim.p < 0.05))/1000
```
`[1] 0.047`

Let us add an effect! Say we’re interested in an effect that we expect to be approximately half the difference between the green and pink comb gnomes:

```
d <- mean(subset(data, group=="green" & treatment=="control")$mass0) -
  mean(subset(data, group=="pink" & treatment=="control")$mass0)
sim.p.effect <- replicate(1000, t.test(treatment * d/2 +
  rnorm(100, mu, sigma) ~ treatment)$p.value)
length(which(sim.p.effect < 0.05))/1000
```
`[1] 0.737`

We see that with 50 individuals in each group and this effect size, we will detect a significant difference about 75% of the time. This is the power of the test. If you are able to find nice and trustworthy prior information about the kinds of effect sizes and variances you expect to find in a study, design analysis allows you to calculate, for instance, how big a sample you need to have good power. Simulation can also give you an idea of how badly a statistical procedure will break if the assumptions don’t hold. We can try to simulate a situation where the variances of the two groups differ quite a bit.

```
sim.unequal <- replicate(1000, t.test(c(rnorm(50, mu, sigma),
  rnorm(50, mu, 2*sigma)) ~ treatment)$p.value)
length(which(sim.unequal < 0.05))/1000
```
`[1] 0.043`
```
sim.unequal.effect <- replicate(1000, t.test(c(rnorm(50, mu+d/2, sigma),
  rnorm(50, mu, 2*sigma)) ~ treatment)$p.value)
length(which(sim.unequal.effect < 0.05))/1000
```
`[1] 0.373`

In conclusion, the significance level is still under control, but the power has dropped to below 40%. I hope this has given a small taste of how simulation can help with figuring out what is going on in our favourite statistical procedures. Have fun!

## R intro seminars, take 2: some slides about data frames, linear models and statistical graphics

I am doing a second installment of the lunch seminars about data analysis with R for the members of the Wright lab. It’s pretty much the same material as before — data frames, linear models and some plots with ggplot2 — but I’ve sprinkled in some more exercises during the seminars. I’ve tried emphasising scripting a bit more than last time, and made a lot of use of RStudio. Going through this first part has taken four hours, but that includes a quick review at the start of each seminar of what we did last time, and lots of questions. Next week we’ll get started on gene expression microarray data, and I’ll try introducing both limma and plyr.

(My previous introduction materials are posted here. Comments, suggestions and ideas about teaching R to biologists are always welcome!)

## Using R: drawing several regression lines with ggplot2

Occasionally I find myself wanting to draw several regression lines on the same plot, and of course ggplot2 has convenient facilities for this. As usual, don’t expect anything profound from this post, just a quick tip!

There are several reasons we might end up with a table of regression coefficients connecting two variables in different ways. For instance, see the previous post about ordinary and orthogonal regression lines, or as a commenter suggested: quantile regression. I’ve never used quantile regression myself, but another example might be plotting simulations from a regression or multiple regression lines for different combinations of predictors.

Let’s start with a couple of quantile regressions. Ordinary regression compares the mean difference in a response variable between different values of the predictors, while quantile regression models some chosen quantiles of the response variable. The rq function of Roger Koenker’s quantreg package does quantile regression. We extract the coefficient matrix and make a dataframe:

```
library(quantreg)
model.rq <- rq(Temp ~ Wind, airquality, tau=c(0.25, 0.5, 0.75))
quantile.regressions <- data.frame(t(coef(model.rq)))
colnames(quantile.regressions) <- c("intercept", "slope")
quantile.regressions$quantile <- rownames(quantile.regressions)
quantile.regressions
```
```
          intercept     slope  quantile
tau= 0.25  85.63636 -1.363636 tau= 0.25
tau= 0.50  93.03448 -1.379310 tau= 0.50
tau= 0.75  94.50000 -1.086957 tau= 0.75
```

The addition of the quantile column is optional if you don’t feel the need to colour the lines.

```
library(ggplot2)
scatterplot <- qplot(x=Wind, y=Temp, data=airquality)
scatterplot + geom_abline(aes(intercept=intercept, slope=slope,
                              colour=quantile), data=quantile.regressions)
```

We use the fact that ggplot2 returns the plot as an object that we can play with and add the regression line layer, supplying not the raw data frame but the data frame of regression coefficients.

## "How to draw the line" with ggplot2

In a recent tutorial in the eLife journal, Huang, Rattner, Liu & Nathans suggested that researchers who draw scatterplots should start providing not one but three regression lines. I quote:

> Plotting both regression lines gives a fuller picture of the data, and comparing their slopes provides a simple graphical assessment of the correlation coefficient. Plotting the orthogonal regression line (red) provides additional information because it makes no assumptions about the dependence or independence of the variables; as such, it appears to more accurately describe the trend in the data compared to either of the ordinary least squares regression lines.

Not that new, but I do love a good scatterplot, so I decided to try drawing some lines. I use the temperature and wind variables in the airquality data set (NA values removed). We will need Hadley Wickham’s ggplot2 and Bendix Carstensen’s, Lyle Gurrin’s and Claus Ekstrom’s MethComp.

```
library(ggplot2)
library(MethComp)

data(airquality)
data <- na.exclude(airquality)
```

Let’s first make the regular old scatterplot with a regression (temperature as response; wind as predictor):

```
plot.y <- qplot(y=Temp, x=Wind, data=data)
model.y <- lm(Temp ~ Wind, data)
coef.y <- coef(model.y)
plot.y + geom_abline(intercept=coef.y[1],
                     slope=coef.y[2])
```

And then, a regular old scatterplot of the other regression:

```
plot.x <- qplot(y=Wind, x=Temp, data=data)
model.x <- lm(Wind ~ Temp, data)
coef(model.x)
plot.x + geom_abline(intercept=coef(model.x)[1],
                     slope=coef(model.x)[2])
```

So far, everything is normal. To put both lines in the same plot, we’ll need to rearrange the coefficients a little. From the above regression we get the equation x = a + by, and we rearrange it to y = -a/b + (1/b)x.

```
rearrange.coef <- function(coef) {
  alpha <- coef[1]
  beta <- coef[2]
  new.coef <- c(-alpha/beta, 1/beta)
  names(new.coef) <- c("intercept", "slope")
  return(new.coef)
}
coef.x <- rearrange.coef(coef(model.x))
```

The third regression line is different: orthogonal, total least squares or Deming regression. There is a function for that in the MethComp package.

```
deming <- Deming(y=airquality$Temp, x=airquality$Wind)
deming
```

We can even use the rearrange.coef function above to see that the coefficients of Deming regression do not depend on which variable is taken as the response and which as the predictor:

```
rearrange.coef(deming)
```
```
 intercept      slope
24.8083259 -0.1906826
```
`Deming(y=airquality$Wind, x=airquality$Temp)[1:2]`
```
 Intercept      Slope
24.8083259 -0.1906826
```

So, here is the final plot with all three lines:

```
plot.y + geom_abline(intercept=coef.y[1],
                     slope=coef.y[2], colour="red") +
  geom_abline(intercept=coef.x[1],
              slope=coef.x[2], colour="blue") +
  geom_abline(intercept=deming[1],
              slope=deming[2], colour="purple")
```

Now for the dénouement of this post. Of course, there’s already an R function for doing this. It’s even in the same package as I used for the Deming regression; I just didn’t immediately rtfm. It uses base R graphics, though, and I tend to prefer ggplot2:

```
plot(x=data$Wind, y=data$Temp)
bothlines(x=data$Wind, y=data$Temp, Dem=T, col=c("red", "blue", "purple"))
```

This is the resulting plot (and one can even check out the source code of bothlines and see that I did my algebra correctly).

In this case, as well as in a few others that I tried, it seems like one of the ordinary regression lines (in this case the one with wind as response and temperature as predictor) is much closer to the Deming regression line than the other. I wonder under what circumstances that is the case, and if it tells anything useful about the variables. I welcome any thoughts on this matter from you, dear reader.

Literature

Huang L, Rattner A, Liu H, Nathans J. (2013) Tutorial: How to draw the line in biomedical research. eLife e00638 doi:10.7554/eLife.00638

## Slides and exercise from my second R intro seminar

This week I held the second introductory seminar on R, and I think it went pretty well — though I guess you really should ask my colleagues if you want to know. The first seminar was a lecture, and this seminar was a tutorial where we made some plots and calculated a few of the usual statistics. Of course the only real way to learn R is to play with it, but I hope this couple of hours provided a decent opening to getting started with R.

I actually think RStudio made it quite a bit easier. One important thing that I certainly should’ve stressed more, though, is organising code into scripts. I mentioned it in the first seminar, but I should have included it in the exercise. Maybe the first section should be something like "Start a new script file for your analysis: Select File > New > R script to open the editor area. Save the script as unicorn_analysis.R, and for the rest of the tutorial, write your code in that window."

Files from the seminar:

Slides about common statistical functions in R (again, ugly walls of text meant more as notes for future reference than actual slides)

Exercises for the tutorial and the associated dataset

My suggested solutions to the exercises

## Using R: Correlation heatmap with ggplot2

Just a short post to celebrate that I learned today how incredibly easy it is to make a heatmap of correlations with ggplot2 (and reshape2, of course).

```
data(attitude)
library(ggplot2)
library(reshape2)
qplot(x=Var1, y=Var2, data=melt(cor(attitude)), fill=value, geom="tile")
```

So, what is going on in that short passage? cor() makes a correlation matrix with all the pairwise correlations between variables (each appearing twice, plus a diagonal of ones). melt() takes the matrix and creates a data frame in long form, each row consisting of the id variables Var1 and Var2 and a single value. We then plot with the tile geometry, mapping the id variables to rows and columns, and the value (i.e. the correlations) to the fill colour.
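
One optional tweak, as a sketch (not part of the one-liner above): put the correlations on a symmetric colour scale from -1 to 1, which makes the sign easier to read.

```
## Diverging fill scale centred on zero, fixed to the range of correlations.
qplot(x=Var1, y=Var2, data=melt(cor(attitude)), fill=value, geom="tile") +
  scale_fill_gradient2(limits=c(-1, 1))
```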

## A slightly different introduction to R, part IV

Now, after reading in data, making plots and organising commands with scripts and Sweave, we’re ready to do some numerical data analysis. If you’re following this introduction, you’ve probably been waiting for this moment, but I really think it’s a good idea to start with graphics and scripting before statistical calculations.

We’ll use the silly comb gnome dataset again. If you saved an Rdata file in part II, you can load it with

`load("comb_gnome_data.Rdata")`

If not, you can run this. Remind yourself what the changes to the melted data mean:

```
data <- read.csv("comb_gnome_data.csv")
library(reshape2)
melted <- melt(data, id.vars=c("id", "group", "treatment"))
melted$time <- 0
melted$time[which(melted$variable=="mass10")] <- 10
melted$time[which(melted$variable=="mass25")] <- 25
melted$time[which(melted$variable=="mass50")] <- 50
melted$id <- factor(melted$id)
```

We’ve already looked at some plots and figured out that there seem to be substantial differences in mass between the green and pink groups, and between the control and treatment. Let’s try to substantiate that with some statistics.

### 9. Mean and covariance

Just like anything in R, statistical tools are functions. Some of them come in special packages, but base R can do a lot of stuff out of the box.

Comparison of two means: We’ve already gotten means from the mean() function and from summary(). Variance and standard deviation are calculated with var() and sd() respectively. Comparing the means between two groups with a t-test or a Wilcoxon-Mann-Whitney test is done with t.test() and wilcox.test(). The functions have the word test in their names, but t.test() gives not only the test statistic and p-value, but also estimates and confidence intervals. The parameters are two vectors with the values of each group (i.e. a column from a subset of a data frame), and some options.
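
For instance, on one group of the comb gnome data:

```
mean(subset(data, group=="pink")$mass0)  # mean birth mass of the pink group
var(subset(data, group=="pink")$mass0)   # variance
sd(subset(data, group=="pink")$mass0)    # standard deviation
```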

Looking back at this plot, I guess no-one is surprised by a difference in birth weight between pink and green comb gnomes:

```
t.test(subset(data, group=="pink")$mass0, subset(data, group=="green")$mass0)
```
```
	Welch Two Sample t-test

data:  subset(data, group == "pink")$mass0 and subset(data, group == "green")$mass0
t = -5.397, df = 96.821, p-value = 4.814e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -69.69294 -32.21577
sample estimates:
mean of x mean of y
 102.3755  153.3298
```

That is, we feed in two pieces of data (two vectors, really, which is what you get pulling out a column from a data frame). The above is the typical situation when you have all data points in one column and a group indicator in another. Hence you begin by subsetting the data frame to get the right rows, and pull out the right columns with the $. t.test() also does paired tests, with the additional parameter paired=T.
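
A paired test might look like this sketch, pairing each comb gnome’s mass at time 10 with its mass at birth:

```
## Paired t-test: both columns refer to the same individuals, in the same order.
t.test(data$mass10, data$mass0, paired=T)
```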

`wilcox.test(subset(data, group=="pink")$mass50, subset(data, group=="green")$mass50)`
```
	Wilcoxon rank sum test with continuity correction

data:  subset(data, group == "pink")$mass50 and subset(data, group == "green")$mass50
W = 605, p-value = 1.454e-05
alternative hypothesis: true location shift is not equal to 0
```

Recalling the histograms of the comb gnome weights, the use of the Wilcoxon-Mann-Whitney test for the masses at time 50 and a t-test for the masses at birth (time 0) probably makes sense. However, we probably want to make use of all the time points together rather than doing a test for each time point, and we also want to deal with both the colour and the treatment at the same time.

Before we get there, let’s look at correlation:

```
cor(data$mass10, data$mass25)
cor(data$mass0, data$mass50, method="spearman")
cor.test(data$mass10, data$mass25)
```

The cor() function gives you correlation coefficients: Pearson, Spearman and Kendall. If you want the covariance, cov() is the function for that. cor.test() does the associated tests and confidence intervals. One thing to keep in mind is missing values. This data set is complete, but try this:

```
some.missing <- data$mass0
some.missing[c(10, 20, 30, 40:50)] <- NA
cor(some.missing, data$mass25)
cor(some.missing, data$mass10, use="pairwise")
```

The use parameter decides what values R should include. The default is all, but we can choose pairwise complete observations instead.

If you have a big table of variables that you’d like to correlate with each other, the cor() function works for them as well. (Not cor.test(), though. However, the function can be applied across the rows of a data frame. We’ll return to that.)
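
For example, all the pairwise correlations among the mass columns at once:

```
cor(data[, c("mass0", "mass10", "mass25", "mass50")])
```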

### 10. A couple of simple linear models

Honestly, most of the statistics in biology is simply linear models fit with least squares and tested with a normal error model. A linear model looks like this:

y_i = b_0 + b_1 x_1i + b_2 x_2i + … + b_n x_ni + e_i

where y is the response variable, the x:s are predictors, i is an index over the data points, and the e_i are the errors. The error is the only part of the equation that is a random variable. b_0, …, b_n are the coefficients — your main result, showing the mean difference in the response variable between data points with different values of the predictors. The coefficients are fit by least squares, and by estimating the variance of the error term, we can get some idea of the uncertainty in the coefficients.

Regression coefficients can be interpreted as predictions about future values or sometimes even as causal claims (depending on other assumptions), but basically, they describe differences in mean values.

This is not a text on linear regression — there are many of those; may I suggest the books by Faraway or Gelman and Hill — suffice to say that as long as the errors are independent and have equal variance, least squares is the best linear unbiased estimator. If we also assume that the errors are normally distributed, least squares is also the maximum likelihood estimate. (And it’s essentially the same as a Bayesian version of the linear model with vague priors, just for the record.)

In R, the lm() function handles linear models. The model is entered as a formula of the type response ~ predictor + predictor, with an asterisk (*) between interacting predictors. The error is implicit, and assumed to be normally distributed.

```
model <- lm(mass0 ~ group + treatment, data=data)
summary(model)
```
```
Call:
lm(formula = mass0 ~ group + treatment, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-86.220 -32.366  -2.847  35.445  98.417

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      141.568      7.931  17.850  < 2e-16 ***
grouppink        -49.754      9.193  -5.412 4.57e-07 ***
treatmentpixies   23.524      9.204   2.556   0.0122 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 45.67 on 96 degrees of freedom
Multiple R-squared:  0.28,    Adjusted R-squared:  0.265
F-statistic: 18.67 on 2 and 96 DF,  p-value: 1.418e-07
```

The summary gives the coefficients, their standard errors, the p-values of t-tests of the regression coefficients, and R squared for the model. Factors are encoded as dummy variables, and R has picked the green group and the control treatment as the baseline, so the coefficient "grouppink" describes how the mean of the pink group differs from the green. Here are the corresponding confidence intervals:

`confint(model)`
```
                     2.5 %    97.5 %
(Intercept)     125.825271 157.31015
grouppink       -68.001759 -31.50652
treatmentpixies   5.254271  41.79428
```

(These confidence intervals are not adjusted to control the family-wise error rate, though.) With only two factors, the above table is not that hard to read, but let’s show a graphical summary. Jared Lander’s coefplot gives us a summary of the coefficients:

```
install.packages("coefplot")  ## only the first time
library(coefplot)
coefplot(model)
```

The bars are 2 standard errors. This kind of plot gives us a quick look at the coefficients, and whether they are far from zero (and therefore statistically significant). It is probably more useful for models with many coefficients.

There is a bunch of diagnostic plots that you can make to check for gross violations of the above assumptions of the linear model. Two useful ones are the normal quantile-quantile plot of residuals, and the residuals versus fitted scatterplot:

```
library(ggplot2)
qplot(sample=residuals(model), stat="qq")
```

The quantile plot compares the distribution of the residuals to the quantiles of a normal distribution — if the residuals are normally distributed, the points will fall on a straight line.

```
qplot(fitted(model), residuals(model))
```

The variance should be roughly equal across fitted values, and there should not be obvious patterns in the data.

If these plots look terrible, a common approach is to try to find a transformation of the data that allows the linear model to be used anyway. For instance, it often helps to take the logarithm of the response variable. Why is that so useful? Well, with some algebraic magic:

log(y_i) = b_0 + b_1 x_1i + b_2 x_2i + … + b_n x_ni + e_i, and as long as no y:s are zero,

y_i = exp(b_0) * exp(b_1 x_1i) * exp(b_2 x_2i) * … * exp(b_n x_ni) * exp(e_i)

We have gone from a linear model to a model where the b:s and x:s are multiplied together. For some types of data, this will stabilise the variance of the errors, and make their distribution closer to a normal distribution. It’s by no means a panacea, but in the comb gnome case, I hope the plots we made in part II have already convinced you that an exponential function might be involved.

Let’s look at a model where these plots look truly terrible: the weight at time 50.

```
model.50 <- lm(mass50 ~ group + treatment, data=data)
qplot(sample=residuals(model.50), stat="qq")
qplot(fitted(model.50), residuals(model.50))
```

Let’s try the log transform:

```
model.log.50 <- lm(log(mass50) ~ group + treatment, data=data)
qplot(sample=residuals(model.log.50), stat="qq")
qplot(fitted(model.log.50), residuals(model.log.50))
coefplot(model.log.50)
```

In both the above models, both predictors are categorical. When dealing with categorical predictors, you might prefer the analysis of variance formalism. Anova is the same kind of linear model as regression (but sometimes parameterised slightly differently), followed by F-tests to check whether each predictor explains a significant amount of the variance in the response variable. In all the above models, the categorical variables only have two levels each, so interpretation is easy by just looking at the coefficients. When you get to bigger models with lots of levels, F-tests let you test the effect of a ‘batch’ of coefficients corresponding to a variable. To see the Anova table (type II, meaning that we test each variable against the model including all the other variables) for a linear model in R, do this:

```
comb.gnome.anova <- aov(log(mass50) ~ group + treatment, data=data)
drop1(comb.gnome.anova, test="F")
```
```
Single term deletions

Model:
log(mass50) ~ group + treatment
          Df Sum of Sq     RSS     AIC F value    Pr(>F)
<none>                  47.005 -67.742
group      1    30.192  77.197 -20.627  61.662 5.821e-12 ***
treatment  1    90.759 137.764  36.712 185.361 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

## A slightly different introduction to R, part II

In part I, we looked at importing data into R and simple ways to manipulate data frames. Once we’ve gotten our data safely into R, the first thing we want to do is probably to make some plots.

Below, we’ll make some simple plots of the made-up comb gnome data. If you want to play along, load the same file we used for part I.

`data <- read.csv("comb_gnome_data.csv")`

This table contains the masses of 99 comb gnomes at four time points. Gnomes come in two varieties, and are divided into two groups exposed to different treatments. Of course, I already know what kinds of effects and features are present in the data, since I generated them. However, please bear with me while we explore a little. The point is to show some useful things R can do for you.

### 4. Making simple plots with ggplot2

R is often used for graphics, with R plots gracing poster sessions all over the world with their presence. Base R can make pretty nice plots — mainly with the plot() function — but there are also several additional graphics packages. For this introduction, I will use ggplot2 by Hadley Wickham, because it’s my favourite and if you start learning one of them it might as well be ggplot2.

ggplot2 has a rather special syntax that allows a lot of fancy things, but for now we will stick to using the quick plot function, qplot(), which works pretty much like base R graphics.

Before we can use the package, though, we need to install it. To install a package from CRAN (the comprehensive R archive network), simply give an install.packages() command:

`install.packages("ggplot2")`

A window opens (by the way, it is beyond me why R feels the need to open a window for this task, since it almost never does, but sure why not) that asks you to select a mirror. Choose the one closest to you. The package downloads, installs, and also installs some other packages that ggplot2 needs.

CRAN, by the way, contains binaries and source for lots and lots of packages. Bioinformatics packages, though, are usually installed through Bioconductor, which has its own system. Some packages live on R Forge, a community platform. You can also install packages manually. Remember though, R packages are software, so don’t just install unknown stuff.

We need to load a package to use it. The next time you use ggplot2, you only need to do this step:

`library(ggplot2)`

Let’s get on with it. This is how to quickly make a scatterplot, a boxplot and a histogram. The graphs will open in a new window, or in the plot pane if you’re using RStudio.

`qplot(x=mass0, y=mass50, data=data)`

`qplot(x=group, y=mass0, geom="boxplot", data=data)`

`qplot(x=mass0, data=data)`

The x and y parameters of course refer to the x and y axis of the graph. geom is the parameter for selecting a particular type of graph. See the ggplot2 documentation for all available geoms. If you don’t specify a geom, ggplot2 will guess it based on what types of variables you supply.

The width of bins of a histogram can be problematic — the histogram may look rather different with different bins. Putting a binwidth parameter in qplot() allows you to change the bins.
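
For example (the binwidth value of 20 is just something to play with; it is on the scale of the variable):

```
qplot(x=mass0, data=data, binwidth=20)
```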

The above plots suggest that pink and green comb gnomes differ in weight at time 0. A nice alternative to the boxplot is the dotplot (jittered to reduce overplotting). We make one with mass at time 50 by treatment:

`qplot(x=treatment, y=mass50, geom="jitter", data=data)`

An alternative to the histogram is a density plot:

`qplot(x=mass50, geom="density", data=data)`

Both the dotplot and the density plot show that the distribution of comb gnome masses at time 50 is skewed to the right, with many individuals of low mass and a few very heavy ones.

Some of the more useful parameters to qplot() are xlab, ylab, xlim, ylim, and main. They allow you to set the title of the graph (main), the x and y axis labels (xlab and ylab), and adjust the scales (xlim and ylim). In this example, changing the scales makes little sense, but it is sometimes useful. (These are the same parameters as you would use with the base R plot() function, by the way.)

`qplot(x=mass0, y=mass50, data=data, xlim=c(0,350), ylim=c(0,10000), xlab="mass at birth", ylab="mass at age 50", main="Comb gnome mass")`

Some style options are very easy to change, and can help visualise the structure of the data. Here, we use the colour of the dots for treatment, and the shape for group.

`qplot(x=mass0, y=mass50, colour=treatment, shape=group, data=data)`

This plot, as well as the jittered dotplot above, makes it perfectly clear that there is a difference in growth between comb gnomes that are exposed to pixies and those that are not, but also huge variation within the group of exposed comb gnomes. We can also discern the difference in initial mass between green and pink comb gnomes.

### 5. Reshaping data, and some more plots

Now, let’s take the opportunity to introduce another wonderful package by Hadley Wickham: reshape2. It is best described with an example. reshape2 should have been installed with ggplot2. Load it like so:

`library(reshape2)`

The reshape2 package centers around the concept of ‘melting’ and ‘casting’ a data frame (or array). First, recall the form our data frame is in:

`head(data)`
```
  group treatment    mass0   mass10    mass25    mass50 id
1 green   control 180.9564 217.1795  285.5527  450.6075  1
2 green    pixies 140.1157 279.0560  784.3293 4390.4606  2
3 green   control 119.0070 125.7097  136.4782  156.5143  3
4 green    pixies 220.7838 366.5834  784.2983 2786.0915  4
5 green    pixies 210.7911 430.9156 1259.5120 7525.7961  5
6 green   control 200.7478 249.3120  345.0496  593.0786  6
```

Then, look at this type of table:

`head(melt(data))`
```
  group treatment variable    value
1 green   control    mass0 180.9564
2 green    pixies    mass0 140.1157
3 green   control    mass0 119.0070
4 green    pixies    mass0 220.7838
5 green    pixies    mass0 210.7911
6 green   control    mass0 200.7478
```

The second format is what reshape2 and ggplot2 call ‘melted’ — each row is one value with a few indicator variables (id variables) that tell you which groups a data point belongs to. The ‘variable’ column contains the names of the columns of data — in this case masses at time 0, 10, 25, and 50, as well as id, which reshape2 guessed is one of the data columns; we’ll have to fix that below — and ‘value’ their values.

Imagine making a multiple regression — this is a bit like the format you would want then. The value would be the response, and the id variables would be predictors. This is very useful not only for regression modelling, but also for plots. Look at this one, for instance, which comes quite naturally once the data have been melted.

`melted <- melt(data, id.vars=c("id", "group", "treatment"))`
`qplot(x=variable, y=value, color=treatment, geom="jitter", data=melted)`

We could of course show them as boxplots instead, with geom="boxplot". If time was really a categorical variable with four levels, that would be great. I don’t know about you, however, but I would like to see individual growth trajectories of the comb gnomes. To accomplish this, we make a new column in the melted data frame.

`melted\$time <- 0`

It’s about time to introduce the which() function. which() gives you the indices of the elements of a vector that satisfy some logical expression. We get the indices of elements that belong to mass10, mass25 and mass50 and use them to assign the proper number to those elements of the time column.

(Note the "==" here, used to denote equality in logical expressions. The logical equals sign must be double, or bad things will happen — i.e. R will understand the expression as an assignment rather than a comparison.)

```
melted$time[which(melted$variable=="mass10")] <- 10
melted$time[which(melted$variable=="mass25")] <- 25
melted$time[which(melted$variable=="mass50")] <- 50
```

Since the comb gnome ids are numbers, R imported them as a numeric column. We will need the id column shortly, so let’s make sure it is treated as categorical:

`melted\$id <- factor(melted\$id)`

We’re almost there. Looking at previous plots, comb gnome growth looks a lot like an exponential function. So let’s try taking the natural logarithm. If you don’t want to change your data frame, functions can be applied in the qplot() call:

`qplot(x=time, y=log(value), geom="line", color=id, data=melted)`

This is a busy plot, and the legend doesn’t help. You can remove it with the following line. (This is a taste of the finer details of ggplot2. Actually, the output of qplot() is an object that can be manipulated with the addition operator.)

`qplot(x=time, y=log(value), geom="line", color=id, data=melted) + theme(legend.position="none")`

Anyway, there seems to be method to the madness — the growth of each comb gnome appears roughly linear on a log scale.

We’re done with that for now. I hope this gave a taste of the usefulness of melting data. Even better, once you have melted data, you can ‘cast’ it into some other tabular format. This is particularly useful for data that are provided in a ‘melted’ format, when you really want a cross-table.

We do this with dcast(). The ‘d’ stands for data frame. (There is also an acast() for arrays.) dcast() needs a melted data frame and a formula, consisting of the variables you want in rows, a tilde ("~"), and the variables you want in columns. Try these:

```
head(dcast(melted, variable ~ id))
head(dcast(melted, id + group + treatment ~ variable))
```

In cases where the variables on one side do not uniquely identify data points, you can use dcast() for data summaries, by choosing some appropriate aggregation function:

```
dcast(melted, group + treatment ~ variable, fun.aggregate=mean)
dcast(melted, group + treatment ~ variable, fun.aggregate=sum)
```

Overall, reshape2 can save you a lot of time and headache, even though it’s not always completely intuitive. It’s not something you will use every day, but keep it in mind when the problem arises.

### 6. Saving data

Before we move any further, let’s talk about saving data:

1. Saving graphics. Depending on what platform you work on, there will be different interface features to save the plot you’re looking at. In the Windows R interface, you can right-click to get a menu that allows you to copy the graph. On Mac OS X, you can save the graph with an option in the menu bar. In RStudio, you can press the export button above the plot area. On Windows and Unix, you can write

`savePlot(file="plot.png", type="png")`

with several image types available. See the help file for details. On any platform, you can redirect the graphics from the screen to a file, run your plot commands, and then send control back to the screen:

```
pdf("some_filename.pdf")
qplot(x=mass10, y=mass25, data=data)
dev.off()
```
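
Since we’re making the plots with ggplot2 anyway, its ggsave( ) function is another option: it saves the last plot drawn (or a given plot object) and guesses the file type from the extension. A sketch:

```
qplot(x=mass10, y=mass25, data=data)
ggsave("some_filename.png")  # the filename is just an example
```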

2. Saving tables. If you have modified a data frame and want to save it to a file (for instance, you might like to use R to melt or cast a table for some other software that needs a particular format), you can use write.table() or write.csv().

`write.csv(melted, file="melted_data.csv", row.names=F, quote=F)`

3. Saving R objects. This is useful when you want to switch between scripts and sessions. Just write

`save(data, melted, file="comb.gnome.data.Rdata")`

to save the content of variables ”data” and ”melted”. You can list several variables or just one. If you want to continue with the next part, save the melted data to a file like that. When you reopen R, you can load them with:

`load("comb.gnome.data.Rdata")`

4. Saving the entire workspace. You can also save the entire contents of your workspace: all the variables you currently have assigned and the history of commands that you’ve given. But frankly, I don’t recommend it. Saving your workspace means that you can pick up exactly where you left off, but the workspace quickly turns into a mess of forgotten variables, and it’s very difficult to retrace the train of thought in an unannotated command history.

I much prefer saving selected variables to files, and making each analysis a self-contained script. In the next part, I’ll try to show you how. Now that we’ve covered a few useful things one can do with R, it is time to have a look at how to organise R code into simple scripts, and introduce Sweave.
