## Archive for the ‘**data analysis**’ Category

## More fun with %.% and %>%

The %.% operator in dplyr allows one to put functions together without lots of nested parentheses. The flanking percent signs are R’s way of denoting infix operators; you might have used %in% which corresponds to the match function or %*% which is matrix multiplication. The %.% operator is also called chain, and what it does is rearrange the call to pass its left hand side on as a parameter to the right hand side function. As noted in the documentation this makes function calls read from left to right instead of inside and out. Yesterday we we took a simulated data frame, called data, and calculated some summary statistics. We could put the entire script together with %.%:

library(dplyr) data %.% melt(id.vars=c("treatment", "sex")) %.% group_by(sex, treatment, variable) %.% summarise(mean(value))

I haven’t figured out what would be the best indentation here, but I think this looks pretty okay. Of course it works for non-dplyr functions as well, but they need to take the input data as their first argument.

data %.% lm(formula=response1 ~ factor(sex)) %.% summary()

As mentioned, dplyr is not the only package that has something like this, and according to a comment from Hadley Wickham, future dplyr will use the magrittr package instead, a package that adds piping to R. So let’s look at magrittr! The magrittr %>% operator works much the same way, except it allows one to put ”.” where the data is supposed to go. This means that the data doesn’t have to be the first argument to the function. For example, we can do this, which would give an error with dplyr:

library(magrittr) data %>% lm(response1 ~ factor(sex), .) %>% summary()

Moreover, Conrad Rudolph has used the operators %.%, %|>% and %|% in his own package for functional composition, chaining and piping. And I’m sure he is not the only one; there are several more packages that bring more new ways to define and combine functions into R. I hope I will revisit this topic when I’ve gotten used to it and decided what I like and don’t like. This might be confusing for a while with similar and rather cryptic operators that do slightly different things, but I’m sure it will turn out to be a useful development.

## Using R: quickly calculating summary statistics (with dplyr)

I know I’m on about Hadley Wickham‘s packages a lot. I’m not the president of his fanclub, but if there is one I’d certainly like to be a member. dplyr is going to be a new and improved ddply: a package that applies functions to, and does other things to, data frames. It is also faster and will work with other ways of storing data, such as R’s relational database connectors. I use plyr all the time, and obviously I want to start playing with dplyr, so I’m going to repeat yesterday’s little exercise with dplyr. Readers should be warned: this is really just me playing with dplyr, so the example will not be particularly profound. The post at the Rstudio blog that I just linked contains much more information.

So, here comes the code to do the thing we did yesterday but with dplyr:

## The code for the toy data is exactly the same data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)), treatment = rep(c(1, 2), 1000), response1 = rnorm(2000, 0, 1), response2 = rnorm(2000, 0, 1)) ## reshape2 still does its thing: library(reshape2) melted <- melt(data, id.vars=c("sex", "treatment")) ## This part is new: library(dplyr) grouped <- group_by(melted, sex, treatment) summarise(grouped, mean=mean(value), sd=sd(value))

When we used plyr yesterday all was done with one function call. Today it is two: dplyr has a separate function for splitting the data frame into groups. It is called group_by and returns the grouped data. Note that no quotation marks or concatenation were used when passing the column names. This is what it looks like if we print it:

Source: local data frame [4,000 x 4] Groups: sex, treatment, variable sex treatment variable value 1 1 1 response1 -0.15668214 2 1 2 response1 -0.40934759 3 1 1 response1 0.07103731 4 1 2 response1 0.15113270 5 1 1 response1 0.30836910 6 1 2 response1 -1.41891407 7 1 1 response1 -0.07390246 8 1 2 response1 -1.34509686 9 1 1 response1 1.97215697 10 1 2 response1 -0.08145883

The grouped data is still a data frame, but it contains a bunch of attributes that contain information about grouping.

The next function is a call to the summarise function. This is a new version of a summarise function similar to one in plyr. It will summarise the grouped data in columns given by the expressions you feed it. Here, we calculate mean and standard deviation of the values.

Source: local data frame [8 x 5] Groups: sex, treatment sex treatment variable mean sd 1 1 1 response1 0.021856280 1.0124371 2 1 1 response2 0.045928150 1.0151670 3 1 2 response1 -0.065017971 0.9825428 4 1 2 response2 0.011512867 0.9463053 5 2 1 response1 -0.005374208 1.0095468 6 2 1 response2 -0.051699624 1.0154782 7 2 2 response1 0.046622111 0.9848043 8 2 2 response2 -0.055257295 1.0134786

Maybe the new syntax is slightly clearer. Of course, there are alternative ways of expressing it, one of which is pretty interesting. Here are two equivalent versions of the dplyr calls:

summarise(group_by(melted, sex, treatment, variable), mean=mean(value), sd=sd(value)) melted %.% group_by(sex, treatment, variable) %.% summarise(mean=mean(value), sd=sd(value))

The first one is nothing special: we’ve just put the group_by call into summarise. The second version, though, is a strange creature. dplyr uses the operator %.% to denote taking what is on the left and putting it into the function on the right. Reading from the beginning of the expression we take the data (melted), push it through group_by and pass it to summarise. The other arguments to the functions are given as usual. This may seem very alien if you’re used to R syntax, or you might recognize it from shell pipes. This is not the only attempt make R code less nested and full of parentheses. There doesn’t seem to be any consensus yet, but I’m looking forward to a future where we can write points-free R.

## Using R: quickly calculating summary statistics from a data frame

A colleague asked: I have a lot of data in a table and I’d like to pull out some summary statistics for different subgroups. Can R do this for me quickly?

Yes, there are several pretty convenient ways. I wrote about this in the recent post on the barplot, but as this is an important part of quickly getting something useful out of R, just like importing data, I’ll break it out into a post of its own. I will present a solution that uses the plyr and reshape2 packages. You can do the same with base R, and there’s nothing wrong with base R, but I find that plyr and reshape2 makes things convenient and easy to remember. The apply family of functions in base R does the same job as plyr, but with a slightly different interface. I strongly recommend beginners to begin with plyr or the apply functions, and not what I did initially, which was nested for loops and hard bracket indexing.

We’ll go through and see what the different parts do. First, simulate some data. Again, when you do this, you usually have a table already, and you can ignore the simulation code. Usually a well formed data frame will look something this: a table where each observation is a unit such as an individual, and each column gives the data about the individual. Here, we imagine two binary predictors (sex and treatment) and two continuous response variables.

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)), treatment = rep(c(1, 2), 1000), response1 = rnorm(2000, 0, 1), response2 = rnorm(2000, 0, 1)) head(data)

sex treatment response1 response2 1 1 1 -0.15668214 -0.13663012 2 1 2 -0.40934759 -0.07220426 3 1 1 0.07103731 -2.60549018 4 1 2 0.15113270 1.81803178 5 1 1 0.30836910 0.32596016 6 1 2 -1.41891407 1.12561812

Now, calculating a function of the response in some group is straightforward. Most R functions are vectorised by default and will accept a vector (that is, a column of a data frame). The subset function lets us pull out rows from the data frame based on a logical expression using the column names. Say that we want mean, standard deviation and a simple standard error of the mean. I will assume that we have no missing values. If you have, you can add na.rm=T to the function calls. And again, if you’ve got a more sophisticated model, these might not be the standard errors you want. Then pull them from the fitted model instead.

mean(subset(data, sex == 1 & treatment == 1)$response1) sd(subset(data, sex == 1 & treatment == 1)$response1) sd(subset(data, sex == 1 & treatment == 1)$response1)/ sqrt(nrow(subset(data, sex == 1 & treatment == 1)))

Okay, but doing this for each combination of the predictors and responses is no fun and requires a lot of copying and pasting. Also, the above function calls are pretty messy with lots of repetition. There is a better way, and that’s where plyr and reshape2 come in. We load the packages. The first time you’ll have to run install.packages, as usual.

library(plyr) library(reshape2)

First out, the melt function from rehape2. Look at the table above. It’s reasonable in many situations, but right now, it would be better if we put both the response variables in the same column. If it doesn’t seem so useful, trust me and see below. Melt will take all the columns except the ones we single out as id variables and put them in the same column. It makes sense to label each row with the sex and treatment of the individual. If we had an actual unit id column, it would go here as well:

melted <- melt(data, id.vars=c("sex", "treatment"))

The resulting ”melted” table looks like this. Instead of the response variables separately we get a column of values and a column indicating which variable the value comes from.

sex treatment variable value 1 1 1 response1 -0.15668214 2 1 2 response1 -0.40934759 3 1 1 response1 0.07103731 4 1 2 response1 0.15113270 5 1 1 response1 0.30836910 6 1 2 response1 -1.41891407

Now it’s time to calculate the summary statistics again. We will use the same functions as above to do the actual calculations, but we’ll use plyr to automatically apply them to all the subsets we’re interested in. This is sometimes called the split-apply-combine approach: plyr will split the data frame into subsets, apply the function of our choice, and then collect the results for us. The first thing to notice is the function name. All the main plyr functions are called something with -ply. The letters stand for the input and return data type: ddply works on a data frame and returns a data frame. It’s probably the most important member of the family.

The arguments to ddply are the data frame to work on (melted), a vector of the column names to split on, and a function. The arguments after the function name are passed on to the function. Here we want to split in subsets for each sex, treatment and response variable. The function we apply is summarise, which makes a new data frame with named columns based on formulas, allowing us to use the column names of the input data frame in formulas. In effect it does exactly what the name says, summarises a data frame. And in this instance, we want to calculate the mean, standard deviation and standard error of the mean, so we use the above function calls, using value as the input. Run the ddply call, and we’re done!

ddply(melted, c("sex", "treatment", "variable"), summarise, mean = mean(value), sd = sd(value), sem = sd(value)/sqrt(length(value)))

sex treatment variable mean sd sem 1 1 1 response1 0.021856280 1.0124371 0.04527757 2 1 1 response2 0.045928150 1.0151670 0.04539965 3 1 2 response1 -0.065017971 0.9825428 0.04394065 4 1 2 response2 0.011512867 0.9463053 0.04232006 5 2 1 response1 -0.005374208 1.0095468 0.04514830 6 2 1 response2 -0.051699624 1.0154782 0.04541357 7 2 2 response1 0.046622111 0.9848043 0.04404179 8 2 2 response2 -0.055257295 1.0134786 0.04532414

## Morning coffee: scripting language

Several people have asked: what scripting language should biologists learn if they are interested in doing a little larger-scale data analysis and have never programmed before? I’m not an expert, but these are the kinds of things I tend to say:

The language is not so important; the same principles apply everywhere. Use what your friends and colleagues use so you can get help from them. I believe most people would answer Python. I would answer R. Don’t believe people who tell you that R is not a serious language. You’re already familiar with analysing small datasets in a statistics program. You can do that in R too, and then the step to writing code and handling larger projects is actually very short. Your data will very likely come in tables, and R is very good at that. You’ll also want pretty graphs, and R is very good at that too. Regardless, have a look at the other common languages as well. Practice working from a terminal.

## Using R: common errors in table import

I’ve written before about importing tabular text files into R, and here comes some more. This is because I believe (firmly) that importing data is the major challenge for beginners who want to analyse their data in R. What is the most important thing about using any statistics software? To get your data into it in the first place! Unfortunately, no two datasets are the same and many frustrations await the beginner. Therefore, let me share a few things that I’ve picked up while trying to read data into R.

General tip: keep it as simple as possible. Open your favourite spreadsheet software, remove all the colours and formatting, make sure to have only one header row with short and simple column names (avoid any special characters; R will turn them to full stops), remove empty columns above and to the left of the table and export the sheet in to plain text, tab separated or comma separated. Then use either read.table, read.csv or the Import dataset button in RStudio to read your table, and in case of doubt, begin with the default settings, which are often sensible. If you see any annoying data import errors, below are a few possible causes.

**If you don’t get the number of columns you expect**

Then it’s often a separator problem. read.table allows you to set the filed separator character (with sep=) and the decimal separator (with dec=). To get a tab character, use ”\t”:

data <- read.table(file="data.txt", sep="\t", dec=",")

In many countries this is not an issue, but the Swedish standard is using a comma as decimal separator, while R uses a decimal point. If you export a comma separated value file on a Swedish computer, you are likely to get a numbers with decimal commas and semicolons as field separators. Then what you want is read.csv2( ), which is a read.csv made for semicolon separated files with decimal commas.

As an additional benefit of using it, RStudio solves those issues for you with the Import dataset button. The dialog box that appears when you click it lets you choose separators, header line (yes/no) and whether there are quotes around fields, and shows a preview of what the table will look like. Then it builds the read.table command for you, so that you can copy it into your script. If you’re not an RStudio user, just look at the actual file in a text editor, and add dec= and sep= instructions to your read.table as needed.

If you see a lot more columns than you expect, usually after the ones you expect, filled with NAs, it’s probably because you’ve happened to enter some character (very likely a whitespace …) somewhere to the right in the spreadsheet. When exported, the software will fill in all the empty cells, which R will interpret as missing values. If this happens, make sure to delete the contents of the spreadsheet after the columns you’re interested in.

Whitespaces can also be a problem if you’ve specified a custom separator, like above. Normally, read.table is very good about skipping over whitespace, should you have happened to put an extra whitespace in before that factor level. That functionality is turned off when you specify a separator, though, so you might need to switch it on again by setting strip.white=T in the function call. So, for instance, ” male” and ”female ” are interpreted as male/female, which is probably what one wants. This cost me several grey hairs and string operations before I came to my senses and read the data import manual.

**If columns are not the type you expect**

If something is a character when it should be numeric you might see messages such as ”‘x’ must be numeric ” or ”non-numeric argument to binary operator”. If something is a factor when it should be character, some character operations might fail. Regardless, it could still be a separator problem. If R expects to see decimal points and sees a comma, it will make that column a character vector rather then a numeric column. The principle is that read.table it will try to turn the column into numeric or integer (or complex or logical, but they are rarer). Most often, this means that if you feed it numbers it will make them numeric. But if there is anything else in any of the elements, it will assume character — and by default it will convert character vector into factors.

There are several reasons why you might have text among the numbers, making R not interpret the column as numeric. We’ve mentioned wrong separators, but sometimes if you paste spreadsheets from different sources together you end up with inconsistent field separators. I’ve written a post about a way to deal with that in some cases, but you can also search and replace the file with a text editor, or use a command-line tool like sed. Missing values can be another problem. read.table understands that empty fields are missing, but maybe you or your collaborator has used another character to mean missing, like ”-” or ”?”. This is dealt with by specifying your own na.strings:

data <- read.table("data.txt", na.strings=c("NA", "-", "?"))

R’s default of interpreting characters as factors is sometimes a good choice, but often what you really want is characters that you can work with and then, if needed, turn into factors. In particular, if your column of sample ids is turned into a factor, comparing and matching ids will sometimes hurt you. To avoid that you can specify stringsAsFactors=F. This works when you make data frames with the data.frame function as well.

data <- read.table("data.txt", stringsAsFactors=F)

**Diagnosing problems with your data frame**

Look at the dimensions of your data frame and the column of the classes. dim gives dimensions; the class function gives the type of a column and the str function will give you a summary of the structure of the object.

dim(data) lapply(data, class) str(data)

Happy importing!

## Using R: correlation heatmap, take 2

Apparently, this turned out to be my most popular post ever. Of course there are lots of things to say about the heatmap (or quilt, tile, guilt plot etc), but what I wrote was literally just a quick celebratory post to commemorate that I’d finally grasped how to combine reshape2 and ggplot2 to quickly make this colourful picture of a correlation matrix.

However, I realised there is one more thing that is really needed, even if just for the first quick plot one makes for oneself: a better scale. The default scale is not the best for correlations, which range from -1 to 1, because it’s hard to tell where zero is. We use the airquality dataset for illustration as it actually has some negative correlations. In ggplot2, it’s very easy to get a scale that has a midpoint and a different colour in each direction. It’s called scale_colour_gradient2, and we just need to add it. I also set the limits to -1 and 1, which doesn’t change the colour but fills out the legend for completeness. Done!

data <- airquality[,1:4] library(ggplot2) library(reshape2) qplot(x=Var1, y=Var2, data=melt(cor(data, use="p")), fill=value, geom="tile") + scale_fill_gradient2(limits=c(-1, 1))

## Books and lessons about ggplot2

I recently got an email from a person at Packt publishing, who suggested I write a book for them about ggplot2. My answer, which is perfectly true, is that I don’t have the time, nor the expertise to do that. What I didn’t say is that 1) a quick web search suggests that Packt doesn’t have the best reputation and 2) there are already two books about ggplot2 that I think covers the entire field: the indispensable ggplot2 book, written by Hadley Wickham, the author of the package, and the R Graphics Cookbok by Wincent Chang. There are too many decent but not great R books on the market already and there is no reason for me to spend time to create another one.

However, there are a few things I’d like to tell the novice, in general, about ggplot2:

1. ggplot2 is not necessarily superior to lattice or base graphics. It’s largely a matter of taste, you can make very nice plots with either plotting system and in some situations everybody will turn away from their favourite system and use another. For instance, a lot of built-in diagnostic plots come preprogrammed in base graphics. The great thing about ggplot2 is how it allows the user to think about plots as layers of mappings between variables and geometric objects. Base graphics and lattice are organised in different ways, which is not necessarily worse; it depends on the way you’re accustomed to thinking about statistical graphics.

2. There are two ways to start a plot in ggplot2: qplot( ) or ggplot( ). qplot is the quicker way and probably the one you should learn first. But don’t be afraid of the ggplot function! All it does is set up an empty plot and connect a data frame to it. After that you can add your layers and map your variables almost the same way you would do with qplot. After you’ve become comfortable with qplot, just try building plots with ggplot a few times, and you’ll see how similar it is.

3. The magic is not in plotting the data but in tidying and rearranging the data for plotting. Make sure to put all your labels, indicators and multiple series of data into the same data frame (most of the time: just cbind or merge them together), melt the data frame and pass it to ggplot2. If you want to layer predictions on top of your data, put output from your model in another data frame. It is perfectly possible, often advisable, to write functions that generate ggplot2 plots, but make sure to always create the data frame to be plotted first and then pass it on. I suggest not trying to create the mappings, that is x= and y= and the like, on the fly. There is always the risk of messing up the order of the vectors, and also, because ggplot2 uses metaprogramming techniques for the aesthetics, you might see unexpected behaviours when putting function calls into the mapping.

4. Worry about mapping variables and facetting first and then change the formatting. Because of how plots as well as settings in ggplot2 are objects that you can store and pass around, you can first create a raw version of the plot, and then just add (yes add, with the ”+” operator) the formatting options you want. So taken together, the workflow for making any ggplot2 plot goes something like this: 1) put your data frame in order; 2) set up a basic plot with the qplot or ggplot function; 3) add one extra geom at at time (optional if you make a simple plot with qplot, since qplot sets up the first geometry for you); 4) add the settings needed to make the plot look good.

Happy ggplotting!

## Using R: Coloured sizeplot with ggplot2

Someone asked about this and I though the solution with ggplot2 was pretty neat. Imagine that you have a scatterplot with some points in the exact same coordinates, and to reduce overplotting you want to have the size of the dot indicating the number of data points that fall on it. At the same time you want to colour the points according to some categorical variable.

The sizeplot function in the plotrix package makes this type of scatterplot. However, it doesn’t do the colouring easily. I’m sure it’s quite possible with a better knowledge of base graphics, but I tend to prefer ggplot2. To construct the same type of plot we need to count the data points. For this, I use table( ), and then melt the contingency table and remove the zeroes.

library(ggplot2) library(reshape2) data <- data.frame(x=c(0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4), y=c(0, 0, 0, 3, 1, 1, 1, 2, 2, 1, 4, 4), group=c(rep(1, 6), rep(2, 4), rep(3, 2))) counts <- melt(table(data[1:2])) colnames(counts) <- c(colnames(data)[1:2], "count") counts <- subset(counts, count != 0) sizeplot <- qplot(x=x, y=y, size=count, data=counts) + scale_size(range=c(5, 10))

This is the first sizeplot. (The original scale makes single points very tiny. Hence the custom scale for size. Play with the range values to taste!) To add colour, we merge the counts with the original data to get back the group information — and, in true ggplot2 fashion, map the group variable to colour.

counts.and.groups <- merge(counts, unique(data)) sizeplot.colour <- qplot(x=x, y=y, size=count, colour=factor(group), data=counts.and.groups) + scale_size(range=c(5, 10))

One thing that this simple script does not handle well is if points that should have different colour happen to overlap. (As it stands, this code will actually plot two points both the size of the total number of overlapping points in different colours on top of each other. That must be wrong in several ways.) However, I don’t know what would be the best behaviour in this instance. Maybe to count the number of overlaps separately and plot both points while adding some transparency to the points?

## A slightly different introduction to R, part V: plotting and simulating linear models

In the last episode (which was quite some time ago) we looked into comparisons of means with linear models. This time, let’s visualise some linear models with ggplot2, and practice another useful R skill, namely how to simulate data from known models. While doing this, we’ll learn some more about the layered structure of a ggplot2 plot, and some useful thing about the lm function.

### 11. Using points, lines and error bars to show predictions from linear models

Return to the model of comb gnome mass at time zero. We’ve already plotted the coefficient estimates, but let us just look at them with the coef() function. Here the intercept term is the mean for green comb gnomes subjected to the control treatment. The ‘grouppink’ and ‘treatmentpixies’ coefficients are the mean differences of pink comb gnomes and comb gnomes exposed to pixies from this baseline condition. This way of assigning coefficients is called dummy coding and is the default in R.

model <- lm(mass0 ~ group + treatment, data) coef(model)[1]

(Intercept) grouppink treatmentpixies 141.56771 -49.75414 23.52428

The estimate for a pink comb gnome with pixies is:

coef(model)[1] + coef(model)[2] + coef(model)[3]

There are alternative codings (”contrasts”) that you can use. A common one in Anova is to use the intercept as the grand mean and the coefficients as deviations from the mean. (So that the coefficients for different levels of the same factor sum to zero.) We can get this setting in R by changing the contrasts option, and then rerun the model. However, whether the coefficients are easily interpretable or not, they still lead to the same means, and we can always calculate the values of the combinations of levels that interest us.

Instead of typing in the formulas ourself as above, we can get predictions from the model with the predict( ) function. We need a data frame of the new values to predict, which in this case means one row for each combination of the levels of group and treatment. Since we have too levels each there are only for of them, but in general we can use the expand.grid( ) function to generate all possible factor levels. We’ll then get the predictions and their confidence intervals, and bundle everything together to one handy data frame.

levels <- expand.grid(group=c("green", "pink"), treatment=c("control", "pixies")) predictions <- predict(model, levels, interval="confidence") predicted.data <- cbind(levels, predictions)

group treatment fit lwr upr 1 green control 141.56771 125.82527 157.3101 2 pink control 91.81357 76.48329 107.1439 3 green pixies 165.09199 149.34955 180.8344 4 pink pixies 115.33785 98.93425 131.7414

Now that we have these intervals in a data frame we can plot them just like we would any other values. Back in part II, we put several categorical variables into the same plot by colouring the points. Now, let’s introduce nice feature of ggplot2: making small multiples with faceting. qplot( ) takes facets argument which is a formula where the left hand side, before the tilde (‘~’), will be used to split the plot vertically, and the right hand side will split the plot horizontally. In this case, we split horizontally, each panel representing one level of the treatment variable. Also, we use a new geometry: pointrange, which draws a point with bars above and below it and is quite suitable for the intervals we’ve got.

qplot(x=treatment, facets=~group, y=fit, ymax=upr, ymin=lwr geom="pointrange", data=predicted.data)

That’s good, but combining the predictions from the model and the actual data in the same plot would be nice. In ggplot2, every plot is an object that can be saved away to a variable. Then we can use the addition operator to add layers to the plot. Let’s make a jittered dotplot like the above and then add a layer with the pointrange geometry displaying confidence intervals. The scatter of the data points around the confidence intervals reminds us that there is quite a bit of residual variance. The coefficient of determination, as seen in the summary earlier, was about 0.25.

qplot(x=treatment, y=mass0, facets=~group, geom="jitter", data=data) + geom_pointrange(aes(y=fit, ymax=upr, ymin=lwr), colour="red", data=predicted.data)

In the above, we make use of ggplot2′s more advanced syntax for specifying plots. The addition operator adds layers. The first layer can be set up with qplot(), but the following layers are made with their respective functions. Mapping from variables to features of the plot, called aesthetics, have to be put inside the aes() function. This might look a bit weird in the beginning, but it has its internal logic — all this is described in Hadley Wickham’s ggplot2 book.

We should probably try a regression line as well. The abline geometry allows us to plot a line with given intercept and slope, i.e. the coefficients of a simple regression. Let us simplify a little and look at the mass at time zero and the log-transformed mass at time 50 in only the green group. We make a linear model that uses the same slope for both treatments and a treatment-specific intercept. (Exercise for the reader: look at the coefficients with coef( ) and verify that I’ve pulled out the intercepts and slope correctly.) Finally, we plot the points with qplot and add the lines one layer at the time.

green.data <- subset(data, group=="green") model.green <- lm(log(mass50) ~ mass0 + treatment, green.data) intercept.control <- coef(model.green)[1] intercept.pixies <- coef(model.green)[1]+coef(model.green)[3] qplot(x=mass0, y=log(mass50), colour=treatment, data=green.data) + geom_abline(intercept=intercept.pixies, slope=coef(model.green)[2]) + geom_abline(intercept=intercept.control, slope=coef(model.green)[2])

### 12. Using pseudorandom numbers for sanity checking

There is a short step from playing with regression functions that we’ve fitted, like we did above, to making up hypothetical regression functions and simulating data from them. This type of fake-data simulation is very useful to for testing how designs and estimation procedures behave and check things like the control of false positive rate and the power to accurately estimate a known model.

The model will be the simplest possible: a single categorical predictor with only two levels and normally distributed equal error variance, i.e. a t-test. There is a formula for the power of the t-test and an R function, power.t.test( ), that calculates it for us without the need for simulation. However, a nice thing about R is that we can pretty easily replace the t-test with more complex procedures. Any model fitting process that you can program in R can be bundled into a function and applied to pseudorandom simulated data. In the next episode we will go into how to make functions and apply them repeatedly.

Let us start out with a no effect model: 50 observations in two groups drawn from the same distribution. We use the mean and variance of the green control group. This first part just sets up the variables:

mu <- mean(subset(data, group=="green" & treatment=="control")$mass0) sigma <- sd(subset(data, group=="green" & treatment=="control")$mass0) treatment <- c(rep(1, 50), rep(0, 50))

The rnorm( ) function generates numbers from a normal distribution with specified mean and standard deviation. Apart from drawing numbers from it, R can of course pull out various table values, and it knows other distributions as well. Look at the documentation in ?distributions. Finally we perform a t-test. Most of the time, it should not show a significant effect, but sometimes it will.

sim.null <- rnorm(100, mu, sigma) t.test(sim.null ~ treatment)$p.value

We can use the replicate( ) function to evaluate an expression multiple times. We put the simulation and t-test together into one expression, rinse and repeat. Finally, we check how many of the 1000 replicates gave a p-value below 0.05. Of course, it will be approximately 5% of them.

sim.p <- replicate(1000, t.test(rnorm(100, mu, sigma) ~ treatment)$p.value) length(which(sim.p < 0.05))/1000

[1] 0.047

Let us add an effect! Say we’re interested in an effect that we expect to be approximately half the difference between the green and pink comb gnomes:

d <- mean(subset(data, group=="green" & treatment=="control")$mass0) - mean(subset(data, group=="pink" & treatment=="control")$mass0) sim.p.effect <- replicate(1000, t.test(treatment * d/2 + rnorm(100, mu, sigma) ~ treatment)$p.value) length(which(sim.p.effect < 0.05))/1000

[1] 0.737

We see that with 50 individuals in each group and this effect size we will detect a significant difference about 75% of the time. This is the power of the test. If you are able to find nice and trustworthy prior information about the kind of effect sizes and variances you expect to find in a study, design analysis allows you to calculate for instance how big a sample you need to have good power. Simulation can also give you an idea of how badly a statistical procedure will break if the assumptions don’t hold. We can try to simulate a situation where the variances of the two groups differs quite a bit.

sim.unequal <- replicate(1000, t.test(c(rnorm(50, mu, sigma), rnorm(50, mu, 2*sigma)) ~ treatment)$p.value) length(which(sim.unequal < 0.05))/1000

[1] 0.043

sim.unequal.effect <- replicate(1000, t.test(c(rnorm(50, mu+d/2, sigma), rnorm(50, mu, 2*sigma)) ~ treatment)$p.value) length(which(sim.unequal.effect < 0.05))/1000

[1] 0.373

In conclusion, the significance is still under control, but the power has dropped to about 40%. I hope that has given a small taste of how simulation can help with figuring out what is going on in our favourite statistical procedures. Have fun!