There is grandeur in this view of life

martins bioblogg


A slightly different introduction to R, part III


I think you’ve noticed by now that a normal interactive R session is quite messy. If you don’t believe me, try playing around for a while and then give the history() command, which will show you the commands you’ve typed. If you’re anything like me, a lot of them are malformed attempts that generated some kind of error message.

Hence, even for quite simple tasks, keeping some kind of notebook of the R commands you’ve used is simply a must. The easiest way to keep track of what you do is probably to copy parts of your history to a text document. However, I strongly recommend that you put in a little extra effort. This part will introduce R scripts and Sweave, which is nowadays integrated into base R. Using Sweave is a little more work than a script file, but often a lot better, since it gathers the output you generate (including plots) and formats it into a pretty neat report. Even if Sweave isn’t your thing, I suggest that you at least try out the scripting approach. In the end, it is pretty much guaranteed to save time and decrease (though not completely abolish) frustration.

7. Scripting

An R script is nothing more than a text file of commands. That’s it. Simply write the commands in your favourite plain text editor, save the file — scripts usually have the file extension .R — and then give the following R command:

source("some_script.R")

R will silently perform the commands in the specified order. If anything goes wrong, an error message will appear. To write a script from within Rstudio, select File > New > R Script. An editor area will open from which you can directly run single lines or the whole script.
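To make "if anything goes wrong" concrete, here is a minimal sketch. The throwaway script, its contents and the variable names are all made up for illustration. It shows that source() stops at the first error, so any commands after it never run.

```r
## Write a throwaway three-line script to a temporary file (hypothetical content).
script <- tempfile(fileext = ".R")
writeLines(c(
  "a <- 1",
  "stop('something went wrong')",
  "b <- 2"
), script)

## Run it in its own environment so we can inspect what was created.
env <- new.env()
msg <- tryCatch(
  source(script, local = env),
  error = function(e) conditionMessage(e)
)

print(msg)                                          # the error message from the script
print(exists("a", envir = env, inherits = FALSE))   # a was assigned before the error
print(exists("b", envir = env, inherits = FALSE))   # b was never reached
```

In everyday use you would not wrap source() in tryCatch(); it is only there so that the sketch can carry on and show what was left behind.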

[Image: rstudio_script]

This is an example of an R script to run some descriptive stats on the comb gnome data set:

## An example script for the R introduction blog.
## Will print some summary statistics from the comb gnome data.
data <- read.csv("comb_gnome_data.csv")

print(mean(subset(data, group=="pink")$mass0))
print(sd(subset(data, group=="pink")$mass0))
print(mean(data$mass0))
print(sd(data$mass0))

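One difference from entering the commands interactively is that results are not printed automatically. When a script is run with source(), the value of an expression is only shown if you set the echo parameter to TRUE, in which case R also echoes each command as it runs:

source("some_script.R", echo=TRUE)

That is why the script above uses the print() function: it explicitly tells R to print the result of that particular expression, regardless of the echo setting.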
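A small sketch of the difference, using a made-up script written to a temporary file. capture.output() is only there to collect what each call would have printed, so that the two runs can be compared side by side.

```r
## Hypothetical throwaway script: one bare expression, one explicit print().
script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(2, 4, 6)",
  "mean(x)",
  "print(sd(x))"
), script)

## Plain source(): only the explicit print() produces output.
silent <- capture.output(source(script))

## echo = TRUE: the commands are echoed and the bare mean(x) is shown too.
loud <- capture.output(source(script, echo = TRUE))

cat("Without echo:\n", silent, sep = "\n")
cat("\nWith echo:\n", loud, sep = "\n")
```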

Also, lines that begin with ”##” are comments. If you are using a good geeky text editor (such as the editor in Rstudio), it will understand R syntax and highlight comments, as well as keywords and special functions such as ”library”, in some stand-out colour.

R doesn’t care about blank lines or extra spaces, so use them liberally to break up your code into blocks of related commands.

Let’s do an example with a plot:

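## Read the data and reshape it to long format, one row per measurement.
data <- read.csv("comb_gnome_data.csv")
library(reshape2)
melted <- melt(data, id.vars=c("id", "group", "treatment"))

## Build a jitter plot and store it in p; assigning it does not draw it.
library(ggplot2)
p <- qplot(x=variable, y=value, color=treatment, geom="jitter", data=melted)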

Here we can make use of the fact that qplot returns the plot as an R object. We can give each plot a name and later recall it, by typing the name at the prompt or calling print() on it, to look at it.
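As a rough sketch of what storing and recalling plots can look like: the data frame below is invented (the treatment labels in particular are placeholders), and writing the plot to a PDF with ggsave() is my addition, not something the script above does.

```r
library(ggplot2)

## Made-up measurements standing in for the melted comb gnome data.
toy <- data.frame(
  variable  = rep(c("mass0", "mass50"), each = 4),
  value     = c(10, 12, 11, 13, 20, 24, 22, 26),
  treatment = rep(c("control", "treated"), times = 4)
)

## Store two plots under different names; nothing is drawn yet.
p_jitter <- qplot(x = variable, y = value, color = treatment,
                  geom = "jitter", data = toy)
p_box    <- qplot(x = variable, y = value, geom = "boxplot", data = toy)

## Recall a plot by printing it, or write it to a file with ggsave().
print(p_jitter)
out_file <- file.path(tempdir(), "jitter.pdf")
ggsave(out_file, plot = p_jitter, width = 5, height = 4)
```

Recent versions of ggplot2 mark qplot() as deprecated in favour of ggplot(), so you may see a notice about that; the plot objects behave the same way either way.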

In the above script I repeated myself a little, which is less than ideal. For a large project, you would probably want some script to process the data and save them in a nice object, and then separate scripts for data analysis, making graphics etc.

You can of course use source() in a script, like so:

source("data_preparation.R")
qplot(x=variable, y=value, color=treatment, geom="jitter", data=melted)

I suggest, as much as possible, making sure each script can run from start to finish without user input, and beginning each script file with a little comment that explains what it does, what data files are needed and what output the script produces.

If you have to step in and manually change the order of some columns (or a similar mundane task) before making that last plot, that’s not such a big deal the first couple of times. But if you need to revisit the analysis two months later to change some detail, you’ve probably forgotten all about that. Do your future self a big favour and script every part of the process!
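Here is one way the advice in this section might look when put together. Everything in it is an assumption made for the sake of the sketch: the file names, the toy data and the choice of saveRDS() and readRDS() for passing the prepared data between scripts. It runs in a temporary directory so that it doesn’t touch any real files.

```r
## Work in a scratch directory so the sketch does not touch real files.
old_wd <- setwd(tempdir())

## Made-up stand-in for comb_gnome_data.csv.
write.csv(data.frame(id = 1:4,
                     group = c("pink", "pink", "green", "green"),
                     mass0 = c(10, 12, 20, 24)),
          "toy_gnome_data.csv", row.names = FALSE)

## Script 1: data preparation, starting with a header comment.
writeLines(c(
  "## prepare_data.R",
  "## Reads the raw data and saves it as an R object.",
  "## Input:  toy_gnome_data.csv",
  "## Output: prepared_data.rds",
  "gnomes <- read.csv('toy_gnome_data.csv')",
  "saveRDS(gnomes, 'prepared_data.rds')"
), "prepare_data.R")

## Script 2: analysis, which only depends on the prepared object.
writeLines(c(
  "## analyse_data.R",
  "## Input:  prepared_data.rds (made by prepare_data.R)",
  "## Output: prints the mean mass0 of the pink group",
  "gnomes <- readRDS('prepared_data.rds')",
  "pink_mean <- mean(subset(gnomes, group == 'pink')$mass0)",
  "print(pink_mean)"
), "analyse_data.R")

## Run everything from start to finish, with no manual steps in between.
source("prepare_data.R")
source("analyse_data.R")

setwd(old_wd)
```

The header comments cost a minute to write and tell your future self which files each script needs and what it leaves behind.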

8. Sweaving

Sweave is a so-called literate programming tool for R. That sounds very fancy, but when analysing data with R, it’s simply an easy way to organise code and output. You write the R code in a special type of file, and when it is run, Sweave will gather all the output, including graphics, and organise it into a LaTeX document. Latex (I’m not going to stick to that silly capitalisation, sorry) is free-as-in-freedom typesetting software, essentially a word processor where you write code instead of working in a graphical interface. It’s commonly used by technical folks for journal articles, reports, documentation and what not.

Latex is great, but sometimes not so intuitive. Overall, though, I think both Microsoft Word and Open Office have caused me more frustration than Latex. It can do a lot of stuff; for instance, it’s particularly good with mathematical formulae. However, unless you want to make reports for publishing, you’ll probably not have to learn that much Latex. And as usual, most of the time you can just type any error message into a search engine to get help. (Two common problems are using reserved special characters, such as ”_”, in the text, and making code blocks for graphics output but forgetting to output any graphics. See below.)

In Rstudio, choose File > New > R Sweave. A few Latex commands are inserted from the start; this is about the bare minimum needed to make a Latex document. Move the cursor below the ”\begin{document}” line and start writing some text. When you want to enter code, preface it with ”<<>>=” on its own line, and end the code with a ”@” on its own line:

\documentclass{article}
\begin{document}
\SweaveOpts{concordance=TRUE}

This is a Latex document. Text goes here.

<<>>=
data <- read.csv("comb_gnome_data.csv")
library(reshape2)
melted <- melt(data, id.vars=c("id", "group", "treatment"))
@

\end{document}

When you run the Sweave file, R will execute the code blocks just like a regular script file. The output will be captured and put into a Latex document together with the text you wrote around the code blocks.

[Image: rstudio_sweave]

Sweave files usually end in .Rnw. Running a Sweave file is almost the same as running a script:

Sweave("some_sweave_file.Rnw")

After Sweaving, R helpfully suggests that you run Latex on the .tex file that it has created. If you work in a terminal in Mac OS X, some GNU/Linux distribution or similar, this translates to ”latex some_sweave_file.tex” or ”pdflatex some_sweave_file.tex”. In Rstudio, you can press the Compile PDF button instead. Either way, you will need to have Latex installed.
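If you want to try the whole round trip without Rstudio, the following sketch writes a tiny Sweave file from R, runs Sweave() on it and looks at the resulting .tex file. The file name and its contents are invented for the example, and the sketch works in a temporary directory because Sweave() writes its output to the current working directory.

```r
## Work in a scratch directory; Sweave() writes the .tex file to the working directory.
old_wd <- setwd(tempdir())

## A made-up minimal Sweave file with a single code chunk.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Some text before the code.",
  "<<>>=",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "@",
  "\\end{document}"
), "mini_report.Rnw")

## Run the chunks and weave code, output and text into mini_report.tex.
Sweave("mini_report.Rnw")
tex <- readLines("mini_report.tex")
cat(tex, sep = "\n")

## With Latex installed, the next step could be done from R as well:
## tools::texi2pdf("mini_report.tex")

setwd(old_wd)
```

The printed .tex file contains both the echoed commands and their output, wrapped in the environments that Sweave uses for code.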

When things go wrong, Sweave will terminate and display the error message as usual. It also tells you in which code block the error occurred. The blocks aren’t numbered in the .Rnw file, but with the Stangle() function, you can write out an R file where the chunks are numbered. You can open it and search for the part where the error occurred. Of course, you can also run it as a script file.
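A quick sketch of what Stangle() gives you, again with an invented two-chunk file in a temporary directory. Stangle() does not run the code; it only extracts it into an .R file with a numbered comment header above each chunk.

```r
## Work in a scratch directory; Stangle() writes the .R file to the working directory.
old_wd <- setwd(tempdir())

## A made-up Sweave file with two code chunks.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<>>=",
  "x <- c(2, 4, 6)",
  "@",
  "Some text between the chunks.",
  "<<>>=",
  "mean(x)",
  "@",
  "\\end{document}"
), "two_chunks.Rnw")

## Extract the code only; this writes two_chunks.R.
Stangle("two_chunks.Rnw")
tangled <- readLines("two_chunks.R")
cat(tangled, sep = "\n")

setwd(old_wd)
```

If Sweave complains about chunk 2, searching the tangled file for ”code chunk number 2” takes you straight to the offending code.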

As mentioned above, a code block can output graphics and insert them into the report. To specify a code block with graphics output, use ”<<fig=T>>=”. In the case of ggplot2 graphics, which we have used in this introduction, you will have to use the print() function on them for them to be displayed.

<<fig=T>>=
print(qplot(x=variable, y=value, color=treatment, geom="jitter", data=melted))
@

In the text between the code blocks, you can use any Latex formatting. Rstudio can help you with some, if you press the Format button.

Finally, I put the above blocks together into this Sweave file, and got this output pdf. Neat, isn’t it?

That was two ways to organise R code. I still think it’s fine, even advisable, to spend a lot of time in interactive R trying stuff out. But at the end of the session, a script or Sweave file that runs without manual tinkering should be the goal. I usually go for a Sweave file for any piece of analysis that will output plots, tables, model calculations etc, but make scripts that I can source() for more general things that I will reuse.

Okay, enough with my preaching. Next time, we’ll try some actual statistics.

Written by mrtnj

2 February, 2013 at 21:42

Posted in data analysis, english
