Inlägg taggade ‘R’
Slides and exercise from my second R intro seminar
This week I held the second introductory seminar on R, and I think it went pretty well — though I guess you really should ask my colleagues if you want to know. The first seminar was a lecture, and this seminar was a tutorial where we made some plots and calculated a few of the usual statistics. Of course the only real way to learn R is to play with it, but I hope this couple of hours provided a decent opening to getting started with R.
I actually think RStudio made it quite a bit easier. One important thing that I certainly should’ve stressed more, though, is organising code into scripts. I mentioned it in the first seminar, but I should have included it into the exercise. Maybe the first section should be something like ”Start a new script file for your analysis: Select File > New > R script to open the editor area. Save the script as unicorn_analysis.R, and for the rest of the tutorial, write your code in that window.”
Files from the seminar:
Slides about common statistical functions in R (again, ugly walls of text meant more as notes for future reference than actual slides)
Exercises for the tutorial and the associated dataset
My suggested solutions to the exercises
Slides from my R intro seminar
Here are my slides from a short introductory seminar on R (essentially going through part I of the R tutorial) last week. As magic lantern pictures go, they’re hideously ugly, but they were mostly there for future reference. Most of the seminar was spent showing RStudio. This Friday, we’ll practice some uses of qplot and make some linear models.
(I took out the Modern Major-General quote from the presentation, but you can enjoy him here instead. Don’t ask.)
Using R: reading tables that need a little cleaning
Sometimes one needs to read tables that are a bit messy, so that read.table doesn’t immediately recognize the content as numerical. Maybe some weird characters are sprinkled in the table (ever been given a table with significance stars in otherwise numerical columns?). Some search and replace is needed. You can do this by hand, and I know this is a task for which many people would turn to a one-liner of perl or awk, but I like to do my scripting in R.
Again, this is in the ”trivial of you only think it out” category of problems. But it gives a starting point for more subtle changes to text files that one might have to do from time to time.
Assume the data look something like this (I made this up):
id;col1;col2
id1; 0,2;0.55*
id2; -,1;,23
Thankfully, R’s conversion from numbers to text is pretty good. It will understand, for instance, that both .5 and 0.5 are numbers. If the problem was just an odd decimal separator, like the Swedish decimal comma, we could just change the dec parameter to read.table. In the above, we have mixed decimal separators. Let’s start out by reading the table as character.
character.data <- read.table("some_data.csv", sep=";",
colClasses="character", header=T)
row.names(character.data) <- character.data[,1]
character.data <- character.data[,-1]
This also stores away the sample ids in row names, and removes the first column with the ids. This is optional, but for this example I assume that all columns should be numeric. If that is not the case, the code below can of course be restricted to the numeric columns, and the character (or factor) columns can be added later.
A data.frame is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors. After removing characters that aren’t numbers, decimal separators, or the minus sign, we change the decimal separator, and convert the vector to numeric. Finally, we make the list a data.frame, and propagate the row names.
library(stringr)
library(plyr)
data <- data.frame(llply(character.data, function(x) {
x <- str_replace_all(x, pattern="[^0-9\\.\\,-]", replacement="")
x <- str_replace(x, pattern="\\,", replacement=".")
return(as.numeric(x))
}))
row.names(data) <- row.names(character.data)
Using R: Correlation heatmap with ggplot2
Just a short post to celebrate that I learned today how incredibly easy it is to make a heatmap of correlations with ggplot2 (and reshape2, of course).
data(attitude) library(ggplot2) library(reshape2) qplot(x=Var1, y=Var2, data=melt(cor(attitude)), fill=value, geom="tile")
So, what is going on in that short passage? cor makes a correlation matrix with all the pairwise correlations between variables (twice; plus a diagonal of ones). melt takes the matrix and creates a data frame in long form, each row consisting of id variables Var1 and Var2 and a single value. We then plot with the tile geometry, mapping the indicator variables to rows and columns, and value (i.e. correlations) to the fill colour.
A slightly different introduction to R, part IV
Now, after reading in data, making plots and organising commands with scripts and Sweave, we’re ready to do some numerical data analysis. If you’re following this introduction, you’ve probably been waiting for this moment, but I really think it’s a good idea to start with graphics and scripting before statistical calculations.
We’ll use the silly comb gnome dataset again. If you saved an Rdata file in part II, you can load it with
load("comb_gnome_data.Rdata")
If not, you can run this. Remind yourself what the changes to the melted data mean:
data <- read.csv("comb_gnome_data.csv")
library(reshape2)
melted <- melt(data, id.vars=c("id", "group", "treatment"))
melted$time <- 0
melted$time[which(melted$variable=="mass10"] <- 10
melted$time[which(melted$variable=="mass25")] <- 25
melted$time[which(melted$variable=="mass50")] <- 50
melted$id <- factor(melted$id)
We’ve already looked at some plots and figured out that there looks to be substantial differences in mass between the green and pink groups, and the control versus treatment. Let’s try to substantiate that with some statistics.
9. Mean and covariance
Just like anything in R, statistical tools are functions. Some of them come in special packages, but base R can do a lot of stuff out of the box.
Comparsion of two means: We’ve already gotten means from the mean() function and from summary(). Variance and standard deviation are calculated with var() and sd() respectively. Comparing the means betweeen two groups with a t-test or a Wilcoxon-Mann-Whitney test is done with t.test() and wilcox.test(). The functions have the word test in their names, but t-test gives not only the test statistics and p-values, but also estimates and confidence intervals. The parameters are two vectors of values of each group (i.e. a column from the subset of a data frame), and some options.
Looking back at this plot, I guess no-one is surprised by a difference in birthweigh between pink and green comb gnomes:
t.test(subset(data, group=="pink")$mass0, subset(data, group=="green")$mass0)
Welch Two Sample t-test data: subset(data, group == "pink")$mass0 and subset(data, group == "green")$mass0 t = -5.397, df = 96.821, p-value = 4.814e-07 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -69.69294 -32.21577 sample estimates: mean of x mean of y 102.3755 153.3298
That is, we feed in two pieces of data (two vectors, really, which is what you get pulling out a column from a data frame). The above is the typical situation when you have all data points in one column and a group indicator in another. Hence you begin by subsetting the data frame to get the right rows, and pull out the right columns with the $. t.test also does paired tests, with the additional parameter paired=T.
wilcox.test(subset(data, group=="pink")$mass50, subset(data, group=="green")$mass50)
Wilcoxon rank sum test with continuity correction data: subset(data, group == "pink")$mass50 and subset(data, group == "green")$mass50 W = 605, p-value = 1.454e-05 alternative hypothesis: true location shift is not equal to 0
Recalling histograms for the comb gnome weights, the use of the Wilcoxon-Mann-Whitney for masses at tim 50 and a t-test for the masses at birth (t=0) probably makes sense. However, we probably want to make use of all the time points together rather than doing a test for each time point, and we also want to deal with both the colour and the treatment at the same time.
Before we get there, let’s look at correlation:
cor(data$mass10, data$mass25) cor(data$mass0, data$mass50, method="spearman") cor.test(data$mass10, data$mass25)
The cor() function gives you correlation coefficients, both Pearson, Spearman and Kendall. If you want the covariance, cov() is the function for that. cor.test() does associated tests and confidence intervals. One thing to keep in mind is missing values. This data set is complete, but try this:
some.missing <- data$mass0 some.missing[c(10, 20, 30, 40:50)] <- NA cor(some.missing, data$mass25) cor(some.missing, data$mass10, use="pairwise")
The use parameter decides what values R should include. The default is all, but we can choose pairwise complete observations instead.
If you have a big table of variables that you’d like to correlate with each other, the cor() function works for them as well. (Not cor.test(), though. However, the function can be applied across the rows of a data frame. We’ll return to that.)
10. A couple of simple linear models
Honestly, most of the statistics in biology is simply linear models fit with least squares and tested with a normal error model. A linear model looks like this
yi = b0 + b1x1i + b2x2i + … bnxni + ei
where y is the response variable, the xs are predictors, i is an index over the data points, and ei are the errors. The error is the only part of the equations that is a random variable. b0, …, bn are the coefficients — your main result, showing how the mean difference in the response variable between data points with different values of the predictors. The coefficients are fit by least squares, and by estimating the variance of the error term, we can get some idea of the uncertainty in the coefficients.
Regression coefficients can be interpreted as predictions about future values or sometimes even as causal claims (depending on other assumptions), but basically, they describe differences in mean values.
This is not a text on linear regression — there are many of those; may I suggest the books by Faraway or Gelman and Hill — suffice to say that as long as the errors are independent and have equal variance, least squares is the best unbiased estimate. If we also assume that the errors are normally distributed, the least squares is also the maximum likelihood estimate. (And it’s essentially the same as a Bayesian version of the linear model with vague priors, just for the record.)
In R, the lm() function handles linear models. The model is entered as a formula of the type response ~ predictor + predictor * interacting predictors. The error is implicit, and assumed to be normally distributed.
model <- lm(mass0 ~ group + treatment, data=data) summary(model)
Call: lm(formula = mass0 ~ group + treatment, data = data) Residuals: Min 1Q Median 3Q Max -86.220 -32.366 -2.847 35.445 98.417 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 141.568 7.931 17.850 < 2e-16 *** grouppink -49.754 9.193 -5.412 4.57e-07 *** treatmentpixies 23.524 9.204 2.556 0.0122 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 45.67 on 96 degrees of freedom Multiple R-squared: 0.28, Adjusted R-squared: 0.265 F-statistic: 18.67 on 2 and 96 DF, p-value: 1.418e-07
The summary gives the coefficients, their standard errors, the p-value of a t-test of the regression coefficient, and R squared for the model. Factors are encoded as dummy variables, and R has picked the green group and the control treatment as baseline so the coefficient ”grouppink” describes how the mean of the pink group differs from the green. Here are the corresponding confidence intervals:
confint(model)
2.5 % 97.5 % (Intercept) 125.825271 157.31015 grouppink -68.001759 -31.50652 treatmentpixies 5.254271 41.79428
(These confidence intervals are not adjusted to control the family-wise error rate, though.) With only two factors, the above table is not that hard to read, but let’s show a graphical summary. Jared Lander’s coefplot gives us a summary of the coefficients:
install.packages("coefplot") ##only the first time
library(coefplot)
coefplot(model)
The bars are 2 standard deviations. This kind of plot gives us a quick look at the coefficients, and whether they are far from zero (and therefore statistically significant). It is probably more useful for models with many coefficients.
There is a bunch of diagnostic plots that you can make to check for gross violations of the above assumptions of the linear model. Two useful ones are the normal quantile-quantile plot of residuals, and the residuals versus fitted scatterplot:
library(ggplot2) qplot(sample=residuals(model), stat="qq")
The quantile plot compares the distribution of the residual to the quantiles of a normal distribution — if the residuals are normally distributed it will be a straight line.
qplot(fitted(model), residuals(model))
Variance should be roughly equal fitted values, and there should not be obvious patterns in the data.
If these plots look terrible, a common approach is to try to find a transformation of the data that allows the linear model to be used anyway. For instance, it often helps to take the logarithm of the response variable. Why is that so useful? Well, with some algebraic magic:
log(yi) = b0 + b1x1i + b2x2i + … + bnxni + ei, and as long as no y:s are zero,
yi = exp(b0) * exp(b1x1i) * exp(b2x2i) * .. * exp(bnxni) * exp(ei)
We have gone from a linear model to a model where the b:s and x:es multiplied together. For some types of data, this will stabilise the variance of the errors, and make the distribution closer to a normal distribution. It’s by no means a panacea, but in the comb gnome case, I hope the plots we made in part II have already convinced you that an exponential function might be involved.
Let’s look at a model where these plots look truly terrible: the weight at time 50.
model.50 <- lm(mass50 ~ group + treatment, data=data) qplot(sample=residuals(model.50), stat="qq") qplot(fitted(model.50), residuals(model.50))
Let’s try the log transform:
model.log.50 <- lm(log(mass50) ~ group + treatment, data=data) qplot(sample=residuals(model.log.50), stat="qq") qplot(fitted(model.log.50), residuals(model.log.50)) coefplot(model.log.50)
In both the above models both predictors are categorical. When dealing with categorical predictors, you might prefer the analysis of variance formalism. Anova is the same kind of linear model as regression (but usually parameterised slightly differently), followed by F-tests to check whether each predictor explains a significant amount of the variance in the response variable. To see the Anova table for a linear model in R, do this:
comb.gnome.anova <- aov(log(mass50) ~ group + treatment, data=data) summary(comb.gnome.anova)
Df Sum Sq Mean Sq F value Pr(>F) group 1 35.87 35.87 73.26 1.84e-13 *** treatment 1 90.76 90.76 185.36 < 2e-16 *** Residuals 96 47.00 0.49 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
More Haskell: a bootstrap
So my playing around with Haskell goes on. You can follow the progress of the little bootstrap exercise on github. Now it’s gotten to the point where it actually does a bootstrap interval for the mean of a sample. Consider the following R script:
n <- 100
fake.data <- data.frame(group=rep(1, n), data=rpois(n, 10))
write.table(fake.data, quote=F, row.names=F, col.names=F,
sep=",", file="fake_data.csv")
library(plyr)
bootstrap.replicates <- llply(vector("list", 100),
sample, x=fake.data$data,
replace=T, size=n)
bootstrap.means <- unlist(llply(bootstrap.replicates, mean))
print(mean(fake.data$data))
print(quantile(bootstrap.means, c(0.025, 0.975)))
[1] 10.31
2.5% 97.5%
9.72475 10.85200
So, that was a simple bootstrap in R: we get some draws from a Poisson distribution, sample 100 times from the data with replacement, and summarise the replicates. This is my Haskell thing running in GHCi:
*Main> main "boot" "will eventually bootstrap, if martin knows his stuff" fake_data.csv [8,6,11,16,5,11,12,12,7,9,13,13,12,7,13,7,7,11,9,14,14,13,10,14,17,12,8, 10,15,12,13,13,7,10,9,6,7,8,10,12,10,10,10,12,11,8,16,12,13,13,12,15,7, 7,8,9,5,7,13,10,12,11,8,6,12,14,12,14,6,9,10,9,10,6,9,7,6,12,13,7,11,7, 13,15,10,10,9,12,12,6,10,6,8,10,13,8,9,13,12,13] 10.31 (9.8,10.83)
It’s certainly not the prettiest thing in the world (for one thing, it will crash if there is an extra line break at the end of the file). Next stop: type declarations! Haskell will infer the types for me, but it is probably a good idea to declare the intended types. Or at least to be able to do so is. Then the plan is to make some use of the first column in the data file, i.e. group the sample belongs to, to add a second sample and make a comparison between the means. And then it’s pretty much done and maybe I’ll move on to something more useful. I’m thinking that implementing least squares linear models would be a decent exercise?
Using R: accessing PANTHER classifications
Importing, subsetting, merging and exporting various text files with annotation (in the wide sense, i.e. anything that might help when interpreting your experiment) is not computation and it’s not biology either, but it’s housekeeping that needs to be done. Everyone has a weapon of choice for general-purpose scripting and mine is R. Yes, this is ”quite trivial if you only think it out”, but it still takes me some time. This script is a work in progress and could certainly be much cleaner and friendlier, but I’ll post it for the benefit of any other user that might google for this kind of thing.
Panther is a protein database with phylogenetic trees of proteins annotated with Gene Ontology terms and organised into pathways. (See the paper in the 2013 database issue of NAR.) Right now, I’m after pathway classification of chicken proteins. The pathways are also available in some xml formats for systems biologists, but I’m going to use the classification table. It contains all UniProtKB proteins, so it should cover most known genes products.
Note, if you just want to check up on a few genes or a few pathways the Panther web interface seems pretty nice. If you’re after nice pathway diagrams, also check out the website and the SBML format. But for accessing classifications en masse, a treatment in R may be useful.
For this post, I’ve broken down the script into parts. If you want a function for the whole thing, see github.
First, download the classification data for your favourite organism from the Panther ftp.
panther <- read.delim(filename, sep="\t", head=F,
stringsAsFactors=F)
colnames(panther)[1] <- "gene.id"
colnames(panther)[3:5] <- c("panther.id", "panther.family",
"panther.subfamily")
colnames(panther)[6:8] <- c("go.mf", "go.bp", "go.cc")
colnames(panther)[9:10] <- c("panther.ontology", "panther.pathway")
This is a tab separated text file. Since some fields at the end of lines may be left empty, we use read.delim instead of read.table. I add som column names that I think agrees reasonably well with the readme. The first is a gene id string that contains mapping to Ensembl or Entrez ids, as well as the uniprot accession. It looks something like this:
CHICK|Gene=ENSGALG00000013995|UniProtKB=F1P447
CHICK|ENTREZ=378784|UniProtKB=Q75XU5
Of course, we want the ids as separate columns: stringr does regular expressions in a vectorised manner. str_match gets grouped matches into arrays. (Yes, it can be done with a single regexp, but I don’t take pleasure in that kind of thing.)
panther$ensembl.id <- str_match(panther$gene, "Gene=(.*)\\|")[,2] panther$uniprot.id <- str_match(panther$gene, "UniProtKB=(.*)$")[,2] panther$entrez.id <- str_match(panther$gene, "ENTREZ=(.*)\\|")[,2]
The gene ontology fields are terms from each ontology, stringed together and separated by semicolons. This is not the best format for querying, so I’ll make a list of GO ids from each ontology:
parse.go <- function(go.column) {
go.list <- str_match_all(go.column, "GO:([0-9]*)")
names(go.list) <- panther$gene.id.string
go.list <- llply(go.list, function(x) {if (!is.null(dim(x))) x[,1]})
return(go.list)
}
go.mf <- parse.go(panther$go.mf)
go.bp <- parse.go(panther$go.bp)
go.cc <- parse.go(panther$go.cc)
Finally, the pathway column. It says this in the readme:
***Example Pathway:
Inflammation mediated by chemokine and cytokine signaling pathway#Inflammation mediated by chemokine and cytokine signaling pathway#P00031>Integrin#Integrin#P00853;Integrin signalling pathway#Integrin signalling pathway#P00034>Integrin alpha#Integrin alpha#P00941The format of the pathway information is: pathway_long_name#pathway_short_name#pathway_accession>
component_long_name#component_short_name#component_accessionExplanation of pathway accessions:
Gxxxxx Gene or RNA
Pxxxxx Protein
Sxxxxx small molecules
Uxxxxx others, such as ”unknown”, etc.
For now, I’d like to keep just the pathway id, which is the first id of each pathway entry. Again, there are often multiple entries for the same gene.
pathway.list <- str_match_all(panther$panther.pathway,
"#([G,P,S,U][0-9]*)>")
names(pathway.list) <- panther$gene.id.string
pathway.list <- llply(pathway.list,
function(x) {if (!is.null(dim(x))) x[,2]})
We might want to have these additional descriptions (names of GO terms, name of pathway) later, but for now, I’ll be satisfied with the unique ids. Let’s package everything in a list, and slap a class attribute on it for good measure.
panther.classification <- list(data=panther,
go.mf=go.mf,
go.bp=go.bp,
go.cc=go.cc,
panther.pathway=pathway.list)
class(panther.classification) <- "panther.classification"
Now we can get the pathway ids associated with our favourite gene. Take for example GNB3 which has Uniprot id E1C9D6.
get.pathways <- function(panther, acc) {
ix.uni <- which(panther$data$uniprot.id == acc)
ix.ens <- which(panther$data$ensembl.id == acc)
ix.ent <- which(panther$data$entrez.id == acc)
if (length(ix.uni) > 0) {
ix <- ix.uni
} else if (length(ix.ens > 0)) {
ix <- ix.ens
} else if (length(ix.ent > 0)){
ix <- ix.ent
}
panther$panther.pathway[[ix]]
}
get.pathways(panther.classification, "E1C9D6")
I'll get back to this to maybe do something more useful with it.
A slightly different introduction to R, part III
I think you’ve noticed by now that a normal interactive R session is quite messy. If you don’t believe me, try playing around for a while and then give the history() command, which will show you the commands you’ve typed. If you’re anything like me, a lot of them are malformed attempts that generated some kind of error message.
Hence, even for quite simple tasks, keeping some kind of notebook of the R commands you’ve used is simply a must. The easiest way to keep track of what you do is probably to copy parts of your history to a text document. However, I strongly recommend that you put in a little extra effort. This part will introduce R scripts and Sweave, which is nowadays integrated into base R. Using Sweave is a little more work than a script file, but often a lot better, since it gathers the output you generate (including plots) and formats it into a pretty neat report. Even if Sweave isn’t your thing, I suggest that you at least try out the scripting approach. In the end, it is pretty much guaranteed to save time and decrease (though not completely abolish) frustration.
7. Scripting
An R script is nothing more than a text file of commands. That’s it. Simply write the commands in your favourite plain text editor, save the file — scripts usually have the file extension .R — and then give the following R command:
source("some_script.R")
R will silently perform the commands in the specified order. If anything goes wrong, an error message will appear. To write a script from within Rstudio, select File > New > R Script. An editor area will open from which you can directly run single lines or the whole script.
This is an example of an R script to run some descriptive stats on the comb gnome data set:
## An example script for the R introduction blog.
## Will print some summary statistics from the comb gnome data.
data <- read.csv("comb_gnome_data.csv")
print(mean(subset(data, group=="pink")$mass0))
print(sd(subset(data, group=="pink")$mass0))
print(mean(data$mass0))
print(sd(data$mass0))
One difference from entering the commands interactively is that text output will be suppressed, unless you set the echo parameter to true in the source() function:
source("some_script.R", echo=T)
Hence the print() function, explicitly telling R to print the output of that particular function.
Also, lines that being with ”##” are comments. If you are using a good geeky text editor (such as the editor in Rstudio) it will understand R syntax and highlight comments, as well as reserved words such as ”library” above, in some stand out colour.
R doesn’t care about whitespaces, so use them liberally to break up your code into blocks of related commands.
Let’s do an example with a plot:
data <- read.csv("comb_gnome_data.csv")
library(reshape2)
melted <- melt(data, id.vars=c("id", "group", "treatment"))
library(ggplot2)
p <- qplot(x=variable, y=value, color=treatment, geom="jitter", data=melted)
Here we can make use of the fact that qplot returns the plot as an R object. We can give each plot a name and then recall them to look at them.
In the above script I repeated myself a little, which is less than ideal. For a large project, you would probably want some script to process the data and save them in a nice object, and then separate scripts for data analysis, making graphics etc.
You can of course use source() in a script, like so:
source("data_preparation.R")
qplot(x=variable, y=value, color=treatment, geom="jitter", data=melted)
I suggest, as much as possible, making sure each script can run form start to finish without user input, and beginning each script file with a little comment that explains what it does, what data files are needed and what output the script produces.
If you have to step in and manually change the order of some columns (or a similar mundane task) before making that last plot, that’s not such a big deal the first couple of times. But if you need to revisit the analysis two months later to change some detail, you’ve probably forgotten all about that. Do your future self a big favour and script every part of the process!
8. Sweaving
Sweave is a so-called literate programming tool for R. That sounds very fancy, but when analysing data with R, it’s simply an easy way to organise code and output. You write the R code in a special type of file, and when it is run Sweave will gather all the output, including graphics, and organise them into a LaTeX document. Latex (I’m not going to stick to that silly capitalisation, sorry) is a free-as-in-freedom typesetting software, essentially a word processor where you write code instead of working in a graphical interface. It’s commonly used by technical folks for journal articles, reports, documentation and what not.
Latex is great, but sometimes not so intuitive — overall, though, I think both Microsoft Word and Open Office have caused me more frustration than Latex. It can do a lot of stuff, for instance, it’s particularly good with mathematical formulae. However, unless you want to make reports for publishing you’ll probably not have to learn that much of Latex. And as usual, most of the time you can just type any error message into a search engine to get help. (Two common problems are using reserved special characters, such as ”_” in the text, and making code blocks for graphics output, but forgetting to output any graphics. See below.)
In Rstudio, choose File > New > R Sweave. A few Latex codes are inserted from the beginning. This is about the bare minimum to make a Latex document. Move the cursor down below the ”\begin{document}” line, and start writing some text. When you want to enter code preface it with ”<<>>=” on its own line, and end the code with a ”@”:
\documentclass{article}
\begin{document}
\SweaveOpts{concordance=TRUE}
This is a Latex document. Text goes here.
<<>>=
data <- read.csv("comb_gnome_data.csv")
library(reshape2)
melted <- melt(data, id.vars=c("id", "group", "treatment"))
@
\end{document}
When you run the Sweave file, R will execute the code blocks just like a regular script file. The output will be captured and put into a Latex document together with the text you wrote around the code blocks.
Sweave files usually end in .Rnw. Running a Sweave file is almost the same as running a script:
Sweave("some_sweave_file.Rnw")
After Sweaving, R helpfully suggests that you run Latex on a .tex file that it has created. If you work on a terminal in Mac OS X, some GNU/linux distribution or similar, this translates to ”latex some_report.tex” or ”pdflatex some_report.tex”. In Rstudio, you can press the Compile PDF button. To do this, you will need to have Latex installed.
When things go wrong, Sweave will terminate and display the error message as usual. It also tells you in which code block the error occured. The blocks aren’t numbered in the .Rnw file, but with the Stangle() function, you will write out an R file where the chunks are numbered. You can open it and search for the part where the error occured. Of course, you can also run it as a script file.
As mentioned above, a code block can output graphics and insert them into the report. To specify a code block with graphics output, use ”<<fig=T>>=”. In the case of ggplot2 graphics, which we have used in this introduction, you will have to use the print() function on them for them to be displayed.
<<fig=T>>= print(qplot(x=variable, y=value, color=treatment, geom="jitter", data=melted)) @
In the text between the code blocks, you can use any Latex formatting. Rstudio can help you with some, if you press the Format button.
Finally, I put the above blocks together to this Sweave file, and got this output pdf. Neat, isn’t it?
That was two ways to organise R code. I still think it’s fine, even advisable, to spend a lot of time in interactive R trying stuff out. But at the end of the session, a script or Sweave file that runs without manual tinkering should be the goal. I usually go for a Sweave file for any piece of analysis that will output plots, tables, model calculations etc, but make scripts that I can source() for more general things that I will reuse.
Okay, enough with my preaching. Next time, we’ll try some actual statistics.
Using R: writing a table with odd lines (again)
Let’s look at my gff track headers again. Why not do it with plyr instead?
d_ply splits the data frame by the feature column and applies a nameless function that writes subsets to the file (and returns nothing, hence the ”_” in the name). This isn’t shorter or necessarily better, but it appeals to me.
library(plyr)
connection <- file("separate_tracks_2.gff", "w")
d_ply(gff, "feature", function(x) {
writeLines(paste("track name=", x$feature[1], sep=""), connection)
write.table(x, sep="\t", row.names=F, col.names=F,
quote=F, file=connection)
})
close(connection)
Using R: writing a table with odd lines (GFF track headers)
The other day, I wanted to add track lines to a GFF file, so that I could view different features as separate custom tracks in a genome browser. The need to shuffle genome coordinates between different file formats seems to occur all the time when you deal with some kind of bioinformatic data. It’s usually just text files; one just has to keep track of whether the positions should start on 0 or 1 and whether the end should include the last base or not . . .
> head(gff)
seqname source feature start end score strand 1 5 protein_coding mRNA 169010747 169031776 . + 2 5 protein_coding protein 169015421 169021641 . + 3 5 protein_coding five_prime_UTR 169010747 169010893 . + 4 5 protein_coding five_prime_UTR 169015398 169015420 . + 5 5 protein_coding CDS 169015421 169015579 . + 6 5 protein_coding CDS 169018052 169018228 . + frame group 1 . ID=ENST00000504258;Name=CCDC99-005;Parent=ENSG00000040275 2 . ID=ENSP00000421249;Name=CCDC99-005;Parent=ENST00000504258 3 . Parent=ENST00000504258 4 . Parent=ENST00000504258 5 0 Name=CDS:CCDC99;Parent=ENST00000504258 6 0 Name=CDS:CCDC99;Parent=ENST00000504258
The above example consists of a few lines from the Ensembl human database, not the actual tracks I was interested in. Anyway, this is what I did: instead of using write.table() directly, explicitly open a file for writing, first write some track line, then write the relevant subset, and repeat.
tracks <- unique(gff$feature)
connection <- file("separate_tracks.gff", "w")
for (k in 1:length(tracks)) {
writeLines(paste("track name=", tracks[k], sep=""), connection)
write.table(subset(gff, feature==tracks[k]),
sep="\t", row.names=F, col.names=F,
quote=F, file=connection)
}
close(connection)











