Archive for the ‘english’ Category
Slides and exercise from my second R intro seminar
This week I held the second introductory seminar on R, and I think it went pretty well — though I guess you really should ask my colleagues if you want to know. The first seminar was a lecture, and this seminar was a tutorial where we made some plots and calculated a few of the usual statistics. Of course the only real way to learn R is to play with it, but I hope this couple of hours provided a decent opening to getting started with R.
I actually think RStudio made it quite a bit easier. One important thing that I certainly should’ve stressed more, though, is organising code into scripts. I mentioned it in the first seminar, but I should have included it into the exercise. Maybe the first section should be something like ”Start a new script file for your analysis: Select File > New > R script to open the editor area. Save the script as unicorn_analysis.R, and for the rest of the tutorial, write your code in that window.”
Files from the seminar:
Slides about common statistical functions in R (again, ugly walls of text meant more as notes for future reference than actual slides)
Exercises for the tutorial and the associated dataset
My suggested solutions to the exercises
Slides from my R intro seminar
Here are my slides from a short introductory seminar on R (essentially going through part I of the R tutorial) last week. As magic lantern pictures go, they’re hideously ugly, but they were mostly there for future reference. Most of the seminar was spent showing RStudio. This Friday, we’ll practice some uses of qplot and make some linear models.
(I took out the Modern Major-General quote from the presentation, but you can enjoy him here instead. Don’t ask.)
From Uppsala
On a personal note, I had a great time at the Genetics of Adaptations symposium in Uppsala last Saturday. Pretty much everything was interesting, and I particularly enjoyed the following:
- Bruce Walsh himself explaining G-matrices and how they can put constraints on evolution. The G-matrix is one of those things I’d very much like to understand, and listening to someone like Bruce Walsh certainly helps. (See e.g. this paper by Walsh & Blows)
- Matt Rockman talked about some serious QTN work on awesomley weird phenotypes: worms depositing copulatory plugs on each other’s and their own heads. (See e.g. Palopoli et al.)
- Saunak Sen spoke about mapping of function-valued traits, probably the most interesting talk to me. He concentrated on traits that are functions of one variable, namely time. (See Xiong et al.) However the most interesting to me (as a gene expression enthusiast) would be traits that are, as he put it, ”massively multivariate”, like eQTL data. In that case, there’s not really an obvious analogue of time, i.e. something that the values from one individual are a function of. I eagerly await what people might come up with in that regard.
It was a really fun time, and Uppsala is always nice. I’ll have to make sure to be there when the evolution museum is open some time. I always get the feeling that I should be better at, you know, networking at these things, but I had a couple of interesting conversations about making sense of gene expression results (incidentally, something that I’m very likely to blog about in the near future).
Using R: Correlation heatmap with ggplot2
Just a short post to celebrate that I learned today how incredibly easy it is to make a heatmap of correlations with ggplot2 (and reshape2, of course).
data(attitude) library(ggplot2) library(reshape2) qplot(x=Var1, y=Var2, data=melt(cor(attitude)), fill=value, geom="tile")
So, what is going on in that short passage? cor makes a correlation matrix with all the pairwise correlations between variables (twice; plus a diagonal of ones). melt takes the matrix and creates a data frame in long form, each row consisting of id variables Var1 and Var2 and a single value. We then plot with the tile geometry, mapping the indicator variables to rows and columns, and value (i.e. correlations) to the fill colour.
A slightly different introduction to R, part IV
Now, after reading in data, making plots and organising commands with scripts and Sweave, we’re ready to do some numerical data analysis. If you’re following this introduction, you’ve probably been waiting for this moment, but I really think it’s a good idea to start with graphics and scripting before statistical calculations.
We’ll use the silly comb gnome dataset again. If you saved an Rdata file in part II, you can load it with
load("comb_gnome_data.Rdata")
If not, you can run this. Remind yourself what the changes to the melted data mean:
data <- read.csv("comb_gnome_data.csv")
library(reshape2)
melted <- melt(data, id.vars=c("id", "group", "treatment"))
melted$time <- 0
melted$time[which(melted$variable=="mass10"] <- 10
melted$time[which(melted$variable=="mass25")] <- 25
melted$time[which(melted$variable=="mass50")] <- 50
melted$id <- factor(melted$id)
We’ve already looked at some plots and figured out that there looks to be substantial differences in mass between the green and pink groups, and the control versus treatment. Let’s try to substantiate that with some statistics.
9. Mean and covariance
Just like anything in R, statistical tools are functions. Some of them come in special packages, but base R can do a lot of stuff out of the box.
Comparsion of two means: We’ve already gotten means from the mean() function and from summary(). Variance and standard deviation are calculated with var() and sd() respectively. Comparing the means betweeen two groups with a t-test or a Wilcoxon-Mann-Whitney test is done with t.test() and wilcox.test(). The functions have the word test in their names, but t-test gives not only the test statistics and p-values, but also estimates and confidence intervals. The parameters are two vectors of values of each group (i.e. a column from the subset of a data frame), and some options.
Looking back at this plot, I guess no-one is surprised by a difference in birthweigh between pink and green comb gnomes:
t.test(subset(data, group=="pink")$mass0, subset(data, group=="green")$mass0)
Welch Two Sample t-test data: subset(data, group == "pink")$mass0 and subset(data, group == "green")$mass0 t = -5.397, df = 96.821, p-value = 4.814e-07 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -69.69294 -32.21577 sample estimates: mean of x mean of y 102.3755 153.3298
That is, we feed in two pieces of data (two vectors, really, which is what you get pulling out a column from a data frame). The above is the typical situation when you have all data points in one column and a group indicator in another. Hence you begin by subsetting the data frame to get the right rows, and pull out the right columns with the $. t.test also does paired tests, with the additional parameter paired=T.
wilcox.test(subset(data, group=="pink")$mass50, subset(data, group=="green")$mass50)
Wilcoxon rank sum test with continuity correction data: subset(data, group == "pink")$mass50 and subset(data, group == "green")$mass50 W = 605, p-value = 1.454e-05 alternative hypothesis: true location shift is not equal to 0
Recalling histograms for the comb gnome weights, the use of the Wilcoxon-Mann-Whitney for masses at tim 50 and a t-test for the masses at birth (t=0) probably makes sense. However, we probably want to make use of all the time points together rather than doing a test for each time point, and we also want to deal with both the colour and the treatment at the same time.
Before we get there, let’s look at correlation:
cor(data$mass10, data$mass25) cor(data$mass0, data$mass50, method="spearman") cor.test(data$mass10, data$mass25)
The cor() function gives you correlation coefficients, both Pearson, Spearman and Kendall. If you want the covariance, cov() is the function for that. cor.test() does associated tests and confidence intervals. One thing to keep in mind is missing values. This data set is complete, but try this:
some.missing <- data$mass0 some.missing[c(10, 20, 30, 40:50)] <- NA cor(some.missing, data$mass25) cor(some.missing, data$mass10, use="pairwise")
The use parameter decides what values R should include. The default is all, but we can choose pairwise complete observations instead.
If you have a big table of variables that you’d like to correlate with each other, the cor() function works for them as well. (Not cor.test(), though. However, the function can be applied across the rows of a data frame. We’ll return to that.)
10. A couple of simple linear models
Honestly, most of the statistics in biology is simply linear models fit with least squares and tested with a normal error model. A linear model looks like this
yi = b0 + b1x1i + b2x2i + … bnxni + ei
where y is the response variable, the xs are predictors, i is an index over the data points, and ei are the errors. The error is the only part of the equations that is a random variable. b0, …, bn are the coefficients — your main result, showing how the mean difference in the response variable between data points with different values of the predictors. The coefficients are fit by least squares, and by estimating the variance of the error term, we can get some idea of the uncertainty in the coefficients.
Regression coefficients can be interpreted as predictions about future values or sometimes even as causal claims (depending on other assumptions), but basically, they describe differences in mean values.
This is not a text on linear regression — there are many of those; may I suggest the books by Faraway or Gelman and Hill — suffice to say that as long as the errors are independent and have equal variance, least squares is the best unbiased estimate. If we also assume that the errors are normally distributed, the least squares is also the maximum likelihood estimate. (And it’s essentially the same as a Bayesian version of the linear model with vague priors, just for the record.)
In R, the lm() function handles linear models. The model is entered as a formula of the type response ~ predictor + predictor * interacting predictors. The error is implicit, and assumed to be normally distributed.
model <- lm(mass0 ~ group + treatment, data=data) summary(model)
Call: lm(formula = mass0 ~ group + treatment, data = data) Residuals: Min 1Q Median 3Q Max -86.220 -32.366 -2.847 35.445 98.417 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 141.568 7.931 17.850 < 2e-16 *** grouppink -49.754 9.193 -5.412 4.57e-07 *** treatmentpixies 23.524 9.204 2.556 0.0122 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 45.67 on 96 degrees of freedom Multiple R-squared: 0.28, Adjusted R-squared: 0.265 F-statistic: 18.67 on 2 and 96 DF, p-value: 1.418e-07
The summary gives the coefficients, their standard errors, the p-value of a t-test of the regression coefficient, and R squared for the model. Factors are encoded as dummy variables, and R has picked the green group and the control treatment as baseline so the coefficient ”grouppink” describes how the mean of the pink group differs from the green. Here are the corresponding confidence intervals:
confint(model)
2.5 % 97.5 % (Intercept) 125.825271 157.31015 grouppink -68.001759 -31.50652 treatmentpixies 5.254271 41.79428
(These confidence intervals are not adjusted to control the family-wise error rate, though.) With only two factors, the above table is not that hard to read, but let’s show a graphical summary. Jared Lander’s coefplot gives us a summary of the coefficients:
install.packages("coefplot") ##only the first time
library(coefplot)
coefplot(model)
The bars are 2 standard deviations. This kind of plot gives us a quick look at the coefficients, and whether they are far from zero (and therefore statistically significant). It is probably more useful for models with many coefficients.
There is a bunch of diagnostic plots that you can make to check for gross violations of the above assumptions of the linear model. Two useful ones are the normal quantile-quantile plot of residuals, and the residuals versus fitted scatterplot:
library(ggplot2) qplot(sample=residuals(model), stat="qq")
The quantile plot compares the distribution of the residual to the quantiles of a normal distribution — if the residuals are normally distributed it will be a straight line.
qplot(fitted(model), residuals(model))
Variance should be roughly equal fitted values, and there should not be obvious patterns in the data.
If these plots look terrible, a common approach is to try to find a transformation of the data that allows the linear model to be used anyway. For instance, it often helps to take the logarithm of the response variable. Why is that so useful? Well, with some algebraic magic:
log(yi) = b0 + b1x1i + b2x2i + … + bnxni + ei, and as long as no y:s are zero,
yi = exp(b0) * exp(b1x1i) * exp(b2x2i) * .. * exp(bnxni) * exp(ei)
We have gone from a linear model to a model where the b:s and x:es multiplied together. For some types of data, this will stabilise the variance of the errors, and make the distribution closer to a normal distribution. It’s by no means a panacea, but in the comb gnome case, I hope the plots we made in part II have already convinced you that an exponential function might be involved.
Let’s look at a model where these plots look truly terrible: the weight at time 50.
model.50 <- lm(mass50 ~ group + treatment, data=data) qplot(sample=residuals(model.50), stat="qq") qplot(fitted(model.50), residuals(model.50))
Let’s try the log transform:
model.log.50 <- lm(log(mass50) ~ group + treatment, data=data) qplot(sample=residuals(model.log.50), stat="qq") qplot(fitted(model.log.50), residuals(model.log.50)) coefplot(model.log.50)
In both the above models both predictors are categorical. When dealing with categorical predictors, you might prefer the analysis of variance formalism. Anova is the same kind of linear model as regression (but usually parameterised slightly differently), followed by F-tests to check whether each predictor explains a significant amount of the variance in the response variable. To see the Anova table for a linear model in R, do this:
comb.gnome.anova <- aov(log(mass50) ~ group + treatment, data=data) summary(comb.gnome.anova)
Df Sum Sq Mean Sq F value Pr(>F) group 1 35.87 35.87 73.26 1.84e-13 *** treatment 1 90.76 90.76 185.36 < 2e-16 *** Residuals 96 47.00 0.49 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
More Haskell: a bootstrap
So my playing around with Haskell goes on. You can follow the progress of the little bootstrap exercise on github. Now it’s gotten to the point where it actually does a bootstrap interval for the mean of a sample. Consider the following R script:
n <- 100
fake.data <- data.frame(group=rep(1, n), data=rpois(n, 10))
write.table(fake.data, quote=F, row.names=F, col.names=F,
sep=",", file="fake_data.csv")
library(plyr)
bootstrap.replicates <- llply(vector("list", 100),
sample, x=fake.data$data,
replace=T, size=n)
bootstrap.means <- unlist(llply(bootstrap.replicates, mean))
print(mean(fake.data$data))
print(quantile(bootstrap.means, c(0.025, 0.975)))
[1] 10.31
2.5% 97.5%
9.72475 10.85200
So, that was a simple bootstrap in R: we get some draws from a Poisson distribution, sample 100 times from the data with replacement, and summarise the replicates. This is my Haskell thing running in GHCi:
*Main> main "boot" "will eventually bootstrap, if martin knows his stuff" fake_data.csv [8,6,11,16,5,11,12,12,7,9,13,13,12,7,13,7,7,11,9,14,14,13,10,14,17,12,8, 10,15,12,13,13,7,10,9,6,7,8,10,12,10,10,10,12,11,8,16,12,13,13,12,15,7, 7,8,9,5,7,13,10,12,11,8,6,12,14,12,14,6,9,10,9,10,6,9,7,6,12,13,7,11,7, 13,15,10,10,9,12,12,6,10,6,8,10,13,8,9,13,12,13] 10.31 (9.8,10.83)
It’s certainly not the prettiest thing in the world (for one thing, it will crash if there is an extra line break at the end of the file). Next stop: type declarations! Haskell will infer the types for me, but it is probably a good idea to declare the intended types. Or at least to be able to do so is. Then the plan is to make some use of the first column in the data file, i.e. group the sample belongs to, to add a second sample and make a comparison between the means. And then it’s pretty much done and maybe I’ll move on to something more useful. I’m thinking that implementing least squares linear models would be a decent exercise?
Using R: accessing PANTHER classifications
Importing, subsetting, merging and exporting various text files with annotation (in the wide sense, i.e. anything that might help when interpreting your experiment) is not computation and it’s not biology either, but it’s housekeeping that needs to be done. Everyone has a weapon of choice for general-purpose scripting and mine is R. Yes, this is ”quite trivial if you only think it out”, but it still takes me some time. This script is a work in progress and could certainly be much cleaner and friendlier, but I’ll post it for the benefit of any other user that might google for this kind of thing.
Panther is a protein database with phylogenetic trees of proteins annotated with Gene Ontology terms and organised into pathways. (See the paper in the 2013 database issue of NAR.) Right now, I’m after pathway classification of chicken proteins. The pathways are also available in some xml formats for systems biologists, but I’m going to use the classification table. It contains all UniProtKB proteins, so it should cover most known genes products.
Note, if you just want to check up on a few genes or a few pathways the Panther web interface seems pretty nice. If you’re after nice pathway diagrams, also check out the website and the SBML format. But for accessing classifications en masse, a treatment in R may be useful.
For this post, I’ve broken down the script into parts. If you want a function for the whole thing, see github.
First, download the classification data for your favourite organism from the Panther ftp.
panther <- read.delim(filename, sep="\t", head=F,
stringsAsFactors=F)
colnames(panther)[1] <- "gene.id"
colnames(panther)[3:5] <- c("panther.id", "panther.family",
"panther.subfamily")
colnames(panther)[6:8] <- c("go.mf", "go.bp", "go.cc")
colnames(panther)[9:10] <- c("panther.ontology", "panther.pathway")
This is a tab separated text file. Since some fields at the end of lines may be left empty, we use read.delim instead of read.table. I add som column names that I think agrees reasonably well with the readme. The first is a gene id string that contains mapping to Ensembl or Entrez ids, as well as the uniprot accession. It looks something like this:
CHICK|Gene=ENSGALG00000013995|UniProtKB=F1P447
CHICK|ENTREZ=378784|UniProtKB=Q75XU5
Of course, we want the ids as separate columns: stringr does regular expressions in a vectorised manner. str_match gets grouped matches into arrays. (Yes, it can be done with a single regexp, but I don’t take pleasure in that kind of thing.)
panther$ensembl.id <- str_match(panther$gene, "Gene=(.*)\\|")[,2] panther$uniprot.id <- str_match(panther$gene, "UniProtKB=(.*)$")[,2] panther$entrez.id <- str_match(panther$gene, "ENTREZ=(.*)\\|")[,2]
The gene ontology fields are terms from each ontology, stringed together and separated by semicolons. This is not the best format for querying, so I’ll make a list of GO ids from each ontology:
parse.go <- function(go.column) {
go.list <- str_match_all(go.column, "GO:([0-9]*)")
names(go.list) <- panther$gene.id.string
go.list <- llply(go.list, function(x) {if (!is.null(dim(x))) x[,1]})
return(go.list)
}
go.mf <- parse.go(panther$go.mf)
go.bp <- parse.go(panther$go.bp)
go.cc <- parse.go(panther$go.cc)
Finally, the pathway column. It says this in the readme:
***Example Pathway:
Inflammation mediated by chemokine and cytokine signaling pathway#Inflammation mediated by chemokine and cytokine signaling pathway#P00031>Integrin#Integrin#P00853;Integrin signalling pathway#Integrin signalling pathway#P00034>Integrin alpha#Integrin alpha#P00941The format of the pathway information is: pathway_long_name#pathway_short_name#pathway_accession>
component_long_name#component_short_name#component_accessionExplanation of pathway accessions:
Gxxxxx Gene or RNA
Pxxxxx Protein
Sxxxxx small molecules
Uxxxxx others, such as ”unknown”, etc.
For now, I’d like to keep just the pathway id, which is the first id of each pathway entry. Again, there are often multiple entries for the same gene.
pathway.list <- str_match_all(panther$panther.pathway,
"#([G,P,S,U][0-9]*)>")
names(pathway.list) <- panther$gene.id.string
pathway.list <- llply(pathway.list,
function(x) {if (!is.null(dim(x))) x[,2]})
We might want to have these additional descriptions (names of GO terms, name of pathway) later, but for now, I’ll be satisfied with the unique ids. Let’s package everything in a list, and slap a class attribute on it for good measure.
panther.classification <- list(data=panther,
go.mf=go.mf,
go.bp=go.bp,
go.cc=go.cc,
panther.pathway=pathway.list)
class(panther.classification) <- "panther.classification"
Now we can get the pathway ids associated with our favourite gene. Take for example GNB3 which has Uniprot id E1C9D6.
get.pathways <- function(panther, acc) {
ix.uni <- which(panther$data$uniprot.id == acc)
ix.ens <- which(panther$data$ensembl.id == acc)
ix.ent <- which(panther$data$entrez.id == acc)
if (length(ix.uni) > 0) {
ix <- ix.uni
} else if (length(ix.ens > 0)) {
ix <- ix.ens
} else if (length(ix.ent > 0)){
ix <- ix.ent
}
panther$panther.pathway[[ix]]
}
get.pathways(panther.classification, "E1C9D6")
I'll get back to this to maybe do something more useful with it.
Weird juxtapositions happen when you import Wikipedia
The network is available on IntegromeDB public database (http://integromedb.org) under the present manuscript title.
So I went there:
Apparently, typing in journal article titles was not what the search field was for. Couldn’t find the network either, but the article is still in provisional pdf form so that may be the reason.
Dragana Stanley, Nathan S Watson-Haigh, Christopher JE Cowled, Robert J Moore. (2013) Genetic architecture of gene expression in the chicken. BMC Genomics 14
More Haskell fun: the regress of a function
I thought this was pretty funny. Let’s follow the development of one part of my toy bootstrap script. (You have to keep in mind that I’m playing around with functional programming for the first time and that I’m completely utterly horrible at this!) So far, using pseudorandom draws with replacement from a list of data, my script will produce bootstrap replicates of that list. This is the part that, after the random draws have been partitioned into lists of sufficient length goes on to pull out the right elements of the data. The first iteration (heh) goes like this with explicit recursion:
applyShuffle x shuffle =
if shuffle == [] then
[]
else
[x !! head shuffle] ++ applyShuffle x (tail shuffle)
applyShuffleOrder x shuffleOrder =
if shuffleOrder == [] then
[]
else
[applyShuffle x (head shuffleOrder)] ++
applyShuffleOrder x (tail shuffleOrder)
So, a ”shuffle” in the above is a list of indices of elements of data that form a bootstrap replicate. A ”shuffle order” is a list of such lists. Next step, why not try some pattern matching? Also, the (x:xs) notation for lists makes it look a little more Haskell-y:
applyShuffle x [] = []
applyShuffle x (index:indices) =
[x !! index] ++ applyShuffle x indices
applyShuffleOrder x [] = []
applyShuffleOrder x (shuffle:shuffles) =
[applyShuffle x shuffle] ++
applyShuffleOrder x shuffles
Enter the map function! Map really just applies a function recursively to a list. (For friends of R: it’s like llply or lapply.) It’s still recursion, but it kind of reads like what I would say if I was to tell somebody what the function does: ”map the applyShuffle function to the shuffles we’ve already generated”. This is the second function again, but with map:
applyShuffleOrder x shuffles = map f shuffles where f = applyShuffle x
At this point you might realise that the functions can be easily combined into something much shorter. That is, you dear reader might have realised that before, but for me it came at this point. I also decided to stop talking about shuffle orders, and simply call them shuffles. At the end, this is all that remains of the above functions.
applyShuffles x shuffles =
map applyShuffle shuffles
where applyShuffle = map (x !!)
This is your brain on Haskell
For the last week or so I’ve been playing a little with Haskell, which seems to be great fun. At pretty low intensity, that is, in the evenings. I’ve never done anything with functional programming before, though I’ve taken courses that involved a little recursion, a bit of algorithms etc, so everything is not completely foreign to me. Well, it feels very foreign, a bit like trying to read French (I don’t read French).
Anyway, I felt like it was time to write a little program that actually does something, so I’ll try to make a script for bootstrapping. Even though bootstrapping is not the best thing in the world, a bootstrap comparison between the means of two groups seems a nice task to try: it is not completely trivial, but will make me try useful things like pseudorandom numbers and reading a csv file.
Of course, I haven’t got that far yet. Here is the script on github. (Yes, I’m also trying to get used to github.) When trying to do anything I quickly find myself thinking in terms of procedures, states and side-effects, or otherwise thinking wrong. On the other hand, yesterday evening just before sleep, I had one of those code epiphanies. ”Of course, I just need to do that . . . ” So, while the half-baked script is not that impressive, the process of writing it is probably the most fun I’ve had with a computer since Starcraft II: Wings of Liberty.
Right now, it can make bootstrap replicates of a list of integers, using a sequence of pseudorandom numbers (derived from the seed 42, so not random at all, but I’m sure you can get entropy through some monad later). I think the next thing will something to get data into the script.
I’m just remembering how to do recursion. Like this little thing, that puts data in some given order. It just handles the first element, then calls itself to deal with the rest. I think this is why I like the apply-split-combine approach in R so much, because it really feels like you’re doing almost no work at all.
applyShuffle x shuffle =
if shuffle == [] then
[]
else
[x !! head shuffle] ++ applyShuffle x (tail shuffle)
To be continued, I guess.










