Archive for the ‘computer stuff’ Category
Slides from my R intro seminar
Here are my slides from a short introductory seminar on R (essentially going through part I of the R tutorial) last week. As magic lantern pictures go, they’re hideously ugly, but they were mostly there for future reference. Most of the seminar was spent showing RStudio. This Friday, we’ll practice some uses of qplot and make some linear models.
(I took out the Modern Major-General quote from the presentation, but you can enjoy him here instead. Don’t ask.)
Using R: reading tables that need a little cleaning
Sometimes one needs to read tables that are a bit messy, so that read.table doesn’t immediately recognize the content as numerical. Maybe some weird characters are sprinkled in the table (ever been given a table with significance stars in otherwise numerical columns?). Some search and replace is needed. You can do this by hand, and I know this is a task for which many people would turn to a one-liner of perl or awk, but I like to do my scripting in R.
Again, this is in the ”trivial of you only think it out” category of problems. But it gives a starting point for more subtle changes to text files that one might have to do from time to time.
Assume the data look something like this (I made this up):
id;col1;col2
id1; 0,2;0.55*
id2; -,1;,23
Thankfully, R’s conversion from numbers to text is pretty good. It will understand, for instance, that both .5 and 0.5 are numbers. If the problem was just an odd decimal separator, like the Swedish decimal comma, we could just change the dec parameter to read.table. In the above, we have mixed decimal separators. Let’s start out by reading the table as character.
character.data <- read.table("some_data.csv", sep=";",
colClasses="character", header=T)
row.names(character.data) <- character.data[,1]
character.data <- character.data[,-1]
This also stores away the sample ids in row names, and removes the first column with the ids. This is optional, but for this example I assume that all columns should be numeric. If that is not the case, the code below can of course be restricted to the numeric columns, and the character (or factor) columns can be added later.
A data.frame is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors. After removing characters that aren’t numbers, decimal separators, or the minus sign, we change the decimal separator, and convert the vector to numeric. Finally, we make the list a data.frame, and propagate the row names.
library(stringr)
library(plyr)
data <- data.frame(llply(character.data, function(x) {
x <- str_replace_all(x, pattern="[^0-9\\.\\,-]", replacement="")
x <- str_replace(x, pattern="\\,", replacement=".")
return(as.numeric(x))
}))
row.names(data) <- row.names(character.data)
Using R: accessing PANTHER classifications
Importing, subsetting, merging and exporting various text files with annotation (in the wide sense, i.e. anything that might help when interpreting your experiment) is not computation and it’s not biology either, but it’s housekeeping that needs to be done. Everyone has a weapon of choice for general-purpose scripting and mine is R. Yes, this is ”quite trivial if you only think it out”, but it still takes me some time. This script is a work in progress and could certainly be much cleaner and friendlier, but I’ll post it for the benefit of any other user that might google for this kind of thing.
Panther is a protein database with phylogenetic trees of proteins annotated with Gene Ontology terms and organised into pathways. (See the paper in the 2013 database issue of NAR.) Right now, I’m after pathway classification of chicken proteins. The pathways are also available in some xml formats for systems biologists, but I’m going to use the classification table. It contains all UniProtKB proteins, so it should cover most known genes products.
Note, if you just want to check up on a few genes or a few pathways the Panther web interface seems pretty nice. If you’re after nice pathway diagrams, also check out the website and the SBML format. But for accessing classifications en masse, a treatment in R may be useful.
For this post, I’ve broken down the script into parts. If you want a function for the whole thing, see github.
First, download the classification data for your favourite organism from the Panther ftp.
panther <- read.delim(filename, sep="\t", head=F,
stringsAsFactors=F)
colnames(panther)[1] <- "gene.id"
colnames(panther)[3:5] <- c("panther.id", "panther.family",
"panther.subfamily")
colnames(panther)[6:8] <- c("go.mf", "go.bp", "go.cc")
colnames(panther)[9:10] <- c("panther.ontology", "panther.pathway")
This is a tab separated text file. Since some fields at the end of lines may be left empty, we use read.delim instead of read.table. I add som column names that I think agrees reasonably well with the readme. The first is a gene id string that contains mapping to Ensembl or Entrez ids, as well as the uniprot accession. It looks something like this:
CHICK|Gene=ENSGALG00000013995|UniProtKB=F1P447
CHICK|ENTREZ=378784|UniProtKB=Q75XU5
Of course, we want the ids as separate columns: stringr does regular expressions in a vectorised manner. str_match gets grouped matches into arrays. (Yes, it can be done with a single regexp, but I don’t take pleasure in that kind of thing.)
panther$ensembl.id <- str_match(panther$gene, "Gene=(.*)\\|")[,2] panther$uniprot.id <- str_match(panther$gene, "UniProtKB=(.*)$")[,2] panther$entrez.id <- str_match(panther$gene, "ENTREZ=(.*)\\|")[,2]
The gene ontology fields are terms from each ontology, stringed together and separated by semicolons. This is not the best format for querying, so I’ll make a list of GO ids from each ontology:
parse.go <- function(go.column) {
go.list <- str_match_all(go.column, "GO:([0-9]*)")
names(go.list) <- panther$gene.id.string
go.list <- llply(go.list, function(x) {if (!is.null(dim(x))) x[,1]})
return(go.list)
}
go.mf <- parse.go(panther$go.mf)
go.bp <- parse.go(panther$go.bp)
go.cc <- parse.go(panther$go.cc)
Finally, the pathway column. It says this in the readme:
***Example Pathway:
Inflammation mediated by chemokine and cytokine signaling pathway#Inflammation mediated by chemokine and cytokine signaling pathway#P00031>Integrin#Integrin#P00853;Integrin signalling pathway#Integrin signalling pathway#P00034>Integrin alpha#Integrin alpha#P00941The format of the pathway information is: pathway_long_name#pathway_short_name#pathway_accession>
component_long_name#component_short_name#component_accessionExplanation of pathway accessions:
Gxxxxx Gene or RNA
Pxxxxx Protein
Sxxxxx small molecules
Uxxxxx others, such as ”unknown”, etc.
For now, I’d like to keep just the pathway id, which is the first id of each pathway entry. Again, there are often multiple entries for the same gene.
pathway.list <- str_match_all(panther$panther.pathway,
"#([G,P,S,U][0-9]*)>")
names(pathway.list) <- panther$gene.id.string
pathway.list <- llply(pathway.list,
function(x) {if (!is.null(dim(x))) x[,2]})
We might want to have these additional descriptions (names of GO terms, name of pathway) later, but for now, I’ll be satisfied with the unique ids. Let’s package everything in a list, and slap a class attribute on it for good measure.
panther.classification <- list(data=panther,
go.mf=go.mf,
go.bp=go.bp,
go.cc=go.cc,
panther.pathway=pathway.list)
class(panther.classification) <- "panther.classification"
Now we can get the pathway ids associated with our favourite gene. Take for example GNB3 which has Uniprot id E1C9D6.
get.pathways <- function(panther, acc) {
ix.uni <- which(panther$data$uniprot.id == acc)
ix.ens <- which(panther$data$ensembl.id == acc)
ix.ent <- which(panther$data$entrez.id == acc)
if (length(ix.uni) > 0) {
ix <- ix.uni
} else if (length(ix.ens > 0)) {
ix <- ix.ens
} else if (length(ix.ent > 0)){
ix <- ix.ent
}
panther$panther.pathway[[ix]]
}
get.pathways(panther.classification, "E1C9D6")
I'll get back to this to maybe do something more useful with it.
More Haskell fun: the regress of a function
I thought this was pretty funny. Let’s follow the development of one part of my toy bootstrap script. (You have to keep in mind that I’m playing around with functional programming for the first time and that I’m completely utterly horrible at this!) So far, using pseudorandom draws with replacement from a list of data, my script will produce bootstrap replicates of that list. This is the part that, after the random draws have been partitioned into lists of sufficient length goes on to pull out the right elements of the data. The first iteration (heh) goes like this with explicit recursion:
applyShuffle x shuffle =
if shuffle == [] then
[]
else
[x !! head shuffle] ++ applyShuffle x (tail shuffle)
applyShuffleOrder x shuffleOrder =
if shuffleOrder == [] then
[]
else
[applyShuffle x (head shuffleOrder)] ++
applyShuffleOrder x (tail shuffleOrder)
So, a ”shuffle” in the above is a list of indices of elements of data that form a bootstrap replicate. A ”shuffle order” is a list of such lists. Next step, why not try some pattern matching? Also, the (x:xs) notation for lists makes it look a little more Haskell-y:
applyShuffle x [] = []
applyShuffle x (index:indices) =
[x !! index] ++ applyShuffle x indices
applyShuffleOrder x [] = []
applyShuffleOrder x (shuffle:shuffles) =
[applyShuffle x shuffle] ++
applyShuffleOrder x shuffles
Enter the map function! Map really just applies a function recursively to a list. (For friends of R: it’s like llply or lapply.) It’s still recursion, but it kind of reads like what I would say if I was to tell somebody what the function does: ”map the applyShuffle function to the shuffles we’ve already generated”. This is the second function again, but with map:
applyShuffleOrder x shuffles = map f shuffles where f = applyShuffle x
At this point you might realise that the functions can be easily combined into something much shorter. That is, you dear reader might have realised that before, but for me it came at this point. I also decided to stop talking about shuffle orders, and simply call them shuffles. At the end, this is all that remains of the above functions.
applyShuffles x shuffles =
map applyShuffle shuffles
where applyShuffle = map (x !!)
This is your brain on Haskell
For the last week or so I’ve been playing a little with Haskell, which seems to be great fun. At pretty low intensity, that is, in the evenings. I’ve never done anything with functional programming before, though I’ve taken courses that involved a little recursion, a bit of algorithms etc, so everything is not completely foreign to me. Well, it feels very foreign, a bit like trying to read French (I don’t read French).
Anyway, I felt like it was time to write a little program that actually does something, so I’ll try to make a script for bootstrapping. Even though bootstrapping is not the best thing in the world, a bootstrap comparison between the means of two groups seems a nice task to try: it is not completely trivial, but will make me try useful things like pseudorandom numbers and reading a csv file.
Of course, I haven’t got that far yet. Here is the script on github. (Yes, I’m also trying to get used to github.) When trying to do anything I quickly find myself thinking in terms of procedures, states and side-effects, or otherwise thinking wrong. On the other hand, yesterday evening just before sleep, I had one of those code epiphanies. ”Of course, I just need to do that . . . ” So, while the half-baked script is not that impressive, the process of writing it is probably the most fun I’ve had with a computer since Starcraft II: Wings of Liberty.
Right now, it can make bootstrap replicates of a list of integers, using a sequence of pseudorandom numbers (derived from the seed 42, so not random at all, but I’m sure you can get entropy through some monad later). I think the next thing will something to get data into the script.
I’m just remembering how to do recursion. Like this little thing, that puts data in some given order. It just handles the first element, then calls itself to deal with the rest. I think this is why I like the apply-split-combine approach in R so much, because it really feels like you’re doing almost no work at all.
applyShuffle x shuffle =
if shuffle == [] then
[]
else
[x !! head shuffle] ++ applyShuffle x (tail shuffle)
To be continued, I guess.