There is grandeur in this view of life

martins bioblogg

Archive for the ‘computer stuff’ Category

Finding the distance from ChIP signals to genes

with one comment

I’ve had a couple of months off from blogging. Time for some computer-assisted biology! Robert Griffin asks on Stack Exchange about finding the distance between HP1 binding sites and genes in Drosophila melanogaster.  We can get a rough idea with some public chromatin immunoprecipitation data, R and the wonderful BEDTools.

Finding some binding sites

There are indeed some ChIP-seq datasets on HP1 available. I looked up these ones from modENCODE: modENCODE_3391 and modENCODE_3392, using two different antibodies for Hp1b in 16-24 h old embroys. I’m not sure since the modENCODE site doesn’t seem to link datasets to publications, but I think this is the paper where the results are reported: A cis-regulatory map of the Drosophila genome (Nègre & al 2011).

What they’ve done, in short, is cross-linking with formaldehyde, sonicate DNA into fragments, capture fragments with either of the two antibodies and sequence those fragments. They aligned reads with Eland (Illumina’s old proprietary aligner) and called peaks (i.e. regions where there is a lot of reads, which should reflect regions bound by Hp1b) with MACS. We can download their peaks in general feature format.

I don’t know whether there is any way to make completely computation predictions of Hp1 binding sites but I doubt it.

Some data cleaning

The files are available from ftp, and for the below analysis I’ve unzipped them and called them modENCODE_3391.gff3 and modENCODE_3392.gff3. GFF is one of all those tab separated text files that people use for genomic coordinates. If you do any bioinformatics type work you will have to convert back and forth between them and I suggest bookmarking the UCSC Genome Browser Format FAQ.

Even when we trust in their analysis, some processing of files is always required. In this case, MACS sometimes outputs peaks with negative start coordinates in the beginning of a chromosome. BEDTools will have none of that, because ”malformed GFF entry at line … Start was greater than end”. In this case, it happens only at a few lines, and I decided to set those start coordinates to 1 instead.

We need a small script to solve that. As I’ve written before, any language will do, but I like R and tend to do my utility scripting in R (and bash). If the files were incredibly huge and didn’t fit in memory, we’d have to work through the files line by line or chunk by chunk. But in this case we can just read everything at once and operate on it with vectorised R commands, and then write the table again.

modENCODE_3391 <- read.table("modENCODE_3391.gff3", stringsAsFactors=F, sep="\t")
modENCODE_3392 <- read.table("modENCODE_3392.gff3", stringsAsFactors=F, sep="\t")

fix.coord <- function(gff) {
  gff$V4[which(gff$V4 < 1)] <- 1

write.gff <- function(gff, file) {
  write.table(gff, file=file, row.names=F, col.names=F,
              quote=F, sep="\t")

write.gff(fix.coord(modENCODE_3391), file="cleaned_3391.gff3")
write.gff(fix.coord(modENCODE_3392), file="cleaned_3392.gff3")

Flybase transcripts

To find the distance to genes, we need to know where the genes are. The best source is probably the annotation made by Flybase, which I downloaded from the Ensembl ftp in General transfer format (GTF, which is close enough to GFF that we don't have to care about the differences right now).

This file contains a lot of different features. We extract the transcripts and find where the transcript model starts, taking into account whether the transcript is in the forward or reverse direction (this information is stored in columns 4, 5 and 7 of the GTF file). We store this in a new GTF file of transcript start positions, which is the one we will feed to BEDTools:

ensembl <- read.table("Drosophila_melanogaster.BDGP5.75.gtf",
                      stringsAsFactors=F, sep="\t")

transcript <- subset(ensembl, V3=="transcript")
transcript.start <- transcript
transcript.start$V3 <- "transcript_start"
transcript.start$V4 <- transcript.start$V5 <- ifelse(transcript.start$V7 == "+",
                                                     transcript$V4, transcript$V5)

write.gff(transcript.start, file="ensembl_transcript_start.gtf")

Finding distance with BEDTools

Time to find the closest feature to each transcript start! You could do this in R with GenomicRanges, but I like BEDTools. It's a command line tool, and if you haven't already you will need to download and compile it, which I recall being painless.

bedtools closest is the command that finds, for each feature in one file, the closest feature in the other file. The -a and -b flags tells BEDTools which files to operate on, and the -d flag that we also want it to output the distance. BEDTools writes output to standard out, so we use ">" to capture it in a text file.

Here is the bash script. I put the above R code in clean_files.R and added it as an Rscript line at the beginning, so I could run it all with one file.

Rscript clean_files.R

bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3391.gff3 \
    > closest_element_3391.txt
bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3392.gff3 \
    > closest_element_3392.txt

Some results

With the resulting file we can go back to R and ggplot2 and draw cute graphs like this, which shows the distribution of distances from transcript to Hp1b peak for protein coding and noncoding transcripts separately. Note the different y-scales (there are way more protein coding genes in the annotation) and the 10-logarithm plus one transformation on the x-axis. The plus one is to show the zeroes; BEDTools returns a distance of 0 for transcripts that overlap a Hp1b site.

closest_3391 <- read.table("~/blogg/dmel_hp1/closest_element_3391.txt", header=F, sep="\t")

qplot(x=log10(V19 + 1), data=subset(closest_3391, V2 %in% c("protein_coding", "ncRNA"))) +
  facet_wrap(~V2, scale="free_y")


Or by merging the datasets from different antibodies, we can draw this strange beauty, which pretty much tells us that the antibodies do not give the same result in terms of the closest feature. To figure out how they differ, one would have to look more closely into the genomic distribution of the peaks.

closest_3392 <- read.table("~/blogg/dmel_hp1/closest_element_3392.txt", header=F, sep="\t")

combined <- merge(closest_3391, closest_3392,
                  by.x=c("V1", "V2","V4", "V5", "V9"),
                  by.y=c("V1", "V2","V4", "V5", "V9"))

qplot(x=log10(V19.x+1), y=log10(V19.y+1), data=combined)


(If you're wondering about the points that end up below 0, those are transcripts where there are no peaks called on that chromosome in one of the datasets. BEDTools returns -1 for those that lack matching features on the same chromosome and R will helpfully transform them to -Inf.)

About the DGRP

The question mentioned the DGRP. I don't know that anyone has looked at ChIP in the DGRP lines, but wouldn't that be fun? Quantitative genetics of DNA binding protein variation in DGRP and integration with eQTL ... What one could do already, though, is take the interesting sites of Hp1 binding and overlap them with the genetic variants of the DGRP lines. I don't know if that would tell you much -- does anyone know what kind of variant would affect Hp1 binding?

Happy hacking!

Written by mrtnj

4 juli, 2014 at 20:08

More fun with %.% and %>%

with 6 comments

The %.% operator in dplyr allows one to put functions together without lots of nested parentheses. The flanking percent signs are R’s way of denoting infix operators; you might have used %in% which corresponds to the match function or %*% which is matrix multiplication. The %.% operator is also called chain, and what it does is rearrange the call to pass its left hand side on as a parameter to the right hand side function. As noted in the documentation this makes function calls read from left to right instead of inside and out. Yesterday we we took a simulated data frame, called data, and calculated some summary statistics. We could put the entire script together with %.%:

data %.%
    melt(id.vars=c("treatment", "sex")) %.%
    group_by(sex, treatment, variable) %.%

I haven't figured out what would be the best indentation here, but I think this looks pretty okay. Of course it works for non-dplyr functions as well, but they need to take the input data as their first argument.

data %.% lm(formula=response1 ~ factor(sex)) %.% summary()

As mentioned, dplyr is not the only package that has something like this, and according to a comment from Hadley Wickham, future dplyr will use the magrittr package instead, a package that adds piping to R. So let's look at magrittr! The magrittr %>% operator works much the same way, except it allows one to put "." where the data is supposed to go. This means that the data doesn't have to be the first argument to the function. For example, we can do this, which would give an error with dplyr:

data %>% lm(response1 ~ factor(sex), .) %>% summary()

Moreover, Conrad Rudolph has used the operators %.%, %|>% and %|% in his own package for functional composition, chaining and piping. And I'm sure he is not the only one; there are several more packages that bring more new ways to define and combine functions into R. I hope I will revisit this topic when I've gotten used to it and decided what I like and don't like. This might be confusing for a while with similar and rather cryptic operators that do slightly different things, but I'm sure it will turn out to be a useful development.

Written by mrtnj

27 mars, 2014 at 21:42

Publicerat i computer stuff, data analysis, english

Tagged with ,

Using R: quickly calculating summary statistics (with dplyr)

with 2 comments

I know I’m on about Hadley Wickham‘s packages a lot. I’m not the president of his fanclub, but if there is one I’d certainly like to be a member. dplyr is going to be a new and improved ddply: a package that applies functions to, and does other things to, data frames. It is also faster and will work with other ways of storing data, such as R’s relational database connectors. I use plyr all the time, and obviously I want to start playing with dplyr, so I’m going to repeat yesterday’s little exercise with dplyr. Readers should be warned: this is really just me playing with dplyr, so the example will not be particularly profound. The post at the Rstudio blog that I just linked contains much more information.

So, here comes the code to do the thing we did yesterday but with dplyr:

## The code for the toy data is exactly the same
data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

## reshape2 still does its thing:
melted <- melt(data, id.vars=c("sex", "treatment"))

## This part is new:
grouped <- group_by(melted, sex, treatment)
summarise(grouped, mean=mean(value), sd=sd(value))

When we used plyr yesterday all was done with one function call. Today it is two: dplyr has a separate function for splitting the data frame into groups. It is called group_by and returns the grouped data. Note that no quotation marks or concatenation were used when passing the column names. This is what it looks like if we print it:

Source: local data frame [4,000 x 4]
Groups: sex, treatment, variable

   sex treatment  variable       value
1    1         1 response1 -0.15668214
2    1         2 response1 -0.40934759
3    1         1 response1  0.07103731
4    1         2 response1  0.15113270
5    1         1 response1  0.30836910
6    1         2 response1 -1.41891407
7    1         1 response1 -0.07390246
8    1         2 response1 -1.34509686
9    1         1 response1  1.97215697
10   1         2 response1 -0.08145883

The grouped data is still a data frame, but it contains a bunch of attributes that contain information about grouping.

The next function is a call to the summarise function. This is a new version of a summarise function similar to one in plyr. It will summarise the grouped data in columns given by the expressions you feed it. Here, we calculate mean and standard deviation of the values.

Source: local data frame [8 x 5]
Groups: sex, treatment

  sex treatment  variable         mean        sd
1   1         1 response1  0.021856280 1.0124371
2   1         1 response2  0.045928150 1.0151670
3   1         2 response1 -0.065017971 0.9825428
4   1         2 response2  0.011512867 0.9463053
5   2         1 response1 -0.005374208 1.0095468
6   2         1 response2 -0.051699624 1.0154782
7   2         2 response1  0.046622111 0.9848043
8   2         2 response2 -0.055257295 1.0134786

Maybe the new syntax is slightly clearer. Of course, there are alternative ways of expressing it, one of which is pretty interesting. Here are two equivalent versions of the dplyr calls:

summarise(group_by(melted, sex, treatment, variable),
          mean=mean(value), sd=sd(value))

melted %.% group_by(sex, treatment, variable) %.%
       summarise(mean=mean(value), sd=sd(value))

The first one is nothing special: we've just put the group_by call into summarise. The second version, though, is a strange creature. dplyr uses the operator %.% to denote taking what is on the left and putting it into the function on the right. Reading from the beginning of the expression we take the data (melted), push it through group_by and pass it to summarise. The other arguments to the functions are given as usual. This may seem very alien if you're used to R syntax, or you might recognize it from shell pipes. This is not the only attempt make R code less nested and full of parentheses. There doesn't seem to be any consensus yet, but I'm looking forward to a future where we can write points-free R.

Written by mrtnj

26 mars, 2014 at 23:38

Publicerat i computer stuff, data analysis, english

Tagged with ,

Using R: quickly calculating summary statistics from a data frame

with 8 comments

A colleague asked: I have a lot of data in a table and I’d like to pull out some summary statistics for different subgroups. Can R do this for me quickly?

Yes, there are several pretty convenient ways. I wrote about this in the recent post on the barplot, but as this is an important part of quickly getting something useful out of R, just like importing data, I’ll break it out into a post of its own. I will present a solution that uses the plyr and reshape2 packages. You can do the same with base R, and there’s nothing wrong with base R, but I find that plyr and reshape2 makes things convenient and easy to remember. The apply family of functions in base R does the same job as plyr, but with a slightly different interface. I strongly recommend beginners to begin with plyr or the apply functions, and not what I did initially, which was nested for loops and hard bracket indexing.

We’ll go through and see what the different parts do. First, simulate some data. Again, when you do this, you usually have a table already, and you can ignore the simulation code. Usually a well formed data frame will look something this: a table where each observation is a unit such as an individual, and each column gives the data about the individual. Here, we imagine two binary predictors (sex and treatment) and two continuous response variables.

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))
  sex treatment   response1   response2
1   1         1 -0.15668214 -0.13663012
2   1         2 -0.40934759 -0.07220426
3   1         1  0.07103731 -2.60549018
4   1         2  0.15113270  1.81803178
5   1         1  0.30836910  0.32596016
6   1         2 -1.41891407  1.12561812

Now, calculating a function of the response in some group is straightforward. Most R functions are vectorised by default and will accept a vector (that is, a column of a data frame). The subset function lets us pull out rows from the data frame based on a logical expression using the column names. Say that we want mean, standard deviation and a simple standard error of the mean. I will assume that we have no missing values. If you have, you can add na.rm=T to the function calls. And again, if you've got a more sophisticated model, these might not be the standard errors you want. Then pull them from the fitted model instead.

mean(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)$response1)/
  sqrt(nrow(subset(data, sex == 1 & treatment == 1)))

Okay, but doing this for each combination of the predictors and responses is no fun and requires a lot of copying and pasting. Also, the above function calls are pretty messy with lots of repetition. There is a better way, and that's where plyr and reshape2 come in. We load the packages. The first time you'll have to run install.packages, as usual.


First out, the melt function from rehape2. Look at the table above. It's reasonable in many situations, but right now, it would be better if we put both the response variables in the same column. If it doesn't seem so useful, trust me and see below. Melt will take all the columns except the ones we single out as id variables and put them in the same column. It makes sense to label each row with the sex and treatment of the individual. If we had an actual unit id column, it would go here as well:

melted <- melt(data, id.vars=c("sex", "treatment"))

The resulting "melted" table looks like this. Instead of the response variables separately we get a column of values and a column indicating which variable the value comes from.

  sex treatment  variable       value
1   1         1 response1 -0.15668214
2   1         2 response1 -0.40934759
3   1         1 response1  0.07103731
4   1         2 response1  0.15113270
5   1         1 response1  0.30836910
6   1         2 response1 -1.41891407

Now it's time to calculate the summary statistics again. We will use the same functions as above to do the actual calculations, but we'll use plyr to automatically apply them to all the subsets we're interested in. This is sometimes called the split-apply-combine approach: plyr will split the data frame into subsets, apply the function of our choice, and then collect the results for us. The first thing to notice is the function name. All the main plyr functions are called something with -ply. The letters stand for the input and return data type: ddply works on a data frame and returns a data frame. It's probably the most important member of the family.

The arguments to ddply are the data frame to work on (melted), a vector of the column names to split on, and a function. The arguments after the function name are passed on to the function. Here we want to split in subsets for each sex, treatment and response variable. The function we apply is summarise, which makes a new data frame with named columns based on formulas, allowing us to use the column names of the input data frame in formulas. In effect it does exactly what the name says, summarises a data frame. And in this instance, we want to calculate the mean, standard deviation and standard error of the mean, so we use the above function calls, using value as the input. Run the ddply call, and we're done!

ddply(melted, c("sex", "treatment", "variable"), summarise,
      mean = mean(value), sd = sd(value),
      sem = sd(value)/sqrt(length(value)))
  sex treatment  variable         mean        sd        sem
1   1         1 response1  0.021856280 1.0124371 0.04527757
2   1         1 response2  0.045928150 1.0151670 0.04539965
3   1         2 response1 -0.065017971 0.9825428 0.04394065
4   1         2 response2  0.011512867 0.9463053 0.04232006
5   2         1 response1 -0.005374208 1.0095468 0.04514830
6   2         1 response2 -0.051699624 1.0154782 0.04541357
7   2         2 response1  0.046622111 0.9848043 0.04404179
8   2         2 response2 -0.055257295 1.0134786 0.04532414

Written by mrtnj

25 mars, 2014 at 23:32

Publicerat i computer stuff, data analysis, english

Tagged with , , ,

Using R: barplot with ggplot2

with one comment

Ah, the barplot. Loved by some, hated by some, the first graph you’re likely to make in your favourite office spreadsheet software, but a rather tricky one to pull off in R. Or, that depends. If you just need a barplot that displays the value of each data point as a bar — which is one situation where I like a good barplot — the barplot( ) function does just that: <- rnorm(10, 4, 1.5)
names( <- 1:10


Done? Not really. The barplot (I know some people might not use the word plot for this type of diagram, but I will) one typically sees from a spreadsheet program has some gilding: it's easy to get several variables ("series") of data in the same plot, and often you'd like to see error bars. All this is very possible in R, either with base graphics, lattice or ggplot2, but it requires a little more work. As usual when it gets a bit more fancy, I prefer ggplot2 over the alternatives. Once upon a time when I started with ggplot2, I tried googling for this, and lots of people have answered this question. I was still confused, though. So, if you're a new user and reading this, please bear with me and I'll try to demonstrate what all the steps are good for. Whether it's a good statistical graph or not, the barplot is actually a nice example of ggplot2 in action and will demonstrate some R principles.

Let us take an example: Say that we start with a pretty typical small dataset with two variables that we've measured in four groups. Now we'd like a barplot of the group means and error bars for the means.

0. Start a script

Making the plot will take more than a couple of lines, so it's a good idea to put everything in a script. Below I will split the script into chunks, but the whole thing is on github. We make a new R file and load ggplot2, plyr and reshape2, the packages we will need:


1. Simulate some data

In the case of real barplot this is where you load your data. You will probably have it in a text file that you read with the read.table( ) family of functions or RStudios Import dataset button (which makes the read.table call for you; if you don't feel like late nights hunched over the read.table manual page, I recommend it). Simulating data might look something like this:

n <- 10
group <- rep(1:4, n)
mass.means <- c(10, 20, 15, 30)
mass.sigma <- 4
score.means <- c(5, 5, 7, 4)
score.sigma <- 3
mass <- as.vector(model.matrix(~0+factor(group)) %*% mass.means) +
  rnorm(n*4, 0, mass.sigma)
score <- as.vector(model.matrix(~0+factor(group)) %*% score.means) +
  rnorm(n*4, 0, score.sigma)
data <- data.frame(id = 1:(n*4), group, mass, score)

This code is not the tersest possible, but still a bit tricky to read. If you only care about the barplot, skip over this part. We define the number of individuals per group (10), create a predictor variable (group), set the true mean and standard deviation of each variable in each group and generate values from them. The values are drawn from a normal distribution with the given mean and standard deviation. The model.matrix( ) function returns a design matrix, what is usually called X in a linear model. The %*% operator is R's way of denoting matrix multiplication -- to match the correct mean with the predictor, we multiply the design matrix by the vector of means. Now that we've got a data frame, we pretend that we don't know the actual values set above.

  id group       mass    score
1  1     1  4.2367813 5.492707
2  2     2 16.4357254 1.019964
3  3     3 19.2491831 6.936894
4  4     4 23.4757636 3.845321
5  5     1  0.9533737 1.852927
6  6     2 19.9142350 5.567024

2. Calculate means

The secret to a good plot in ggplot2 is often to start by rearranging the data. Once the data is in the right format, mapping the columns of the data frame to the right element of the plot is the easy part. In this case, what we want to plot is not the actual data points, but a function of them -- the group means. We could of course subset the data eight times (four groups times two variables), but thankfully, plyr can do that for us. Look at this piece of code:

melted <- melt(data, id.vars=c("id", "group"))
means <- ddply(melted, c("group", "variable"), summarise,

First we use reshape2 to melt the data frame from tabular form to long form. The concept is best understood by comparing the output and input of melt( ). Compare the rows above to these rows, which are from the melted data frame:

   id group variable      value
1   1     1     mass  4.2367813
2   2     2     mass 16.4357254
3   3     3     mass 19.2491831
4   4     4     mass 23.4757636

We've gone from storing two values per row (mass and score) to storing one value (mass or score), keeping the identifying variables (id and group) in each row. This might seem tricky (or utterly obvious if you've studied database design), but you'll soon get used to it. Trust me, if you do, it will prove useful!

The second row uses ddply ("apply from data frame to data frame") to split up the melted data by all combinations of group and variable and calculate a function of the value, in this case the mean. The summarise function creates a new data frame from an old; the arguments are the new columns to be calculated. That is, it does exactly what it says, summarises a data frame. If you're curious, try using it directly. It's not very useful on its own, but very good in ddply calls.

3. Barplot of the means

Time to call on ggplot2! One has a choice between using qplot( ) or ggplot( ) to build up a plot, but qplot is the easier. We map the mean to y, the group indicator to x and the variable to the fill of the bar. The bar geometry defaults to counting values to make a histogram, so we need to tell use the y values provided. That's what setting stat= to "identity" is good for. To make the bars stand grouped next to each other instead of stacking, we tell set position=.

means.barplot <- qplot(x=group, y=mean, fill=variable,
                       data=means, geom="bar", stat="identity",


4. Standard error of the mean

Some people can argue for hours about error bars. In some cases you will want other types of error bars. Maybe the inferences come from a hierarchical model where the standard errors are partially pooled. Maybe you're dealing with some type of generalised linear model or a model made with transformed data. See my R tutorial for a simple example with anova. The point is that from the perspective of ggplot2 input to the error bars is data, just like anything else, and we can use the full arsenal of R tools to create them.

means.sem <- ddply(melted, c("group", "variable"), summarise,
                   mean=mean(value), sem=sd(value)/sqrt(length(value)))
means.sem <- transform(means.sem, lower=mean-sem, upper=mean+sem)

First, we add a standard error calculation to the ddply call. The transform function adds colums to a data frame; we use it to calculate the upper and lower limit to the error bars (+/- 1 SEM). Then back to ggplot2! We add a geom_errorbar layer with the addition operator. This reveals some of the underlying non-qplot syntax of ggplot2. The mappings are wrapped in the aes( ), aesthetics, function and the other settings to the layer are regular arguments. The data argument is the data frame with interval limits that we made above. The only part of this I don't like is the position_dodge call. What it does is nudge the error bars to the side so that they line up with the bars. If you know a better way to get this behaviour without setting a constant, please write me a comment!

means.barplot + geom_errorbar(aes(ymax=upper,


Does this seem like a lot of code? If we look at the actual script and disregard the data simulation part, I don't think it's actually that much. And if you make this type of barplot often, you can package this up into a function.

Written by mrtnj

19 mars, 2014 at 22:25

Morning coffee: scripting language

with 3 comments

Several people have asked: what scripting language should biologists learn if they are interested in doing a little larger-scale data analysis and have never programmed before? I’m not an expert, but these are the kinds of things I tend to say:

The language is not so important; the same principles apply everywhere. Use what your friends and colleagues use so you can get help from them. I believe most people would answer Python. I would answer R. Don’t believe people who tell you that R is not a serious language. You’re already familiar with analysing small datasets in a statistics program. You can do that in R too, and then the step to writing code and handling larger projects is actually very short. Your data will very likely come in tables, and R is very good at that. You’ll also want pretty graphs, and R is very good at that too. Regardless, have a look at the other common languages as well. Practice working from a terminal.

Written by mrtnj

13 mars, 2014 at 08:04

Using R: common errors in table import

with 4 comments

I’ve written before about importing tabular text files into R, and here comes some more. This is because I believe (firmly) that importing data is the major challenge for beginners who want to analyse their data in R. What is the most important thing about using any statistics software? To get your data into it in the first place! Unfortunately, no two datasets are the same and many frustrations await the beginner. Therefore, let me share a few things that I’ve picked up while trying to read data into R.

General tip: keep it as simple as possible. Open your favourite spreadsheet software, remove all the colours and formatting, make sure to have only one header row with short and simple column names (avoid any special characters; R will turn them to full stops), remove empty columns above and to the left of the table and export the sheet in to plain text, tab separated or comma separated. Then use either read.table, read.csv or the Import dataset button in RStudio to read your table, and in case of doubt, begin with the default settings, which are often sensible. If you see any annoying data import errors, below are a few possible causes.

If you don’t get the number of columns you expect

Then it’s often a separator problem. read.table allows you to set the filed separator character (with sep=) and the decimal separator (with dec=). To get a tab character, use ”\t”:

data <- read.table(file="data.txt", sep="\t", dec=",")

In many countries this is not an issue, but the Swedish standard is using a comma as decimal separator, while R uses a decimal point. If you export a comma separated value file on a Swedish computer, you are likely to get a numbers with decimal commas and semicolons as field separators. Then what you want is read.csv2( ), which is a read.csv made for semicolon separated files with decimal commas.

As an additional benefit of using it, RStudio solves those issues for you with the Import dataset button. The dialog box that appears when you click it lets you choose separators, header line (yes/no) and whether there are quotes around fields, and shows a preview of what the table will look like. Then it builds the read.table command for you, so that you can copy it into your script. If you're not an RStudio user, just look at the actual file in a text editor, and add dec= and sep= instructions to your read.table as needed.

If you see a lot more columns than you expect, usually after the ones you expect, filled with NAs, it's probably because you've happened to enter some character (very likely a whitespace ...) somewhere to the right in the spreadsheet. When exported, the software will fill in all the empty cells, which R will interpret as missing values. If this happens, make sure to delete the contents of the spreadsheet after the columns you're interested in.

Whitespaces can also be a problem if you've specified a custom separator, like above. Normally, read.table is very good about skipping over whitespace, should you have happened to put an extra whitespace in before that factor level. That functionality is turned off when you specify a separator, though, so you might need to switch it on again by setting strip.white=T in the function call. So, for instance, " male" and "female  " are interpreted as male/female, which is probably what one wants. This cost me several grey hairs and string operations before I came to my senses and read the data import manual.

If columns are not the type you expect

If something is a character when it should be numeric you might see messages such as "'x' must be numeric " or "non-numeric argument to binary operator". If something is a factor when it should be character, some character operations might fail. Regardless, it could still be a separator problem. If R expects to see decimal points and sees a comma, it will make that column a character vector rather then a numeric column. The principle is that read.table it will try to turn the column into numeric or integer (or complex or logical, but they are rarer). Most often, this means that if you feed it numbers it will make them numeric. But if there is anything else in any of the elements, it will assume character -- and by default it will convert character vector into factors.

There are several reasons why you might have text among the numbers, making R not interpret the column as numeric. We've mentioned wrong separators, but sometimes if you paste spreadsheets from different sources together you end up with inconsistent field separators. I've written a post about a way to deal with that in some cases, but you can also search and replace the file with a text editor, or use a command-line tool like sed. Missing values can be another problem. read.table understands that empty fields are missing, but maybe you or your collaborator has used another character to mean missing, like "-" or "?". This is dealt with by specifying your own na.strings:

data <- read.table("data.txt", na.strings=c("NA", "-", "?"))

R's default of interpreting characters as factors is sometimes a good choice, but often what you really want is characters that you can work with and then, if needed, turn into factors. In particular, if your column of sample ids is turned into a factor, comparing and matching ids will sometimes hurt you. To avoid that you can specify stringsAsFactors=F. This works when you make data frames with the data.frame function as well.

data <- read.table("data.txt", stringsAsFactors=F)

Diagnosing problems with your data frame

Look at the dimensions of your data frame and the column of the classes. dim gives dimensions; the class function gives the type of a column and the str function will give you a summary of the structure of the object.

lapply(data, class)

Happy importing!

Written by mrtnj

6 mars, 2014 at 21:31

Publicerat i computer stuff, data analysis, english

Tagged with , ,

Using R: correlation heatmap, take 2

with 3 comments

Apparently, this turned out to be my most popular post ever.  Of course there are lots of things to say about the heatmap (or quilt, tile, guilt plot etc), but what I wrote was literally just a quick celebratory post to commemorate that I’d finally grasped how to combine reshape2 and ggplot2 to quickly make this colourful picture of a correlation matrix.

However, I realised there is one more thing that is really needed, even if just for the first quick plot one makes for oneself: a better scale. The default scale is not the best for correlations, which range from -1 to 1, because it’s hard to tell where zero is. We use the airquality dataset for illustration as it actually has some negative correlations. In ggplot2, it’s very easy to get a scale that has a midpoint and a different colour in each direction. It’s called scale_colour_gradient2, and we just need to add it. I also set the limits to -1 and 1, which doesn’t change the colour but fills out the legend for completeness. Done!

data <- airquality[,1:4]
qplot(x=Var1, y=Var2, data=melt(cor(data, use="p")), fill=value, geom="tile") +
   scale_fill_gradient2(limits=c(-1, 1))


Written by mrtnj

3 mars, 2014 at 19:49

Books and lessons about ggplot2

I recently got an email from a person at Packt publishing, who suggested I write a book for them about ggplot2. My answer, which is perfectly true, is that I don’t have the time, nor the expertise to do that. What I didn’t say is that 1) a quick web search suggests that Packt doesn’t have the best reputation and 2) there are already two books about ggplot2 that I think covers the entire field: the indispensable ggplot2 book, written by Hadley Wickham, the author of the package, and the R Graphics Cookbok by Wincent Chang. There are too many decent but not great R books on the market already and there is no reason for me to spend time to create another one.

However, there are a few things I’d like to tell the novice, in general, about ggplot2:

1. ggplot2 is not necessarily superior to lattice or base graphics. It’s largely a matter of taste, you can make very nice plots with either plotting system and in some situations everybody will turn away from their favourite system and use another. For instance, a lot of built-in diagnostic plots come preprogrammed in base graphics. The great thing about ggplot2 is how it allows the user to think about plots as layers of mappings between variables and geometric objects. Base graphics and lattice are organised in different ways, which is not necessarily worse; it depends on the way you’re accustomed to thinking about statistical graphics.

2. There are two ways to start a plot in ggplot2: qplot( ) or ggplot( ). qplot is the quicker way and probably the one you should learn first. But don’t be afraid of the ggplot function! All it does is set up an empty plot and connect a data frame to it. After that you can add your layers and map your variables almost the same way you would do with qplot. After you’ve become comfortable with qplot, just try building plots with ggplot a few times, and you’ll see how similar it is.

3. The magic is not in plotting the data but in tidying and rearranging the data for plotting. Make sure to put all your labels, indicators and multiple series of data into the same data frame (most of the time: just cbind or merge them together), melt the data frame and pass it to ggplot2. If you want to layer predictions on top of your data, put output from your model in another data frame. It is perfectly possible, often advisable, to write functions that generate ggplot2 plots, but make sure to always create the data frame to be plotted first and then pass it on. I suggest not trying to create the mappings, that is x= and y= and the like, on the fly. There is always the risk of messing up the order of the vectors, and also, because ggplot2 uses metaprogramming techniques for the aesthetics, you might see unexpected behaviours when putting function calls into the mapping.

4. Worry about mapping variables and facetting first and then change the formatting. Because of how plots as well as settings in ggplot2 are objects that you can store and pass around, you can first create a raw version of the plot, and then just add (yes add, with the ”+” operator) the formatting options you want. So taken together, the workflow for making any ggplot2 plot goes something like this: 1) put your data frame in order; 2) set up a basic plot with the qplot or ggplot function; 3) add one extra geom at at time (optional if you make a simple plot with qplot, since qplot sets up the first geometry for you); 4) add the settings needed to make the plot look good.

Happy ggplotting!

Written by mrtnj

20 februari, 2014 at 00:31

Publicerat i computer stuff, data analysis, english

Tagged with ,

Bracketing part of a dna sequence in Javascript

leave a comment »

Sometimes you just need a quick script to make life easier. This is an example of a small task: You have a dna sequence extracted from some bigger corpus, such as a reference genome sequence, and you know that there is something interesting going on 200 bp into the sequence. So you’ve downloaded a fasta file, and you want the computer to count 200 bases of flanking sequence and highlight the interesting thing for you. This problem arises naturally when designing primers, for example. You can use any scripting language you want, but if you want it to be used by a person who does not (yet) use command line based tools, if you’d like it to run on their computer that runs a different operating system than yours does, and if want to make a graphical user interface as quickly as possible — Javascript is actually an excellent weapon of choice.

Therefore, the other day, I wrote this little thing in Javascript. Mind you, it was ages since I last touched Javascript or html, and it’s certainly not a thing of beauty. But it does save time and I think I’ve avoided the most egregious errors. Everything is contained in the html page, so it’s actually a small bioinformatics script that runs in the browser.

Here is the Javascript part: a single procedure that extracts the parameters (the sequence and the number of flanking bases before and after), uses regular expressions to remove any fasta header, whitespaces and newlines, slices the resulting cleaned sequence up in three pieces, and outputs it with some extra numbers.

function bracketSequence (form) {
   var before = form.flank_before.value;
   var after = form.flank_after.value;
   var seq = form.sequence.value;

   // Find and remove fasta header
   fastaHeader = seq.match(/^>.*[\r\n|\n|\r]/, "");
   seq = seq.replace(/(^>.*(\r\n|\n|\r))/, "");

   // Remove whitespace and newlines
   seq = seq.replace(/(\r\n|\n|\r| )/gm, "");

   seqBefore = seq.slice(0, before);
   seqCore = seq.slice(before, seq.length - after);
   seqAfter = seq.slice(seq.length - after, seq.length);
   form.output.value = seqBefore + "[" + seqCore + "]" + seqAfter;

   form.fasta_header.value = fastaHeader;

   document.getElementById("core_length").innerHTML = seqCore.length;
   document.getElementById("before_length").innerHTML = seqBefore.length;
   document.getElementById("after_length").innerHTML = seqAfter.length;

I know, I know -- the regular expressions look like emoticons that suffered through a serious accident. The fasta header expression "^>.*[\r\n|\n|\r]/" matches the beginning of the file up to any linebreak character. The other expression "(\r\n|\n|\r| )" just matches linebreaks and whitespace. This means the fasta header has to start with the ">"; it cannot have any whitespace before. Other than whitespace, strange characters in the sequence should be preserved and counted.

The input and output uses a html form and a couple of named pieces of html ('spans'). Here we put two textboxes to input the numbers, a big textarea for the sequence, and then a textarea, a textbox and some spans for the output. The onClick action of the button runs the procedure.

<form name="form" action="" method="get">

<p>Flanking bases before: <input type="text" name="flank_before"
value="" /></p>

<p>Flanking bases after: <input type="text" name="flank_after"
value="" /></p>

<p>Your sequence:</p>

<textarea name="sequence" rows="5" cols="80">

<p><input type="button" name="button" value="Go"
onClick="bracketSequence(this.form)" /></p>

<p>Output sequence:</p>

<textarea name="output" rows="5" cols="80">

<p>Fasta header: <input type="text" name="fasta_header" value="" /></p>

<p>Length of core sequence: <span id="core_length"></span></p>
<p>Length of flank before: <span id="before_length"></span></p>
<p>Length of flank after: <span id="after_length"></span></p>


Written by mrtnj

21 november, 2013 at 18:03


Få meddelanden om nya inlägg via e-post.

Gör sällskap med 1 135 andra följare