There is grandeur in this view of life

martins bioblogg

Archive for the ‘english’ Category

Finding the distance from ChIP signals to genes

with one comment

I’ve had a couple of months off from blogging. Time for some computer-assisted biology! Robert Griffin asks on Stack Exchange about finding the distance between HP1 binding sites and genes in Drosophila melanogaster.  We can get a rough idea with some public chromatin immunoprecipitation data, R and the wonderful BEDTools.

Finding some binding sites

There are indeed some ChIP-seq datasets on HP1 available. I looked up these ones from modENCODE: modENCODE_3391 and modENCODE_3392, using two different antibodies for Hp1b in 16-24 h old embroys. I’m not sure since the modENCODE site doesn’t seem to link datasets to publications, but I think this is the paper where the results are reported: A cis-regulatory map of the Drosophila genome (Nègre & al 2011).

What they’ve done, in short, is cross-linking with formaldehyde, sonicate DNA into fragments, capture fragments with either of the two antibodies and sequence those fragments. They aligned reads with Eland (Illumina’s old proprietary aligner) and called peaks (i.e. regions where there is a lot of reads, which should reflect regions bound by Hp1b) with MACS. We can download their peaks in general feature format.

I don’t know whether there is any way to make completely computation predictions of Hp1 binding sites but I doubt it.

Some data cleaning

The files are available from ftp, and for the below analysis I’ve unzipped them and called them modENCODE_3391.gff3 and modENCODE_3392.gff3. GFF is one of all those tab separated text files that people use for genomic coordinates. If you do any bioinformatics type work you will have to convert back and forth between them and I suggest bookmarking the UCSC Genome Browser Format FAQ.

Even when we trust in their analysis, some processing of files is always required. In this case, MACS sometimes outputs peaks with negative start coordinates in the beginning of a chromosome. BEDTools will have none of that, because ”malformed GFF entry at line … Start was greater than end”. In this case, it happens only at a few lines, and I decided to set those start coordinates to 1 instead.

We need a small script to solve that. As I’ve written before, any language will do, but I like R and tend to do my utility scripting in R (and bash). If the files were incredibly huge and didn’t fit in memory, we’d have to work through the files line by line or chunk by chunk. But in this case we can just read everything at once and operate on it with vectorised R commands, and then write the table again.

modENCODE_3391 <- read.table("modENCODE_3391.gff3", stringsAsFactors=F, sep="\t")
modENCODE_3392 <- read.table("modENCODE_3392.gff3", stringsAsFactors=F, sep="\t")

fix.coord <- function(gff) {
  gff$V4[which(gff$V4 < 1)] <- 1
  gff
}

write.gff <- function(gff, file) {
  write.table(gff, file=file, row.names=F, col.names=F,
              quote=F, sep="\t")
}

write.gff(fix.coord(modENCODE_3391), file="cleaned_3391.gff3")
write.gff(fix.coord(modENCODE_3392), file="cleaned_3392.gff3")

Flybase transcripts

To find the distance to genes, we need to know where the genes are. The best source is probably the annotation made by Flybase, which I downloaded from the Ensembl ftp in General transfer format (GTF, which is close enough to GFF that we don’t have to care about the differences right now).

This file contains a lot of different features. We extract the transcripts and find where the transcript model starts, taking into account whether the transcript is in the forward or reverse direction (this information is stored in columns 4, 5 and 7 of the GTF file). We store this in a new GTF file of transcript start positions, which is the one we will feed to BEDTools:

ensembl <- read.table("Drosophila_melanogaster.BDGP5.75.gtf",
                      stringsAsFactors=F, sep="\t")

transcript <- subset(ensembl, V3=="transcript")
transcript.start <- transcript
transcript.start$V3 <- "transcript_start"
transcript.start$V4 <- transcript.start$V5 <- ifelse(transcript.start$V7 == "+",
                                                     transcript$V4, transcript$V5)

write.gff(transcript.start, file="ensembl_transcript_start.gtf")

Finding distance with BEDTools

Time to find the closest feature to each transcript start! You could do this in R with GenomicRanges, but I like BEDTools. It’s a command line tool, and if you haven’t already you will need to download and compile it, which I recall being painless.

bedtools closest is the command that finds, for each feature in one file, the closest feature in the other file. The -a and -b flags tells BEDTools which files to operate on, and the -d flag that we also want it to output the distance. BEDTools writes output to standard out, so we use ”>” to capture it in a text file.

Here is the bash script. I put the above R code in clean_files.R and added it as an Rscript line at the beginning, so I could run it all with one file.

#!/bin/bash
Rscript clean_files.R

bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3391.gff3 \
    > closest_element_3391.txt
bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3392.gff3 \
    > closest_element_3392.txt

Some results

With the resulting file we can go back to R and ggplot2 and draw cute graphs like this, which shows the distribution of distances from transcript to Hp1b peak for protein coding and noncoding transcripts separately. Note the different y-scales (there are way more protein coding genes in the annotation) and the 10-logarithm plus one transformation on the x-axis. The plus one is to show the zeroes; BEDTools returns a distance of 0 for transcripts that overlap a Hp1b site.

closest_3391 <- read.table("~/blogg/dmel_hp1/closest_element_3391.txt", header=F, sep="\t")

library(ggplot2)
qplot(x=log10(V19 + 1), data=subset(closest_3391, V2 %in% c("protein_coding", "ncRNA"))) +
  facet_wrap(~V2, scale="free_y")

distance_hist

Or by merging the datasets from different antibodies, we can draw this strange beauty, which pretty much tells us that the antibodies do not give the same result in terms of the closest feature. To figure out how they differ, one would have to look more closely into the genomic distribution of the peaks.

closest_3392 <- read.table("~/blogg/dmel_hp1/closest_element_3392.txt", header=F, sep="\t")

combined <- merge(closest_3391, closest_3392,
                  by.x=c("V1", "V2","V4", "V5", "V9"),
                  by.y=c("V1", "V2","V4", "V5", "V9"))

qplot(x=log10(V19.x+1), y=log10(V19.y+1), data=combined)

between_antibodies

(If you’re wondering about the points that end up below 0, those are transcripts where there are no peaks called on that chromosome in one of the datasets. BEDTools returns -1 for those that lack matching features on the same chromosome and R will helpfully transform them to -Inf.)

About the DGRP

The question mentioned the DGRP. I don’t know that anyone has looked at ChIP in the DGRP lines, but wouldn’t that be fun? Quantitative genetics of DNA binding protein variation in DGRP and integration with eQTL … What one could do already, though, is take the interesting sites of Hp1 binding and overlap them with the genetic variants of the DGRP lines. I don’t know if that would tell you much — does anyone know what kind of variant would affect Hp1 binding?

Happy hacking!

Written by mrtnj

4 juli, 2014 at 20:08

Journal club of one: ”Genome-wide association of foraging behavior in Drosophila melanogaster fails to support large-effect alleles at the foraging gene” (preprint)

with 4 comments

This preprint was posted on bioRxiv and Haldane’s sieve. It tells the story of one of the best known genetic variants affecting behaviour, the foraging gene in Drosophila melanogaster. for is still a nice example of a large-effect variant causing (developmentally) pleiotropic effects. However, Turner & al present evidence questioning whether for has any substantial effect in natural populations of flies. I think it’s self-evident why I’m interested.

They look at previous evidence for foraging as a quantitative trait gene in files sampled from natural populations and perform genome-wide association and population genetic tests with 35 DGRP lines, finding nothing at the for locus.

Comments:

(Since this is a preprint, I will feel free to suggest what I think could be improvements to the manuscript. Obviously, these are just my opinions.)

I’m not convinced one can really separate a unimodal from a bimodal distribution with 36 data points? Below are a few histograms simulated from a mixture of two normal distributions where 25 samples are ”rovers” and 11 ”sitters”.

bimodal

For fun, I also tested for normality with the Shapiro-Wilks’ test as the authors did, and about half of 1000 tests reject. My histograms should not be overinterpreted; I just generated two normal distributions with means log10(2.66) and log10(1.3) with standard deviations 0.1. I don’t know the actual standard deviations of the forS and forR reference strains. Of course, when the standard deviation is small enough, the distributions clearly separate and Shapiro-Wilks’ test will reject.

Power is difficult, but in this case the authors are looking at a well-known effect. They should be able to postulate some reasonable effect-sizes given the literature and the difference between the reference strains and make sure that they’re actually powered to detect it. 35 individuals for a GWAS is not much. They may still have good power to detect a effect of the size expected at for, at least in the single-point test, but it would be nice to demonstrate it. Power feels particularly pertinent as the authors claim to find evidence of absence. The same thing should apply to the population genetic tests, though it’s probably harder to know what effects to expect there.

The authors discuss alternative interpretations, and mention  the fact that in their hands the reference strains did not travel nearly as long as in previous experiments. How likely is it, though, that the variant isn’t segregating in Raleigh but in the populations previously sampled?

Literature

Thomas Turner, Christopher C Giauque, Daniel R Schrider, Andrew D Kern. (2014) Genome-wide association of foraging behavior in Drosophila melanogaster fails to support large-effect alleles at the foraging gene. Preprint on bioaRxiv. doi: 10.1101/004325

Written by mrtnj

30 april, 2014 at 19:59

”Made obvious by our use of contraceptives”

leave a comment »

I recently reread part of The Selfish Gene. The introduction to the 30th anniversary edition is great fun. For one thing, Dawkins expresses doubts about the word ”selfish” in the title, and ponders whether he should have called it the Immortal or Cooperative gene instead. That feels very ironic, and I for one think that he made the right choice. It also contains this nugget:

Our brains have evolved to a point where we are capable of rebelling against our selfish genes. The fact that we can do so is made obvious by our use of contraceptives. The same principle can and should work on a larger scale.

Written by mrtnj

26 april, 2014 at 11:55

Publicerat i citerat, english

Tagged with , ,

Bibliometrics and I

leave a comment »

Dear diary,

I’m attending a course about scientific publishing, and the other day there was lecture about bibliometrics by Lovisa Österlund and David Lawrence from the Linköping University library. I don’t think I know anyone who particularly likes bibliometrics, but I guess it makes sense that if one needs to evaluate research without trying to understand what it is about there are only citations, the reputation of the publication channel and the cv of the researcher to look at. I imagine it’s a bit like reviewing a novel in a language one doesn’t know. A couple of things occured to me, though.

What to do when different instruments of evaluation give different results? Take the two papers (so far) published during my PhD: they both deal with the genetics of chicken comb size; one is published in PLOS Genetics and one in Molecular Ecology. If we look at journal impact factors (and we shouldn’t, but say that we do), PLOS Genetics comes out ahead with an impact factor of 8.5 against 6.3. For those that do not know this about it, journal impact factor is the mean number of citations for papers in that journal the last two years calculated by Thomson Reuters in their own secret way. However, Linköping University has for some reason decided to use the Norwegian index for evaluating publication channels. I don’t know why, and I don’t think it matters that much for me personally, since the system will change soon and I will finish in about a year and a half. In the Norwegian system journals are ranked as level one or two, where two is better and is supposed to represent the top 20% of that subject area. According to their database, Molecular ecology is level 2, while PLOS Genetics is level 1. The source of the discrepancy is probably that PLOS Genetics is counted as biomedicine, while Molecular Ecology is biology, according to the Norwegian database.

They also mentioned Altmetrics, and I don’t know what to make of it. On one hand, I guess it’s good to keep tabs on social media. On the other hand, what do numbers of tweets really tell you, except that one of the authors has a Twitter account? One of the examples in the lecture was the metrics page for this paper that I happen to be a contributor to. It is actually pretty strange. It shows three tweets or 11 tweets, depending on where on the page you look. Also, when I accessed this page earlier today it linked a blog. Now it doesn’t. That says something about the ephemeral nature of internet media. Regardless, when I first saw the page I thought perhaps the metrics page had picked up on my post about the paper, but that was not the case. I don’t know how altmetric.com define a ”science blog”, maybe the blog has to be listed on some aggregation site or another, and I’m not pretending my post is particularly insightful or important. Still it’s a little strange that the altmetrics page doesn’t list a post by one of the authors about the paper, but listed a post that referred to the paper with only two sentences and was mistaken about the conclusion.

Written by mrtnj

24 april, 2014 at 18:05

Morning coffe: ”epigenetics” is also ambiguous

leave a comment »

IMG_20140228_175448

I believe there is an analogy between the dual meaning of the word ”gene” and two senses of epigenetics, that this distinction is easy to get wrong and that it contributes to the confusion about the meaning of epigenetics. Gene can mean a sequence that has a name and a function, or it can mean a genetic variant. I sometimes, half-jokingly, call this genetics(1) and genetics(2). The order is wrong from a historical perspective, since the study of heritable variation predates the discovery of molecular genes. The first deals with the function of sequences and their products. The second deals with differences between individuals carrying different variants.

The same can be said about epigenetics. On one hand there is epigenetics(1), aiming to understand the normal function of certain molecular features, i.e. gene regulatory states that can be passed on through cell division. On the other hand, epigenetics(2) aims to explain individual variation between individuals that differ not in their DNA sequence but in other types of heritable states. And the recurring reader knows that I think that, since a lot of genetics(2) makes no assumptions about the molecular nature of the variation it studies, it will mostly work even if some of these states turn out to be epigenetic. In that sense, epigenetics(2) is a part of genetics.

Written by mrtnj

23 april, 2014 at 07:30

E. O. Wilson and B. F. Skinner

leave a comment »

E.O. Wilson: This is going to be a conversation that I will have with B.F. Skinner. This is Ed Wilson. He invited me to talk about sociobiology. Our relations have always been very friendly and I look forward to it. This should be an interesting talk this Thursday morning.
B.F. Skinner: We will start with a basic statement. I assume that you are what I call a behaviorist. You would accept that an organism is a biophysical and biochemical system, a product of evolution.
E.O. Wilson: I am.

B.F. Skinner: That would include not only genetic behavior, but also the kinds of behavior that can be learned because of genetic processes. Of course it (behavior) always goes back to genetics.

Naour, P (2009) E. O. Wilson and B. F. Skinner. A dialogue between sociobiology and radical behaviorism. New York: Springer.

Written by mrtnj

21 april, 2014 at 16:07

Journal club of one: ”Maternal and additive gentic effects contribute to variation in offspring traits in a lizard”

leave a comment »

The posts this week have been about epigenetics. However, let’s step back from the molecular mechanisms and what not to look at the bigger picture. This recent paper by Noble, McFarlane, Keogh and Whiting (2014) looks at maternal effects and additive genetic effects on fitness-related traits in a lizard. Now we are in quantitative genetics territory where one uses pedigrees and phenotypes to look at the determinants of a trait while abstracting away the mechanistic details. Nowadays, quantitative genetics is also equipped with Bayesian animal models and the ability to do parentage assignment with molecular methods.

The authors measured at size, body mass, and growth and as well as the speed and endurance when running. The fun part is that while only endurance had a substantial heritability (0.4), the other traits had maternal components in the 0.2-0.5 range. So for most of the traits there’s little heritability while a big chunk of the trait variance is explained by maternal effects.

Comments:

I like the idea to include maternal traits to see look at what causes the maternal effect. Clutch size, maternal size and condition seem matter for some trait or another. In two cases the maternal effect is entirely explained away: the effect on growth by birth date and clutch size, and sprint speed by birth date.

The inferences come from an animal model that include a maternal effect. Something I’m curious about is how heritability would be overestimated if the maternal component was not accounted for. That is beside the point of the paper, though.

Another interesting point: I think everyone who deals with animals in some type of controlled environment wonder about how much our measurements differ from what would’ve been measured in a more natural environment. In this case, the authors measured offspring growth both in the test environment and in an enclosure. They find a maternal effect in the test environment, while the interval for the heritability goes from almost zero to 0.5. In the wilder environment they estimate very little genetic and maternal variance, as well as a larger residual variance. I don’t know if this is just because of increased noise, or because maternal effects actually interact with condition.

Also, I love figure 1 (the one figure). If more papers had caterpillar plots of most important estimated quantities, the world would be a better place.

Literature

Noble, D. W., McFarlane, S. E., Keogh, J. S., & Whiting, M. J. (2014). Maternal and additive genetic effects contribute to variation in offspring traits in a lizard. Behavioral Ecology, aru032.

Written by mrtnj

11 april, 2014 at 18:46

Följ

Få meddelanden om nya inlägg via e-post.

Gör sällskap med 1 168 andra följare