There is grandeur in this view of life

martins bioblogg

Archive for the ‘english’ Category

Last year in Marseille and the EBM18 book


The EBM in Marseille was about a year ago (September 2014), but I don’t mind a bit of blog anachronism. I post this from the European Society for Evolutionary Biology conference in Lausanne. If you happen to be here, you can see me talk about signatures of selection in feralisation in Symposium 20 on Tuesday afternoon.

If you saw a bearded man carrying a pink bag scrambling towards Gare de Marseille Saint-Charles (pictured below; an incredibly beautiful train station) while eating boiled potatoes from a plastic bag, you may have witnessed my stylish departure from Marseille. This was during the Air France strike, and I had just learned that I could catch a train to Nice and go from there. In a moment of brilliance or cheapness, I also decided to spend the night at the airport Nice Côte d’Azur.


The conference itself was nothing short of wonderful. There were many interesting talks, but it was small enough that everything fit in one track, and there was plenty of time to meet people. The conference also ended with the nicest social activity at any conference I’ve been to: a group of participants went for a walk around Marseille with Pierre Pontarotti and his cute little dog. Myself, I presented our comb size work (2012 paper, 2014 paper and some new stuff). I felt like it went rather well. It seems someone else in the scientific committee thought so too, because I got invited to write a chapter for the book of chapters by meeting participants that they put together each year. The invitation was to write an overview of the field we talked about, so I wrote about ”The genomics of sexual ornamentation, gene identification and pleiotropy”. One can have a look at the chapter on Google Books. The chapter goes through genomic studies (mostly QTL mapping and gene expression microarrays) on sexual ornaments, and some of the problems and promises.

I am not really sure when the book came out; I saw it popping up on Google Scholar the other day, but I haven’t seen the final version of my chapter. I assume a book is on its way or waiting for me when I get back from ESEB.

Pontarotti Pierre (Ed). (2015) Evolutionary Biology: Biodiversification from Genotype to Phenotype. Springer.

Written by mrtnj

9 August, 2015 at 23:35

Posted in dear diary, english

@sweden recap


So, a couple of weeks ago I tweeted from the @sweden account. This is a short recap of some things that were said, and a few links that I promised people. Overall I think it went pretty well. I didn’t tweet as much as some other curators, but much much more than I usually do. This also meant I did spend my lunch and coffee breaks looking at my phone. My tweets are collected here, if for some reason you’d care to read them.

Of course, tweeting from a rotating curation account is very different from the way I normally use Twitter. First, I read much more than I write. One of the main purposes of Twitter, for me, is to get a steady stream of links to read. That doesn’t really work on an account that follows much more and entirely different people. A lot of what I wrote was prepared monologue, but I don’t think that’s necessarily a bad thing. I follow a lot of people on Twitter for their monologues. Also, thankfully, a lot of people asked me questions! Another thing that struck me is that so few people were unpleasant. There were a few extreme right folks who wanted me to retweet their racist tweets, but only a few. Then, a few felt the need to tell me that I’m utterly boring, which is fine. Someone lamented the fact that all curators are uneducated about the proper use of Twitter (it’s probably to build your personal brand or something). Also, a certain Swedish celebrity got put on ignore so I wouldn’t have to see him tagging each tweet with ”@sweden”. But that was pretty much all.

I talked quite a bit about my research. I spent more or less a full day on the chicken comb as a sexual ornament and genetics of comb mass. We discussed domestication as an evolutionary process, tonic immobility, and how to measure gene expression for eQTL mapping. I also wrote about Kauai feral chickens … And what I actually do in a day nowadays, that is: writing R code.

I got a question about what to say to your creationist friend. I think this depends on what the creationist friend believes and what their objections to evolution are. Unfortunately, I don’t think there is a simple knock-down argument against all forms of creationism, except that evolution works really well and has a lot of evidence going for it. I certainly don’t think it will do to rely on methodological naturalism and say that ”creation would be a supernatural event and outside the scope of science”. First, because I don’t think that is how science works. Say if unicorns, miraculous healing, and species popping into existence without relation to other species were actually part of the world, wouldn’t we want to study that? Second, that will never convince anyone, except of the irrelevance of science to their worldview.

But I think there are a handful of things that creationists often take issue with. First, some don’t believe in sequence variants creating new functions. This is often described with slogans about information, and how it cannot be created by random mutation. I don’t think ”sequence information” is a particularly useful concept, and would much prefer to talk about function and adaptation. That is what is important, after all: organisms acquiring new adaptations. It turns out that new functions arising can be observed, particularly in microorganisms. Some really fun and well-studied examples occur in the Long Term Evolution Experiment; see Richard Lenski’s blog, which has explanatory posts and links to papers.

Second, the formation of species comes up a lot in these discussions. This is a bit tricky, because it’s not always clear what constitutes different species. The definition most people have heard is probably that individuals belong to different species if they cannot have fertile offspring. But just think of asexually reproducing organisms. There, individuals belong to different species if they’re sufficiently different. So we already have what is needed to understand the formation of species in the evolution of new functions. When it comes to sexually reproducing organisms, there are examples of the evolution of reproductive isolation: cases where it seems to be ongoing or to have happened recently. (See for instance this paper on hybrid incompatibility in Mimulus guttatus; I have blogged about it, but only in Swedish.)

Third, there is the question of relatedness between species. In particular, some creationists really hate the idea that humans are apes. I think it is important to emphasize a couple of things that evolution does not say about humans and other apes. By the way, this isn’t just confusing for creationists, but for everyone. Evolution does not mean that humans descend from extant apes. Look at this phylogenetic tree from Perelman & al 2011. This is just like a family tree, but of populations: we see how chimps and humans have a recent common ancestor population. This is different than claiming that we would descend from extant chimps. Of course, chimps have also changed since the common ancestor, although not in the same ways as humans. (Again, I’ve written about this before in Swedish.)


Speaking of unicorns, I of course celebrated unicorn Friday:

Someone asked whether you can keep fruit flies for amateur genetics at home. That should be quite possible, and I don’t see any real problems with it either. The fruit fly community has a really strong culture of classical genetics with crosses and stocks. I don’t know if stock centres would deliver to private customers, but I don’t see why they wouldn’t, except for transgenic flies. It turned out, however, that transgenic flies were actually what the person asking was after. And of course, I can’t recommend that. I must say, I have mixed feelings about do-it-yourself biotechnology. On the one hand, some home molecular biology should be possible and rewarding. On the other hand, a lot of things routinely used in molecular labs are actually really dangerous if misused, and not just for the user. For example, when making any type of construct in transgenic bacteria, antibiotics and antibiotic resistance genes are the standard screening markers. They are used to pick out the bacteria that have incorporated the piece of DNA you care about. This is not the kind of stuff you want to use without proper containment. So, in the fly example, you would not only have to handle the flies, but also transgenic antibiotic resistant bacteria safely and legally. Then again, a lot of the genetics I care about does not involve any of that, and could very well be done in a basement.

The @sweden account caught me during a teaching week; otherwise, all of my photos would’ve been my computer, my pen and my coffee mug. Now I got to walk the followers through agarose gel electrophoresis and a little transformation of bacteria:

And, finally, Swedish spring:

Written by mrtnj

4 May, 2015 at 19:45

Paper: ”Mixed ancestry and admixture in Kauai’s feral chickens: invasion of domestic genes into ancient Red Junglefowl reservoirs”


We have a new paper almost out (now in early view) in Molecular Ecology about the chickens on the Pacific island Kauai. These chickens are pretty famous for being everywhere on the island. Where do they come from? If you use your favourite search engine you’ll find an explanation with two possible origins: ancient wild birds brought over by the Polynesians and escaped domestic chickens. This post on Kauaiblog is great:

Hawaii’s official State bird is the Hawaiian Goose, or Nene, but on Kauai, everyone jokes that the “official” birds of the Garden Island are feral chickens, especially the wild roosters.

Wikepedia says the “mua” or red jungle fowl were brought to Kauai by the Polynesians as a source of food, thriving on an island where they have no real predators. /…/
Most locals agree that wild chickens proliferated after Hurricane Iniki ripped across Kauai in 1992, destroying chicken coops and releasing domesticated hens, and well as roosters being bred for cockfighting. Now these brilliantly feathered fowl inhabit every part of this tropical paradise, crowing at all hours of the day and night to the delight or dismay of tourists and locals alike.

In this paper, we look at phenotypes and genetics and find that this dual origin explanation is probably true.


(Chickens on Kauai. This is not from our paper, but by Jeff Trimble (cc:by-sa-nc) published on Flickr. There are so many pretty chicken pictures there!)

Dom, Eben, and Pamela went to Kauai to photograph, record, and collect DNA from the chickens. (I stayed at home and did sequence bioinformatics.) The Kauai chickens look and sound like a mixture of wild and domestic chickens. Some of them have the typical Junglefowl plumage, and others have flecks of white. Their crows vary in the length of the characteristic fourth syllable. Also, some of them have yellow legs, a trait that domestic chickens seem to have gotten not from the Red but from the Grey Junglefowl.

We looked at DNA sequences by massively parallel (SOLiD) sequencing of 23 individuals. We find mitochondrial sequences that fall in two haplogroups: E and D. The presence of the D haplogroup, which is the dominating one in ancient DNA sequences from the Pacific, means that there is a Pacific component to their ancestry. The E group, on the other hand, occurs in domestic chickens. It also shows up in some ancient DNA samples from the Pacific, but not from Kauai (and there is a scientific debate about these sequences). The nuclear genome analysis is pretty inconclusive. I think what we would need is some samples of possible domestic source populations (Where did the escapee chickens come from? Are there other traditional domestic sources?) and a better sampling of Red Junglefowl to make better sense of it.

When we take the plumage, vocalisation and mitochondrial DNA together, it looks like this is a feral admixed population of either Red Junglefowl or traditional Pacific chickens mixed with domestics. A very interesting population indeed.

Kenneth Chang wrote about the paper in the New York Times; the article includes quotes from Eben and Dom.

E Gering, M Johnsson, P Willis, T Getty, D Wright (2015) Mixed ancestry and admixture in Kauai’s feral chickens: invasion of domestic genes into ancient Red Junglefowl reservoirs. Molecular Ecology

Written by mrtnj

8 April, 2015 at 20:46

Morning coffee: cost per genome


I recently heard this thing referred to as ”the most overused slide in genomics” (Daniel Klevebring). It might be: what it shows is some estimate of the cost of sequencing a human genome over time, and how it plummets around 2008. Before that, the curve is Sanger sequencing, and then the costs show second generation sequencing (454, Illumina and SOLiD).


The source is the US National Human Genome Research Institute, and they’ve put some thought into how to estimate costs so that machines, reagents, analysis and people to do the work are included and that the different platforms are somewhat comparable. One must first point out that downstream analysis to make any sense of the data (assembly and variant calling) isn’t included. But the most important thing that this graph hides, even if the estimates of the cost would be perfect, is that to ”sequence a genome” means something completely different in 2001 and 2015. (Well, with third generation sequencers that give long reads coming up, the old meaning might come back.)

For data since January 2008 (representing data generated using ‘second-generation’ sequencing platforms), the ”Cost per Genome” graph reflects projects involving the ‘re-sequencing’ of the human genome, where an available reference human genome sequence is available to serve as a backbone for downstream data analyses.

The human genome project was of course about sequencing and assembling the genome into high quality sequences. Very few of the millions of human genomes resequenced since are anywhere close. As people in the sequencing loop know, resequencing with short reads doesn’t give you a genome sequence (and neither does trying to assemble a messy eukaryote genome with short reads only). It gives you a list of variants compared to the reference sequence. The usual short read business has no way of detecting anything but single nucleotide variants and small indels. (And the latter depends … Also, you can detect copy number variants, but large scale structural variants are mostly off the table.) Of course, you can use these edits to reconstruct a consensus sequence from the reference, but it would be a total lie.
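To make the ”consensus from the reference” idea concrete, here is a toy sketch in R. The function apply_snvs and the ten-base reference are made up for illustration, and it only handles single nucleotide variants. The point is that every base without a called variant is simply copied from the reference, whether or not the sample really looks like that.

```r
# Apply single nucleotide variants to a reference to get a "consensus".
# apply_snvs() and the toy data are hypothetical, for illustration only.
apply_snvs <- function(reference, positions, alt_alleles) {
  bases <- strsplit(reference, "")[[1]]
  bases[positions] <- alt_alleles
  paste(bases, collapse = "")
}

reference <- "ACGTACGTAC"

# Two called variants: C>T at position 2 and G>A at position 7.
# Everything else is inherited from the reference unchanged.
consensus <- apply_snvs(reference, positions = c(2, 7),
                        alt_alleles = c("T", "A"))
```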

Again, none of this is news for people who deal with sequencing, and I’m not knocking second-generation sequencing. It’s very useful and has made a lot of new things possible. It’s just something I think about every time I see that slide.

Written by mrtnj

2 April, 2015 at 07:15

R in genomics @ SciLifeLab, Solna


Dear diary,

I went to the Stockholm R useR group meetup on R in genomics at the Stockholm node of SciLifeLab. It was nice. If I worked a bit closer, I would attend meetups all the time. I even got to be pretentious with my notebook while waiting for the train.


The speakers were:

Jakub Orzechowski Westholm on R and genomics in general. He demonstrated genome browser-style tracks with Gviz, some GenomicRanges, and a couple of common plots of gene expression data. I have been on the fence about what package I should use for drawing genes and variants along the genome. I should play with Gviz.

Daniel Klevebring on clinical sequencing and how he uses R (not that much) in sequencing pipelines aimed at targeting the right therapy to patients based on the mutations in their cancer cells. He mentioned some getopt snippets for getting R to play nicely on the command line, which is something I should definitely try more!

Finally, Arvind Singh Mer on predictive modelling for clinical genomics (like the abovementioned ClinSeq data). He showed the caret package for machine learning, with an elastic net regression.
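About getting R to play nicely on the command line: I don’t have the getopt snippets from the talk, so this is a minimal sketch with base R’s commandArgs instead of the getopt package. The helper parse_args and the ”--name=value” convention are my own assumptions, not anything shown at the meetup.

```r
# Parse "--name=value" style arguments into a named list.
# parse_args() is a hypothetical helper, not part of the getopt package.
parse_args <- function(args) {
  flags <- grep("^--[^=]+=", args, value = TRUE)
  values <- sub("^--[^=]+=", "", flags)
  names(values) <- sub("^--([^=]+)=.*$", "\\1", flags)
  as.list(values)
}

# In a script run as: Rscript script.R --input=peaks.gff3 --min=1
# commandArgs(trailingOnly = TRUE) returns what comes after the script name.
opts <- parse_args(commandArgs(trailingOnly = TRUE))

# The same parser applied to a hand-written argument vector:
example <- parse_args(c("--input=peaks.gff3", "--min=1"))
```

Values come back as character strings, so a numeric option such as example$min needs as.numeric before use.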

I don’t know the rest of the audience, so maybe the choice to gear talks towards the non-bio* person was spot on, but that made things a bit less interesting for me. For instance, in Jakub’s talk about gene expression, I would’ve preferred more about the messy stuff: how to make that nice gene-by-sample matrix in the first place, and if R can be of any help in that process; also, in the other end, what models one would use after that first pass of visualisation. But this isn’t a criticism of the presenters — time and complexity constraints apply. (If I was asked to present how I use R any demos would be toy analyses of clean datasets. That is the way these things go.)

We also heard repeated praise for and recommendations of the hadleyverse and data.table. I’m not a data.tabler myself, but I probably should be. And I completely agree about the value of dplyr: there’s this one analysis where a couple of lines with dplyr changed it from ”argh, do I have to rewrite this in C?” to being workable. I think we also saw all three plotting systems in action: base graphics, ggplot2 and lattice.

Written by mrtnj

24 March, 2015 at 17:40

Finding the distance from ChIP signals to genes


I’ve had a couple of months off from blogging. Time for some computer-assisted biology! Robert Griffin asks on Stack Exchange about finding the distance between HP1 binding sites and genes in Drosophila melanogaster.  We can get a rough idea with some public chromatin immunoprecipitation data, R and the wonderful BEDTools.

Finding some binding sites

There are indeed some ChIP-seq datasets on HP1 available. I looked up these ones from modENCODE: modENCODE_3391 and modENCODE_3392, using two different antibodies for Hp1b in 16-24 h old embryos. I’m not sure since the modENCODE site doesn’t seem to link datasets to publications, but I think this is the paper where the results are reported: A cis-regulatory map of the Drosophila genome (Nègre & al 2011).

What they’ve done, in short, is cross-link with formaldehyde, sonicate the DNA into fragments, capture fragments with either of the two antibodies, and sequence those fragments. They aligned reads with Eland (Illumina’s old proprietary aligner) and called peaks (i.e. regions where there are a lot of reads, which should reflect regions bound by Hp1b) with MACS. We can download their peaks in general feature format.

I don’t know whether there is any way to make purely computational predictions of Hp1 binding sites, but I doubt it.

Some data cleaning

The files are available from ftp, and for the below analysis I’ve unzipped them and called them modENCODE_3391.gff3 and modENCODE_3392.gff3. GFF is one of all those tab separated text files that people use for genomic coordinates. If you do any bioinformatics type work you will have to convert back and forth between them and I suggest bookmarking the UCSC Genome Browser Format FAQ.

Even when we trust in their analysis, some processing of files is always required. In this case, MACS sometimes outputs peaks with negative start coordinates in the beginning of a chromosome. BEDTools will have none of that, because ”malformed GFF entry at line … Start was greater than end”. In this case, it happens only at a few lines, and I decided to set those start coordinates to 1 instead.

We need a small script to solve that. As I’ve written before, any language will do, but I like R and tend to do my utility scripting in R (and bash). If the files were incredibly huge and didn’t fit in memory, we’d have to work through the files line by line or chunk by chunk. But in this case we can just read everything at once and operate on it with vectorised R commands, and then write the table again.

# Read the peak files; GFF is tab separated and has no header line
modENCODE_3391 <- read.table("modENCODE_3391.gff3", stringsAsFactors=F, sep="\t")
modENCODE_3392 <- read.table("modENCODE_3392.gff3", stringsAsFactors=F, sep="\t")

# Set start coordinates (column 4) that are below 1 to 1
fix.coord <- function(gff) {
  gff$V4[which(gff$V4 < 1)] <- 1
  gff
}

# Write the table back out in the same tab separated layout
write.gff <- function(gff, file) {
  write.table(gff, file=file, row.names=F, col.names=F,
              quote=F, sep="\t")
}

write.gff(fix.coord(modENCODE_3391), file="cleaned_3391.gff3")
write.gff(fix.coord(modENCODE_3392), file="cleaned_3392.gff3")
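
For the case where the files don’t fit in memory, here is a rough sketch of the chunk by chunk alternative. I didn’t need it for this analysis; the function name fix_gff_chunked and the chunk_size parameter are made up for illustration, and it assumes that the start coordinate sits in column 4 and that comment lines start with ”#”.

```r
# Clamp start coordinates (column 4) below 1, one chunk of lines at a time.
# fix_gff_chunked() and chunk_size are hypothetical names for illustration.
fix_gff_chunked <- function(infile, outfile, chunk_size = 10000) {
  input <- file(infile, open = "r")
  output <- file(outfile, open = "w")
  on.exit({ close(input); close(output) })
  repeat {
    lines <- readLines(input, n = chunk_size)
    if (length(lines) == 0) break
    # Leave comment lines untouched
    is_comment <- grepl("^#", lines)
    fields <- strsplit(lines[!is_comment], "\t", fixed = TRUE)
    fixed <- vapply(fields, function(f) {
      start <- suppressWarnings(as.numeric(f[4]))
      if (length(f) >= 4 && !is.na(start) && start < 1) f[4] <- "1"
      paste(f, collapse = "\t")
    }, character(1))
    lines[!is_comment] <- fixed
    writeLines(lines, output)
  }
  invisible(outfile)
}
```

Only chunk_size lines are held in memory at any time, at the price of slower string handling than the vectorised version.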

Flybase transcripts

To find the distance to genes, we need to know where the genes are. The best source is probably the annotation made by Flybase, which I downloaded from the Ensembl ftp in General transfer format (GTF, which is close enough to GFF that we don’t have to care about the differences right now).

This file contains a lot of different features. We extract the transcripts and find where the transcript model starts, taking into account whether the transcript is in the forward or reverse direction (this information is stored in columns 4, 5 and 7 of the GTF file). We store this in a new GTF file of transcript start positions, which is the one we will feed to BEDTools:

ensembl <- read.table("Drosophila_melanogaster.BDGP5.75.gtf",
                      stringsAsFactors=F, sep="\t")

transcript <- subset(ensembl, V3=="transcript")
transcript.start <- transcript
transcript.start$V3 <- "transcript_start"
transcript.start$V4 <- transcript.start$V5 <- ifelse(transcript.start$V7 == "+",
                                                     transcript$V4, transcript$V5)

write.gff(transcript.start, file="ensembl_transcript_start.gtf")
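
To see what the strand logic does, here is a toy example with two made-up transcripts that share coordinates but sit on opposite strands. The column names follow the V4, V5 and V7 convention from read.table above; the numbers are invented.

```r
# Two made-up transcripts with the same coordinates on opposite strands.
toy <- data.frame(V4 = c(100, 100), V5 = c(500, 500),
                  V7 = c("+", "-"), stringsAsFactors = FALSE)

# Forward strand: the transcript starts at the lower coordinate (V4).
# Reverse strand: it starts at the higher coordinate (V5).
toy$start <- ifelse(toy$V7 == "+", toy$V4, toy$V5)
```

The forward transcript gets start 100 and the reverse transcript gets start 500.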

Finding distance with BEDTools

Time to find the closest feature to each transcript start! You could do this in R with GenomicRanges, but I like BEDTools. It’s a command line tool, and if you haven’t already you will need to download and compile it, which I recall being painless.

bedtools closest is the command that finds, for each feature in one file, the closest feature in the other file. The -a and -b flags tell BEDTools which files to operate on, and the -d flag tells it that we also want it to output the distance. BEDTools writes output to standard out, so we use ”>” to capture it in a text file.

Here is the bash script. I put the above R code in clean_files.R and added it as an Rscript line at the beginning, so I could run it all with one file.

Rscript clean_files.R

bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3391.gff3 \
    > closest_element_3391.txt
bedtools closest -d -a ensembl_transcript_start.gtf -b cleaned_3392.gff3 \
    > closest_element_3392.txt
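
For intuition about what the distance column means, here is a base R toy version of the same idea for a single position and a handful of intervals. This is not how BEDTools is implemented; distance_to_closest is a made-up helper that assumes 1-based inclusive coordinates on a single chromosome, and BEDTools’ own conventions at interval boundaries may differ.

```r
# Distance from a single position to the nearest interval; 0 if inside one.
# distance_to_closest() is a hypothetical helper for illustration.
distance_to_closest <- function(pos, starts, ends) {
  if (length(starts) == 0) return(-1)  # mimic the "no feature" value
  # Gap to each interval: positive when outside it, 0 when inside
  gap <- pmax(starts - pos, pos - ends, 0)
  min(gap)
}

# Two made-up peaks
peaks_start <- c(1000, 5000)
peaks_end <- c(1200, 5600)

# One position before, one inside and one between the peaks
distances <- sapply(c(900, 1100, 3000), distance_to_closest,
                    starts = peaks_start, ends = peaks_end)
```

Here distances comes out as 100, 0 and 1800.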

Some results

With the resulting file we can go back to R and ggplot2 and draw cute graphs like this, which shows the distribution of distances from transcript to Hp1b peak for protein coding and noncoding transcripts separately. Note the different y-scales (there are way more protein coding genes in the annotation) and the 10-logarithm plus one transformation on the x-axis. The plus one is to show the zeroes; BEDTools returns a distance of 0 for transcripts that overlap an Hp1b site.

library(ggplot2)

closest_3391 <- read.table("~/blogg/dmel_hp1/closest_element_3391.txt", header=F, sep="\t")

qplot(x=log10(V19 + 1), data=subset(closest_3391, V2 %in% c("protein_coding", "ncRNA"))) +
  facet_wrap(~V2, scales="free_y")


Or by merging the datasets from different antibodies, we can draw this strange beauty, which pretty much tells us that the antibodies do not give the same result in terms of the closest feature. To figure out how they differ, one would have to look more closely into the genomic distribution of the peaks.

closest_3392 <- read.table("~/blogg/dmel_hp1/closest_element_3392.txt", header=F, sep="\t")

combined <- merge(closest_3391, closest_3392,
                  by.x=c("V1", "V2","V4", "V5", "V9"),
                  by.y=c("V1", "V2","V4", "V5", "V9"))

qplot(x=log10(V19.x+1), y=log10(V19.y+1), data=combined)


(If you’re wondering about the points that end up below 0, those are transcripts where there are no peaks called on that chromosome in one of the datasets. BEDTools returns -1 for those that lack matching features on the same chromosome and R will helpfully transform them to -Inf.)
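
A small sketch of what happens to those values, with made-up distances. Recoding -1 to NA before the transformation is one option I could have used; the plots above don’t do this.

```r
# Toy distances as BEDTools might report them: -1 means no feature found.
d <- c(-1, 0, 9, 99)
raw_log <- log10(d + 1)  # the -1 becomes log10(0), that is -Inf

# One option: mark missing distances as NA before transforming.
d_clean <- ifelse(d < 0, NA, d)
log_d <- log10(d_clean + 1)
```

With NA instead of -Inf, ggplot2 drops those points with a warning rather than pinning them to the edge of the plot.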

About the DGRP

The question mentioned the DGRP. I don’t know that anyone has looked at ChIP in the DGRP lines, but wouldn’t that be fun? Quantitative genetics of DNA binding protein variation in DGRP and integration with eQTL … What one could do already, though, is take the interesting sites of Hp1 binding and overlap them with the genetic variants of the DGRP lines. I don’t know if that would tell you much — does anyone know what kind of variant would affect Hp1 binding?

Happy hacking!

Written by mrtnj

4 July, 2014 at 20:08

Journal club of one: ”Genome-wide association of foraging behavior in Drosophila melanogaster fails to support large-effect alleles at the foraging gene” (preprint)


This preprint was posted on bioRxiv and Haldane’s sieve. It tells the story of one of the best known genetic variants affecting behaviour, the foraging gene in Drosophila melanogaster. for is still a nice example of a large-effect variant causing (developmentally) pleiotropic effects. However, Turner & al present evidence questioning whether for has any substantial effect in natural populations of flies. I think it’s self-evident why I’m interested.

They look at previous evidence for foraging as a quantitative trait gene in flies sampled from natural populations and perform genome-wide association and population genetic tests with 35 DGRP lines, finding nothing at the for locus.


(Since this is a preprint, I will feel free to suggest what I think could be improvements to the manuscript. Obviously, these are just my opinions.)

I’m not convinced one can really separate a unimodal from a bimodal distribution with 36 data points? Below are a few histograms simulated from a mixture of two normal distributions where 25 samples are ”rovers” and 11 ”sitters”.


For fun, I also tested for normality with the Shapiro-Wilk test as the authors did, and about half of 1000 tests reject. My histograms should not be overinterpreted; I just generated two normal distributions with means log10(2.66) and log10(1.3) with standard deviations 0.1. I don’t know the actual standard deviations of the forS and forR reference strains. Of course, when the standard deviation is small enough, the distributions clearly separate and the Shapiro-Wilk test will reject.
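
A sketch of that simulation in R, reconstructed from the description above rather than copied from my original script. The seed, the function name simulate_mixture and the 5% rejection threshold are my choices here.

```r
set.seed(20140430)  # arbitrary seed so the sketch is repeatable

# One simulated dataset: 25 "rovers" and 11 "sitters" on a log10 scale.
simulate_mixture <- function(n_rover = 25, n_sitter = 11, sd = 0.1) {
  c(rnorm(n_rover, mean = log10(2.66), sd = sd),
    rnorm(n_sitter, mean = log10(1.3), sd = sd))
}

# Shapiro-Wilk p-values for 1000 simulated datasets.
p_values <- replicate(1000, shapiro.test(simulate_mixture())$p.value)

# Proportion of simulations where normality is rejected at the 5% level.
rejection_rate <- mean(p_values < 0.05)

# To eyeball a few histograms:
# par(mfrow = c(2, 2)); for (i in 1:4) hist(simulate_mixture())
```

Changing the sd argument shows how quickly the two modes separate or blur together.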

Power is difficult, but in this case the authors are looking at a well-known effect. They should be able to postulate some reasonable effect sizes given the literature and the difference between the reference strains, and make sure that they’re actually powered to detect them. 35 individuals for a GWAS is not much. They may still have good power to detect an effect of the size expected at for, at least in the single-point test, but it would be nice to demonstrate it. Power feels particularly pertinent as the authors claim to find evidence of absence. The same thing should apply to the population genetic tests, though it’s probably harder to know what effects to expect there.

The authors discuss alternative interpretations, and mention the fact that in their hands the reference strains did not travel nearly as far as in previous experiments. How likely is it, though, that the variant isn’t segregating in Raleigh but in the populations previously sampled?
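
As a rough idea of what such a demonstration could look like, here is a simulation-based power sketch for a single-point test. Everything in it is an assumption for illustration: the function simulate_power, the 11 lines carrying one allele, the effect of 0.3 (roughly the difference between the two means I used in the simulation above), the standard deviation and the 5% threshold. None of these are estimates from the preprint, and a plain t-test stands in for whatever association model one would really use.

```r
set.seed(1)  # arbitrary seed

# Proportion of simulated studies where a two-group t-test detects the effect.
# All default values are placeholders, not estimates from the preprint.
simulate_power <- function(n_lines = 35, n_minor = 11, effect = 0.3,
                           sd = 0.1, alpha = 0.05, n_sim = 1000) {
  # Inbred lines: each line carries one allele or the other
  genotype <- rep(c(0, 1), times = c(n_lines - n_minor, n_minor))
  p <- replicate(n_sim, {
    phenotype <- rnorm(n_lines, mean = genotype * effect, sd = sd)
    t.test(phenotype ~ genotype)$p.value
  })
  mean(p < alpha)
}

power_large <- simulate_power(effect = 0.3)
power_none <- simulate_power(effect = 0)
```

With no effect, the result is the false positive rate, which is a handy sanity check; a genome-wide threshold would go in through alpha.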


Thomas Turner, Christopher C Giauque, Daniel R Schrider, Andrew D Kern. (2014) Genome-wide association of foraging behavior in Drosophila melanogaster fails to support large-effect alleles at the foraging gene. Preprint on bioRxiv. doi: 10.1101/004325

Written by mrtnj

30 April, 2014 at 19:59

