Summer of data science 1: Genomic prediction machines #SoDS17

Genetics is a data science, right?

One of my Summer of data science learning points was to play with out of the box prediction tools. So let’s try out a few genomic prediction methods. The code is on GitHub, and the simulated data are on Figshare.

Genomic selection is the happy melding of quantitative and molecular genetics. It means using genetic markers en masse to predict traits and and make breeding decisions. It can give you better accuracy in choosing the right plants or animals to pair, and it can allow you to take shortcuts by DNA testing individuals instead of having to test them or their offspring for the trait. There are a bunch of statistical models that can be used for genomic prediction. Now, the choice of prediction algorithm is probably not the most important part of genomic selection, but bear with me.

First, we need some data. For this example, I used AlphaSim (Faux & al 2016), and the AlphaSim graphical user interface, to simulate a toy breeding population. We simulate 10 chromosomes á 100 cM, with 100 additively acting causal variants and 2000 genetic markers per chromosome. The initial genotypes come from neutral simulations. We run one generation of random mating, then three generations of selection on trait values. Each generation has 1000 individuals, with 25 males and 500 females breeding.

So we’re talking a small-ish population with a lot of relatedness and reproductive skew on the male side. We will use the two first two generations of selection (2000 individuals) to train, and try to predict the breeding values of the fourth generation (1000 individuals). Let’s use two of the typical mixed models used for genomic selection, and two tree methods.

We start by splitting the dataset and centring the genotypes by subtracting the mean of each column. Centring will not change predictions, but it may help with fitting the models (Strandén & Christensen 2011).

Let’s begin with the workhorse of genomic prediction: the linear mixed model where all marker coefficients are drawn from a normal distribution. This works out to be the same as GBLUP, the GCTA model, GREML, … a beloved child has many names. We can fit it with the R package BGLR. If we predict values for the held-out testing generation and compare with the real (simulated) values, it looks like this. The first panel shows a comparison with phenotypes, and the second with breeding values.

This gives us correlations of 0.49 between prediction and phenotype, and 0.77 between prediction and breeding value.

This is a plot of the Markov chain Monte Carlo we use to sample from the model. If a chain behaves well, it is supposed to have converged on the target distribution, and there is supposed to be low autocorrelation. Here is a trace plot of four chains for the marker variance (with the coda package). We try to be responsible Bayesian citizens and run the analysis multiple times, and with four chains we get very similar results from each of them, and a potential scale reduction factor of 1.01 (it should be close to 1 when it works). But the autocorrelation is high, so the chains do not explore the posterior distribution very efficiently.

BGLR can also fit a few of the ”Bayesian alphabet” variants of the mixed model. They put different priors on the distribution of marker coefficients allow for large effect variants. BayesB uses a mixture prior, where a lot of effects are assumed to be zero (Meuwissen, Hayes & Goddard 2001). The way we simulated the dataset is actually close to the BayesB model: a lot of variants have no effect. However, mixture models like BayesB are notoriously difficult to fit — and in this case, it clearly doesn’t work that well. The plots below show chains for two BayesB parameters, with potential scale reduction factors of 1.4 and 1.5. So, even if the model gives us the same accuracy as ridge regression (0.77), we can’t know if this reflects what BayesB could do.

On to the trees! Let’s try Random forest and Bayesian additive regression trees (BART). Regression trees make models as bifurcating trees. Something like the regression variant of: ”If the animal has a beak, check if it has a venomous spur. If it does, say that it’s a platypus. If it doesn’t, check whether it quacks like a duck …” The random forest makes a lot of trees on random subsets of the data, and combines the inferences from them. BART makes a sum of trees. Both a random forest (randomForest package) and a BART model on this dataset (fit with bartMachine package), gives a lower accuracy — 0.66 for random forest and 0.72 for BART. This is not so unexpected, because the strength of tree models seems to lie in capturing non-additive effects. And this dataset, by construction, has purely additive inheritance. Both BART and random forest have hyperparameters that one needs to set. I used package defaults for random forest, values that worked well for Waldmann (2016), but one probably should choose them by cross validation.

Finally, we can use classical quantitative genetics to estimate breeding values from the pedigree and relatives’ trait values. Fitting the so called animal model in two ways (pedigree package and MCMCglmm) give accuracies of 0.59 and 0.60.

So, in summary, we recover the common wisdom that the linear mixed model does the job well. It was more accurate than just pedigree, and a bit better than BART. Of course, the point of this post is not to make a fair comparison of methods. Also, the real magic of genomic selection, presumably, happens on every step of the way. How do you get to that neat individual-by-marker matrix in the first place, how do you deal with missing data and data from different sources, what and when do you measure, what do you do with the predictions … But you knew that already.

Journal club of one: ”An expanded view of complex traits: from polygenic to omnigenic”

An expanded view of complex traits: from polygenic to omnigenic” by Boyle, Yang & Pritchard (2017) came out recently in Cell. It has been all over Twitter, and I’m sure it will influence a lot of people’s thinking — rightfully so. It is a good read, pulls in a lot of threads, and has a nice blend of data analysis and reasoning. It’s good. Go read it!

The paper argues that for a lot of quantitative traits — specifically human diseases and height — almost every gene will be associated with every trait. More than that, almost every gene will be causally involved in every trait, most in indirect ways.

It continues with the kind of analysis used in Pickrell (2014), Finucane & al (2015) among many others, that break genome-wide association down down by genome annotation. How much variability can we attribute to variants in open chromatin regions? How much to genes annotated as ”protein bindning”? And so on.

These analyses point towards gene regulation being important, but not that strongly towards particular annotation terms or pathways. The authors take this to mean that, while genetic mapping, including GWAS, finds causally involved genes, it will not necessarily find ”relevant” genes. That is, not necessarily genes that are the central regulators of the trait. That may be a problem if you want to use genetic mapping to find drug targets, pathways to engineer, or similar.

This observation must speak to anyone who has looked at a list of genes from some mapping effort and thought: ”well, that is mostly genes we know nothing about … and something related to cancer”.

They write:

In summary, for a variety of traits, the largest-effect variants are modestly enriched in specific genes or pathways that may play direct roles in disease. However, the SNPs that contribute the bulk of the heritability tend to be spread across the genome and are not near genes with disease-specific functions. The clearest pattern is that the association signal is broadly enriched in regions that are transcriptionally active or involved in transcriptional regulation in disease-relevant cell types but absent from regions that are transcriptionally inactive in those cell types. For typical traits, huge numbers of variants contribute to heritability, in striking consistency with Fisher’s century-old infinitesimal model.

I summary: it’s universal pleiotropy. I don’t think there is any reason to settle on ”cellular” networks exclusively. After all, cells in a multicellular organism share a common pool of energy and nutrients, and exchange all kinds of signalling molecules. This agrees with classical models and the thinking in evolutionary genetics (see Rockman & Paaby 2013). Or look at this expression QTL and gene network study in aspen (Mähler & al 2017): the genes with eQTL tend to be peripheral, not network hub genes.

It’s a bit like in behaviour genetics, where people are fond of making up these elaborate hypothetical causal stories: if eyesight is heritable, and children with bad eyesight get glasses, and the way you treat a child who wears glasses somehow reinforces certain behaviours, so that children who wear glasses grow up to score a bit better on certain tests — are the eyesight variants also ”intelligence variants”? This is supposed to be a reductio ad absurdum of the idea of calling anything an ”intelligence variant” … But I suspect that this is what genetic causation, when fully laid out, will sometimes look like. It can be messy. It can involve elements that we don’t think of as ”relevant” to the trait.

There are caveats, of course:

One reason that there is a clearer enrichment of variant-level annotation such as open chromatin than in gene-level annotation may be that the resolution is higher. We don’t really know that much about how molecular variation translates to higher level trait variation. And let’s not forget that for most GWAS hits, we don’t know the causative gene.

They suggest defining ”core genes” like this: ”conditional on the genotype and expres-
sion levels of all core genes, the genotypes and expression levels of peripheral genes no longer matter”. Core genes are genes that d-separate the peripheral genes from a trait. That makes sense. Some small number of genes may be necessary molecular intermediates for a trait. But as far as I can tell, it doesn’t follow that useful biological information only comes from studying core genes, nor does it follow that we can easily tell if we’ve hit a core or a peripheral gene.

Also, there are quantitative genetics applications of GWAS data that are agnostic of pathways and genes. If we want to use genetics for prediction, for precision medicine etc, we do not really need to know the functions of the causative genes. We need big cohorts, well defined trait measurements, good coverage of genetic variants, and a good idea of environmental risk factors to feed into prediction models.

It’s pretty entertaining to see the popular articles about this paper, and the juxtaposition of quotes like ”that all those big, expensive genome-wide association studies may wind up being little more than a waste of time” (Gizmodo) with researchers taking the opportunity to bring up up their favourite hypotheses about missing heritability — even if it’s not the same people saying both things. Because if we want to study rare variants, or complex epistatic interactions, or epigenomics, or what have you, the studies will have to be just as big and expensive, probably even more so.

Just please don’t call it ”omnigenetics”.


Boyle, Evan A., Yang I. Li, and Jonathan K. Pritchard. ”An Expanded View of Complex Traits: From Polygenic to Omnigenic.” Cell 169.7 (2017): 1177-1186.

Morning coffee: multilevel drift


There is an abstract account of natural selection (Lewontin 1970) where one observes that any population of entities, whatever they may be, will evolve through natural selection if (1) there is variation, that (2) affects reproductive success, and (3) is heritable.

I don’t know how I missed this before, but it recently occurred to me that there must be a similarly abstract account of drift, where a population will evolve through drift if there is (1) variation, (2) that is heritable, and (3) sampling due to finite population size.

Drift may not be negligible, especially since at a higher level of organization, the population size should be smaller, making natural selection relatively less efficient.

Peerage of science, first impressions

After I wrote a post about reviewing papers, Craig Primmer suggested on Twitter that I look into Peerage of Science. Peerage of Science is a portal and community for peer review. It has a lot of good ideas. It decouples reviewing from journal submission, but it is still made for papers aimed to be published in a conventional journal. It collects reviewers and manuscripts from a different fields in one place, allows interested reviewers to select papers they want to review, and provides anonymity (if the authors want it). I once wrote a few sentences about what I thought ”optimal peer review” would be like, for a PLOS early career researchers’ travel grant. (I did not get the grant.) My ideas for better peer review were probably not that bright, or that realistic, but they did share several features with the Peerage of Science model. Naturally, I was interested.

I’ve tried reviewing for Peerage of Science for a couple of months. My first impression is that it seems to work really well. The benefits are quite obvious: I’ve seen some of the papers get more reviews than they would typically get at a journal, and the reviews usually seem no less elaborate. The structured form for reviewing is helpful, and corresponds well with what I think a good review should be like. I think I’ll stick around, look out for the notifications, and jump in when a paper is close to my interests. I really hope enough people will use Peerage of Science for it to be successful.

There are also downsides to this model:

There seems to be an uneven allocation of reviewer effort. Some papers have a lot of reviewers, but some have only one. Of course, only the people at Peerage of Science know the actual distribution of reviews. Maybe one reviewer processes are actually very rare! This is a bit like post-publication review, except that there, you can at least know who else has already commented on a paper. I know some people think that this is a good thing. Papers that attract interest also attract scrutiny, and thus reviewer effort is directed towards where it is most needed. But I think that in the ideal case, every paper would be reviewed thoroughly. This could be helped by an indicator of how many other reviewers have engaged, or at least already posted their essays.

There is also the frustration of coming late to a process where one feels the reviewers have done a poor job. This was my first experience. I joined a review process that was at its last stages, and found a short, rather sloppy review that missed most of what I thought were the important points, and belaboured what I thought was a non-issue. Too late did I realize that I could do nothing about it.

Who reviews the reviewers? The reviewers do. I see the appeal of scoring and weighting reviews. It certainly makes reviewing more of a learning experience, which must be a good thing. But I feel rather confused about what I am supposed to write as reviewer feedback. Evidently, I’m not alone, because people seem to put rather different things in the feedback box.

Since the Peerage of Science team have designed the whole format and platform, I assume that every part of the process is thought through. The feedback forms, the prompts that are shown with each step, the point at which different pieces of information is revealed to you — this is part of a vision of better peer review. But sometimes, that vision doesn’t fully make sense to me. For example, if the authors want to sign their manuscripts, Peerage of Science has the following ominous note for them:

Peerage of Science encourages Authors to remain anonymous during the review process, to ensure unbiased peer review and publishing decisions. Reviewers usually expect this too, and may perceive signed submissions as attempts to influence their evaluation and respond accordingly.

Also, I’d really really really love to be able to turn down the frequency of email notifications. In the last four days, I’ve gotten more than one email a day about review processes I’m involved in, even if I can’t do anything more until after the next deadline.

EBM 2016, Marseille

In September, I went to the 20th Evolutionary Biology Meeting in Marseille. This is a very nice little meeting. I listened to a lot of talks, had some very good conversations, met some people, and presented our effort to map domestication traits in the chicken with quantitative trait locus mapping and gene expression (Johnsson & al 2015, 2016, and some unpublished stuff).

Time for a little conference report. Late, but this time less than a year from the actual conference. Here are some of my highlights:

Richard Cordaux on pill bugs, Wolbachia and sex manipulation — I did not know that Wolbachia, the intracellular parasite superstar of arthropods, had feminization of hosts in its repertoire (Cordaux & al 2004). Not only that, but in some populations of pill bugs, a large chunk of the genome of the feminizing Wolbachia has inserted into the pill bug genome, thus forming a new W chromosome (Leclercq & al 2016, published since the conference). He also told me how this is an example of the importance of preserving genetic resources — the lines of pill bugs have been maintained for a long time, and now they’re able to return to them with genomics tools and continue old lines of research. I think that is seriously cool.

Olaya Rendueles Garcia on positive frequency-dependent selection maintaining diversity in social bacterium Myxococcus xanthus (Rendueles, Amherd & Velicer 2015) — In my opinion, this was the best talk of the conference. It had everything: an interesting phenomenon, a compelling study system, good visuals and presentation. In short: M. xanthus of the same genotype tend to cooperate, inhabit their own little turfs in the soil, and exclude other genotypes. So it seems positive frequency-dependent selection maintains diversity in this case — diversity across patches, that is.

A very nice thing about this kind of meetings is that one gets a look into the amazing diversity of organisms. Or as someone put it: the complete and utter mess. In this department, I was particularly struck by … Sally Leys — sponges; Marie-Claude Marsolier-Kergoat — bison; Richard Dorrell — stramenopile chloroplasts.

I am by no means a transposable elements person. In fact, one might believe I was actively avoiding transposable elements by my choice of study species. But transposable elements are really quite interesting, and seem quite important to genome evolution, both to neutrally evolving and occasionally adaptive sequences. This meeting had a good transposon session, with several interesting talks.

Anton Crombach presented models the gap gene network in Drosophila melanogaster and Megaselia abdita, with some evolutionary perspectives (Crombach & al 2016). A couple of years ago, Marjoram, Zubair & Nuzhdin used the gap gene network as their example model to illustrate the suggestion to combine systems biology models with genetic mapping. I very much doubt (though I may be wrong; it happens a lot) that there is much meaningful variation within populations in the gap gene network. A between-species analysis seems much more fruitful, and leads to the interesting result where the outcome, in terms of gap gene expression along the embryo, is pretty similar but the way that the system gets there is quite different.

If you’ve had a beer with me and talked about the future of quantitative genetics, you’re pretty likely to have heard me talk about how in the bright future, we will not just map variation in phenotypes, but in the parameters of dynamical models. (I also think that the mapping will take place through fully Bayesian hierarchical models where the same posterior can be variously summarized for doing genomic prediction or for mapping the major quantitative trait genes, interactions etc. Of course, setting up and running whole-genome long read sequencing will be as convenient and cheap as an overnight PCR. And generally, there will be pie in the sky etc.) At any rate, what Anton Crombach showed was an example of combining systems biology modelling with variation (between clades). I thought it was exciting.

It was fun to hear Didier Raoult, one of the discoverers of giant viruses, speak. He was somewhat of a quotation machine.

”One of the major problems in biology is that people believe what they’ve learned.”

(About viruses being alive or not) ”People ask: are they alive, are they alive? I don’t care, and they don’t care either”

Very entertaining, and quite fascinating stuff about giant viruses.

If there are any readers out there who worry about social media ruining science by spilling the beans about unpublished results presented at meetings, do not worry. There were a few more cool unpublished things. Conference participants, you probably don’t know who you are, but I eagerly await your papers.

I think this will be the last evolution-themed conference for me in a while. The EBM definitely has a different range of themes than the others I’ve been to: ESEB, or rather: the subset of ESEB I see choosing my adventure through the multiple-session programme, and the Swedish evolution meetings. There was more molecular evolution, more microorganisms and even some orgin of life research.

Morning coffee: against validation and optimization


It appears like I’m accumulating pet peeves at an alarming rate. In all probability, I am guilty of most of them myself, but that is no reason not to complain about them on the internet. For example: Spend some time in a genetics lab, and you will probably hear talk of ”validation” and ”optimization”. But those things rarely happen in a lab.

According to a dictionary, to ”optimize” means to make something as good as possible. That is almost never possible, nor desirable. What we really do is change things until they work according to some accepted standard. That is not optimization; that is tweaking.

To ”validate” means to confirm to that something is true, which is rarely possible. Occasionally we have something to compare to that you are really sure about, so that if a method agrees with it, we can be pretty certain that it works. But a lot of time, we don’t know the answer. The best we can do is to gather additional evidence.

Additional evidence, ideally from some other method with very different assumptions, is great. So is adjusting a protocol until it performs sufficiently well. So why not just say what we mean?

”You keep using that word. I do not think that it means what you think it means.”

A year ago in Lund: the panel discussion at Evolution in Sweden 2016

This meeting took place on the 13th and 14th of January 2016 in Lund. It feels a bit odd to write about it now, but my blog is clearly in a state of anachronistic anarchy as well as ett upphöjt tillstånd av språklig förvirring, so that’s okay. It was a nice meeting, spanning quite a lot of things, from mosasaurs to retroviruses. It ended with a panel discussion of sorts that made me want to see more panel discussions at meetings.

The panel consisted of Anna-Liisa Laine, Sergey Gavrilets, Per Lundberg, Niklas Wahlberg, and Charlie Cornwallis, and a lot of people joined in with comments. I don’t know how the participants were chosen (Anna-Liisa Laine and Sergey Gavrilets were the invited speakers, so they seem like obvious choices), or how they were briefed; Per Lundberg served as a moderator and asked the other participants about their predictions about the future of the field (if memory serves me right).

I thought some of the points were interesting. One of Sergey Gavrilets’ three anticipated future developments was links between different levels of organisation; he mentioned systems biology and community ecology in the same breath. This sounded interesting to me, who not so secretly dreams of the day when systems biology, quantitative genetics, and populations genetics can all be brought to bear on the same phenotypes. (The other two directions of research he brought up were cliodynamics and human evolution.) He himself had, earlier in his talk, provided an example where a model of human behaviour shows the possibility of something interesting — that a kind of cooperation or drive for equality can be favoured without anything like kin or group selection. That is, in some circumstances it pays to protect the weak, and thus make sure that they bullies do not get too much ahead. He said something to the effect that now is the time to apply evolutionary biology to humans. I would disagree with that. On the one hand, if you are interested in studying humans, any time is the time. On the other hand, if the claim is that now, evolutionary biology is mature and solid, so one can go out and apply it to help other disciplines to sort out their problems … I think that would be overly optimistic.

A lot of the discussion was about Mats Björklund‘s talk about predicting evolution, or failing to do so. Unfortunately, I think he had already left, and this was the one talk of the conference that I missed (due to dull practical circumstances stemming from a misplaced wallet), so this part of the discussion mostly passed me by.

A commonplace that recurred a few times was jokes about sequencing … this or that will not be solved by sequencing thousands of genomes, or by big data — you know the kind. This is true, of course; massively parallel sequencing is good when you want to 1) make a new reference genome sequence; 2) get lots and lots of genetic markers or 3) quantify sequences in some library. That certainly doesn’t cover all of evolutionary biology, but it is still quite useful. Every time this came up part of me felt like putting my hand up to declare that I do in fact think that sequencing thousands of individuals is a good idea. But I didn’t, so I write it here where even fewer people will read it.

This is (according to my notes) what the whiteboard said at the end of the session:

”It’s complicated …”
”We need more data …”
”Predictions are difficult/impossible”
”We need more models”

Business as usual
Eventually we’ll get there (where?)
Revise assumptions, models, theories, methods, what to measure

Nothing in evolutionary biology makes sense except in the light of ecology phylogeny disease

Everything in evolution makes sense in the light of mangled Dobzhansky quotes.

(Seriously, I get why pastiches of this particular quote are so common: It’s a very good turn of phrase, and one can easily substitute the scientific field and the concept one thinks is particularly important. Nothing in behavioural ecology makes sense except in the light of Zahavi’s handicap principle etc. It is a fun internal joke, but at the same time sounds properly authoritative. Michael Lynch’s version sometimes seems to be quoted in the latter way.)