Morning coffee: multilevel drift

There is an abstract account of natural selection (Lewontin 1970) where one observes that any population of entities, whatever they may be, will evolve through natural selection if (1) there is variation, that (2) affects reproductive success, and (3) is heritable.

I don’t know how I missed this before, but it recently occurred to me that there must be a similarly abstract account of drift, where a population will evolve through drift if there is (1) variation, (2) that is heritable, and (3) sampling due to finite population size.

Drift may not be negligible, especially since at a higher level of organization, the population size should be smaller, making natural selection relatively less efficient.
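To make the three ingredients concrete, here is a minimal sketch of drift in R, assuming a Wright-Fisher-style model with two heritable variants and no selection. The function name simulate_drift and all parameter values are made up for illustration. Each generation is a binomial sample from the previous one, so the only thing that changes frequencies is sampling in a finite population.

```r
# Minimal Wright-Fisher-style sketch: heritable variation plus finite sampling.
# Function name and parameter values are illustrative assumptions.
simulate_drift <- function(pop_size, generations, start_freq = 0.5) {
  freq <- numeric(generations + 1)
  freq[1] <- start_freq
  for (g in seq_len(generations)) {
    # Each of the pop_size offspring inherits a variant drawn at random from
    # the parental generation; neither variant has a reproductive advantage.
    freq[g + 1] <- rbinom(1, size = pop_size, prob = freq[g]) / pop_size
  }
  freq
}

set.seed(1)
# Final frequencies after 50 generations in replicate populations of two sizes.
final_small <- replicate(200, simulate_drift(20, 50)[51])
final_large <- replicate(200, simulate_drift(2000, 50)[51])

# Spread of outcomes across replicates; drift should scatter the small
# populations much more widely than the large ones.
var(final_small)
var(final_large)
```

If the higher-level entities are fewer, they correspond to the small pop_size case, where sampling alone moves frequencies a long way.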

It seems dplyr is overtaking correlation heatmaps

(… on my blog, that is.)

For a long time, my correlation heatmap with ggplot2 was the most viewed post on this blog. It still leads the overall top list, but by far the most searched and visited post nowadays is this one about dplyr (followed by its sibling about plyr).

I fully support this, since data wrangling and reorganization logically come before plotting (especially in the ggplot2 philosophy).

But it’s also kind of a shame, because it’s not a very good dplyr post, and the one about the correlation heatmap is not a very good ggplot2 post. Thankfully, there is a new edition of the ggplot2 book by Hadley Wickham, and a new book by him and Garrett Grolemund about data analysis with modern R packages. I’m looking forward to reading them.

Personally, I still haven’t made the switch from plyr and reshape2 to dplyr and tidyr. But here is the updated tidyverse-using version of how to quickly calculate summary statistics from a data frame:

library(tidyr)
library(dplyr)
library(magrittr)

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

gather(data, response1, response2, value = "value", key = "variable") %>%
  group_by(sex, treatment, variable) %>%
  summarise(mean = mean(value), sd = sd(value))

Row by row we:

1-3: Load the packages.

5-8: Simulate some nonsense data.

10: Transform the simulated dataset to long form. This means that the two variables response1 and response2 get collected into one column, which will be called ”value”. The key column, here called ”variable”, will indicate which variable each row belongs to. (gather is tidyr’s version of melt.)

11: Group the resulting dataframe by sex, treatment and variable. (This is like the second argument to d*ply.)

12: Calculate the summary statistics.

Source: local data frame [8 x 5]
Groups: sex, treatment [?]

    sex treatment  variable        mean        sd
  (dbl)     (dbl)     (chr)       (dbl)     (dbl)
1     1         1 response1 -0.02806896 1.0400225
2     1         1 response2 -0.01822188 1.0350210
3     1         2 response1  0.06307962 1.0222481
4     1         2 response2 -0.01388931 0.9407992
5     2         1 response1 -0.06748091 0.9843697
6     2         1 response2  0.01269587 1.0189592
7     2         2 response1 -0.01399262 0.9696955
8     2         2 response2  0.10413442 0.9417059
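In later tidyr releases (1.0.0 onwards, as far as I know), gather has been superseded by pivot_longer. Here is a sketch of the same summary with that function, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument). The name summary_stats is just for illustration.

```r
library(tidyr)
library(dplyr)

# Same kind of simulated nonsense data as above.
data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

summary_stats <- data %>%
  # Long form: names_to plays the role of key, values_to the role of value.
  pivot_longer(c(response1, response2),
               names_to = "variable", values_to = "value") %>%
  group_by(sex, treatment, variable) %>%
  # .groups = "drop" returns an ungrouped result.
  summarise(mean = mean(value), sd = sd(value), .groups = "drop")

summary_stats
```

The result should have the same shape as the output above: one row per combination of sex, treatment and variable.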

Peerage of science, first impressions

After I wrote a post about reviewing papers, Craig Primmer suggested on Twitter that I look into Peerage of Science. Peerage of Science is a portal and community for peer review. It has a lot of good ideas. It decouples reviewing from journal submission, but it is still made for papers aimed to be published in a conventional journal. It collects reviewers and manuscripts from different fields in one place, allows interested reviewers to select papers they want to review, and provides anonymity (if the authors want it). I once wrote a few sentences about what I thought ”optimal peer review” would be like, for a PLOS early career researchers’ travel grant. (I did not get the grant.) My ideas for better peer review were probably not that bright, or that realistic, but they did share several features with the Peerage of Science model. Naturally, I was interested.

I’ve tried reviewing for Peerage of Science for a couple of months. My first impression is that it seems to work really well. The benefits are quite obvious: I’ve seen some of the papers get more reviews than they would typically get at a journal, and the reviews usually seem no less elaborate. The structured form for reviewing is helpful, and corresponds well with what I think a good review should be like. I think I’ll stick around, look out for the notifications, and jump in when a paper is close to my interests. I really hope enough people will use Peerage of Science for it to be successful.

There are also downsides to this model:

There seems to be an uneven allocation of reviewer effort. Some papers have a lot of reviewers, but some have only one. Of course, only the people at Peerage of Science know the actual distribution of reviews. Maybe one-reviewer processes are actually very rare! This is a bit like post-publication review, except that there, you can at least know who else has already commented on a paper. I know some people think that this is a good thing. Papers that attract interest also attract scrutiny, and thus reviewer effort is directed towards where it is most needed. But I think that in the ideal case, every paper would be reviewed thoroughly. This could be helped by an indicator of how many other reviewers have engaged, or at least already posted their essays.

There is also the frustration of coming late to a process where one feels the reviewers have done a poor job. This was my first experience. I joined a review process that was at its last stages, and found a short, rather sloppy review that missed most of what I thought were the important points, and belaboured what I thought was a non-issue. Too late did I realize that I could do nothing about it.

Who reviews the reviewers? The reviewers do. I see the appeal of scoring and weighting reviews. It certainly makes reviewing more of a learning experience, which must be a good thing. But I feel rather confused about what I am supposed to write as reviewer feedback. Evidently, I’m not alone, because people seem to put rather different things in the feedback box.

Since the Peerage of Science team have designed the whole format and platform, I assume that every part of the process is thought through. The feedback forms, the prompts that are shown with each step, the point at which different pieces of information are revealed to you: all of this is part of a vision of better peer review. But sometimes, that vision doesn’t fully make sense to me. For example, if the authors want to sign their manuscripts, Peerage of Science has the following ominous note for them:

Peerage of Science encourages Authors to remain anonymous during the review process, to ensure unbiased peer review and publishing decisions. Reviewers usually expect this too, and may perceive signed submissions as attempts to influence their evaluation and respond accordingly.

Also, I’d really really really love to be able to turn down the frequency of email notifications. In the last four days, I’ve gotten more than one email a day about review processes I’m involved in, even if I can’t do anything more until after the next deadline.

Using R: tibbles and the t.test function

A participant in the R course I’m teaching showed me a case where a tbl_df (the new flavour of data frame provided by the tibble package; standard in new RStudio versions) interacts badly with the t.test function. I had not seen this happen before. The reason is this:

Interacting with legacy code
A handful of functions are don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame (tibble announcement on RStudio blog)

Here is code that reproduces the situation (tibble version 1.2):

library(tibble)
data(chickwts)
chick_tibble <- as_tibble(chickwts)
casein <- subset(chickwts, feed == "casein")
sunflower <- subset(chick_tibble, feed == "sunflower")
t.test(sunflower$weight, casein$weight) ## this works
t.test(as.data.frame(sunflower[, 1]), as.data.frame(casein[, 1])) ## this works too
t.test(sunflower[, 1], casein[, 1]) ## this doesn't

Error: Unsupported use of matrix or array for column indexing

I did not know that. The solution, which they found themselves, is to use as.data.frame.
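Here is a small sketch of the difference and of two other ways around it, assuming only that the tibble package is installed. The names weight_dollar and weight_double and the object test_result are made up for illustration. Single-bracket indexing drops to a vector for a data frame but not for a tibble, while $ and [[ return a vector in both cases.

```r
library(tibble)

data(chickwts)
chick_tibble <- as_tibble(chickwts)
casein <- chickwts[chickwts$feed == "casein", ]
sunflower <- chick_tibble[chick_tibble$feed == "sunflower", ]

# Single-bracket column indexing: the data frame drops to a vector,
# the tibble stays a (one-column) tibble.
is.data.frame(casein[, 1])
is.data.frame(sunflower[, 1])

# $ and [[ extract a plain vector from a tibble as well.
weight_dollar <- sunflower$weight
weight_double <- sunflower[["weight"]]

# With plain vectors on both sides, t.test behaves as usual.
test_result <- t.test(weight_double, casein[, 1])
test_result
```

This does not replace as.data.frame when legacy code does the indexing internally, but it avoids the problem when you control the call yourself.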

I can see why not dropping to a vector makes sense. I’m sure you’ve at some point expected a data frame and got an ”$ operator is invalid for atomic vectors”. But it’s an unfortunate fact that the number of weird little thingamajigs to remember is always strictly increasing as the language evolves. And it’s a bit annoying that the standard RStudio setup breaks code that uses an old stats function, even if it’s in a somewhat non-obvious way.

EBM 2016, Marseille

In September, I went to the 20th Evolutionary Biology Meeting in Marseille. This is a very nice little meeting. I listened to a lot of talks, had some very good conversations, met some people, and presented our effort to map domestication traits in the chicken with quantitative trait locus mapping and gene expression (Johnsson & al 2015, 2016, and some unpublished stuff).

Time for a little conference report. Late, but this time less than a year from the actual conference. Here are some of my highlights:

Richard Cordaux on pill bugs, Wolbachia and sex manipulation — I did not know that Wolbachia, the intracellular parasite superstar of arthropods, had feminization of hosts in its repertoire (Cordaux & al 2004). Not only that, but in some populations of pill bugs, a large chunk of the genome of the feminizing Wolbachia has inserted into the pill bug genome, thus forming a new W chromosome (Leclercq & al 2016, published since the conference). He also told me how this is an example of the importance of preserving genetic resources — the lines of pill bugs have been maintained for a long time, and now they’re able to return to them with genomics tools and continue old lines of research. I think that is seriously cool.

Olaya Rendueles Garcia on positive frequency-dependent selection maintaining diversity in social bacterium Myxococcus xanthus (Rendueles, Amherd & Velicer 2015) — In my opinion, this was the best talk of the conference. It had everything: an interesting phenomenon, a compelling study system, good visuals and presentation. In short: M. xanthus of the same genotype tend to cooperate, inhabit their own little turfs in the soil, and exclude other genotypes. So it seems positive frequency-dependent selection maintains diversity in this case — diversity across patches, that is.

A very nice thing about this kind of meeting is that one gets a look into the amazing diversity of organisms. Or as someone put it: the complete and utter mess. In this department, I was particularly struck by … Sally Leys (sponges); Marie-Claude Marsolier-Kergoat (bison); Richard Dorrell (stramenopile chloroplasts).

I am by no means a transposable elements person. In fact, one might believe I was actively avoiding transposable elements by my choice of study species. But transposable elements are really quite interesting, and seem quite important to genome evolution, both to neutrally evolving and occasionally adaptive sequences. This meeting had a good transposon session, with several interesting talks.

Anton Crombach presented models of the gap gene network in Drosophila melanogaster and Megaselia abdita, with some evolutionary perspectives (Crombach & al 2016). A couple of years ago, Marjoram, Zubair & Nuzhdin used the gap gene network as their example model to illustrate the suggestion to combine systems biology models with genetic mapping. I very much doubt (though I may be wrong; it happens a lot) that there is much meaningful variation within populations in the gap gene network. A between-species analysis seems much more fruitful, and leads to the interesting result where the outcome, in terms of gap gene expression along the embryo, is pretty similar but the way that the system gets there is quite different.

If you’ve had a beer with me and talked about the future of quantitative genetics, you’re pretty likely to have heard me talk about how in the bright future, we will not just map variation in phenotypes, but in the parameters of dynamical models. (I also think that the mapping will take place through fully Bayesian hierarchical models where the same posterior can be variously summarized for doing genomic prediction or for mapping the major quantitative trait genes, interactions etc. Of course, setting up and running whole-genome long read sequencing will be as convenient and cheap as an overnight PCR. And generally, there will be pie in the sky etc.) At any rate, what Anton Crombach showed was an example of combining systems biology modelling with variation (between clades). I thought it was exciting.

It was fun to hear Didier Raoult, one of the discoverers of giant viruses, speak. He was somewhat of a quotation machine.

”One of the major problems in biology is that people believe what they’ve learned.”

(About viruses being alive or not) ”People ask: are they alive, are they alive? I don’t care, and they don’t care either”

Very entertaining, and quite fascinating stuff about giant viruses.

If there are any readers out there who worry about social media ruining science by spilling the beans about unpublished results presented at meetings, do not worry. There were a few more cool unpublished things. Conference participants, you probably don’t know who you are, but I eagerly await your papers.

I think this will be the last evolution-themed conference for me in a while. The EBM definitely has a different range of themes than the others I’ve been to: ESEB, or rather: the subset of ESEB I see when choosing my adventure through the multiple-session programme, and the Swedish evolution meetings. There was more molecular evolution, more microorganisms and even some origin of life research.

Morning coffee: against validation and optimization

It appears like I’m accumulating pet peeves at an alarming rate. In all probability, I am guilty of most of them myself, but that is no reason not to complain about them on the internet. For example: Spend some time in a genetics lab, and you will probably hear talk of ”validation” and ”optimization”. But those things rarely happen in a lab.

According to a dictionary, to ”optimize” means to make something as good as possible. That is almost never possible, nor desirable. What we really do is change things until they work according to some accepted standard. That is not optimization; that is tweaking.

To ”validate” means to confirm that something is true, which is rarely possible. Occasionally we have something to compare to that we are really sure about, so that if a method agrees with it, we can be pretty certain that it works. But a lot of the time, we don’t know the answer. The best we can do is to gather additional evidence.

Additional evidence, ideally from some other method with very different assumptions, is great. So is adjusting a protocol until it performs sufficiently well. So why not just say what we mean?

”You keep using that word. I do not think that it means what you think it means.”

A year ago in Lund: the panel discussion at Evolution in Sweden 2016

This meeting took place on the 13th and 14th of January 2016 in Lund. It feels a bit odd to write about it now, but my blog is clearly in a state of anachronistic anarchy as well as ett upphöjt tillstånd av språklig förvirring (an elevated state of linguistic confusion), so that’s okay. It was a nice meeting, spanning quite a lot of things, from mosasaurs to retroviruses. It ended with a panel discussion of sorts that made me want to see more panel discussions at meetings.

The panel consisted of Anna-Liisa Laine, Sergey Gavrilets, Per Lundberg, Niklas Wahlberg, and Charlie Cornwallis, and a lot of people joined in with comments. I don’t know how the participants were chosen (Anna-Liisa Laine and Sergey Gavrilets were the invited speakers, so they seem like obvious choices), or how they were briefed; Per Lundberg served as a moderator and asked the other participants about their predictions about the future of the field (if memory serves me right).

I thought some of the points were interesting. One of Sergey Gavrilets’ three anticipated future developments was links between different levels of organisation; he mentioned systems biology and community ecology in the same breath. This sounded interesting to me, who not so secretly dreams of the day when systems biology, quantitative genetics, and population genetics can all be brought to bear on the same phenotypes. (The other two directions of research he brought up were cliodynamics and human evolution.) He himself had, earlier in his talk, provided an example where a model of human behaviour shows the possibility of something interesting: that a kind of cooperation or drive for equality can be favoured without anything like kin or group selection. That is, in some circumstances it pays to protect the weak, and thus make sure that the bullies do not get too much ahead. He said something to the effect that now is the time to apply evolutionary biology to humans. I would disagree with that. On the one hand, if you are interested in studying humans, any time is the time. On the other hand, if the claim is that now, evolutionary biology is mature and solid, so one can go out and apply it to help other disciplines to sort out their problems … I think that would be overly optimistic.

A lot of the discussion was about Mats Björklund’s talk about predicting evolution, or failing to do so. Unfortunately, I think he had already left, and this was the one talk of the conference that I missed (due to dull practical circumstances stemming from a misplaced wallet), so this part of the discussion mostly passed me by.

A commonplace that recurred a few times was jokes about sequencing … this or that will not be solved by sequencing thousands of genomes, or by big data — you know the kind. This is true, of course; massively parallel sequencing is good when you want to 1) make a new reference genome sequence; 2) get lots and lots of genetic markers or 3) quantify sequences in some library. That certainly doesn’t cover all of evolutionary biology, but it is still quite useful. Every time this came up part of me felt like putting my hand up to declare that I do in fact think that sequencing thousands of individuals is a good idea. But I didn’t, so I write it here where even fewer people will read it.

This is (according to my notes) what the whiteboard said at the end of the session:

”It’s complicated …”
”We need more data …”
”Predictions are difficult/impossible”
”We need more models”

Business as usual
Eventually we’ll get there (where?)
Revise assumptions, models, theories, methods, what to measure

Nothing in evolutionary biology makes sense except in the light of ecology phylogeny disease

Everything in evolution makes sense in the light of mangled Dobzhansky quotes.

(Seriously, I get why pastiches of this particular quote are so common: It’s a very good turn of phrase, and one can easily substitute the scientific field and the concept one thinks is particularly important. Nothing in behavioural ecology makes sense except in the light of Zahavi’s handicap principle etc. It is a fun internal joke, but at the same time sounds properly authoritative. Michael Lynch’s version sometimes seems to be quoted in the latter way.)