Scripting for data analysis (with R)

Course materials (GitHub)

This was a PhD course given in the spring of 2017 at Linköping University. The course was organised by the graduate school Forum scientium and was aimed at people who might be interested in using R for data analysis. The materials developed from a part of a previous PhD course from a couple of years ago, an R tutorial given as part of the Behaviour genetics Masters course, and the Wright lab computation lunches.

Around twenty people attended the seminars, and a couple of handfuls of people completed the homeworks. I don’t know how much one should read into the course evaluation form, but the feedback was mostly positive. Some people had previous exposure to R, and did the first homework in an hour. Others had never programmed in any language, and had a hard time getting started.

There is certainly scope for improvement. For example, some of the packages used could be substituted for more contemporary tools. One could say that the course is slouching towards the tidyverse. But I worry a bit about making the participants feel too boxed in. I don’t want them to feel that they’re taught a way that will solve some anticipated type of problems very neatly, but that may not generalize. Once I’ve made the switch to dplyr and tidyr (and maybe even purr … though I hesitate) fully myself, I would probably use them in teaching too. Another nice plus would be to be able to use R for data science as course literature. The readings now are scattered; maybe a monolithic book would be good.

I’ve tried, in every iteration, to emphasize the importance of writing scripts, even when working interactively with R. I still think I need to emphasize it even more. There is also a kind of ”do as I say, not as I do” issue, since in the seminars, I demo some things by just typing them into the console. I’ll force myself to write them into a script instead.

Possible alternate flavours for the course include: A longer version expanding on the same topics. I don’t think one should cram more contents in. I’d like to have actual projects where the participants can analyze, visualize and present data and simulations.

This is the course plan we sent out:

1. A crash course in R

Why do data analysis with a scripting language
The RStudio interface
Using R as a calculator
Working interactively and writing code
Getting help
Reading and looking at data
Installing useful packages
A first graph with ggplot2

Homework for next time: The Unicorn Dataset, exercises in reading data, descriptive statistics, linear models and a few statistical graphs.

2. Programming for data analysis

Programming languages one may encounter in science
Common concepts and code examples
Data structures in R
Data frames
Control flow

Homework for next time: The Unicorn Expression Dataset, exercises in data wrangling and more interesting graphs.

3. Working with moderately large data

Exercise followup
More about functions
Functional and imperative programming
Doing things many times, loops and plyr
Simulating data
Working on a cluster

Final homework: Design analysis by simulation: pick a data analysis project that you care about; simulate data based on a model and reasonable effect size; implement the data analysis; and apply it to simulated data with and without effects to estimate power and other design characteristics. This ties together skills from all seminars.

Summer of data science 1: Genomic prediction machines #SoDS17

Genetics is a data science, right?

One of my Summer of data science learning points was to play with out of the box prediction tools. So let’s try out a few genomic prediction methods. The code is on GitHub, and the simulated data are on Figshare.

Genomic selection is the happy melding of quantitative and molecular genetics. It means using genetic markers en masse to predict traits and and make breeding decisions. It can give you better accuracy in choosing the right plants or animals to pair, and it can allow you to take shortcuts by DNA testing individuals instead of having to test them or their offspring for the trait. There are a bunch of statistical models that can be used for genomic prediction. Now, the choice of prediction algorithm is probably not the most important part of genomic selection, but bear with me.

First, we need some data. For this example, I used AlphaSim (Faux & al 2016), and the AlphaSim graphical user interface, to simulate a toy breeding population. We simulate 10 chromosomes á 100 cM, with 100 additively acting causal variants and 2000 genetic markers per chromosome. The initial genotypes come from neutral simulations. We run one generation of random mating, then three generations of selection on trait values. Each generation has 1000 individuals, with 25 males and 500 females breeding.

So we’re talking a small-ish population with a lot of relatedness and reproductive skew on the male side. We will use the two first two generations of selection (2000 individuals) to train, and try to predict the breeding values of the fourth generation (1000 individuals). Let’s use two of the typical mixed models used for genomic selection, and two tree methods.

We start by splitting the dataset and centring the genotypes by subtracting the mean of each column. Centring will not change predictions, but it may help with fitting the models (Strandén & Christensen 2011).

Let’s begin with the workhorse of genomic prediction: the linear mixed model where all marker coefficients are drawn from a normal distribution. This works out to be the same as GBLUP, the GCTA model, GREML, … a beloved child has many names. We can fit it with the R package BGLR. If we predict values for the held-out testing generation and compare with the real (simulated) values, it looks like this. The first panel shows a comparison with phenotypes, and the second with breeding values.

This gives us correlations of 0.49 between prediction and phenotype, and 0.77 between prediction and breeding value.

This is a plot of the Markov chain Monte Carlo we use to sample from the model. If a chain behaves well, it is supposed to have converged on the target distribution, and there is supposed to be low autocorrelation. Here is a trace plot of four chains for the marker variance (with the coda package). We try to be responsible Bayesian citizens and run the analysis multiple times, and with four chains we get very similar results from each of them, and a potential scale reduction factor of 1.01 (it should be close to 1 when it works). But the autocorrelation is high, so the chains do not explore the posterior distribution very efficiently.

BGLR can also fit a few of the ”Bayesian alphabet” variants of the mixed model. They put different priors on the distribution of marker coefficients allow for large effect variants. BayesB uses a mixture prior, where a lot of effects are assumed to be zero (Meuwissen, Hayes & Goddard 2001). The way we simulated the dataset is actually close to the BayesB model: a lot of variants have no effect. However, mixture models like BayesB are notoriously difficult to fit — and in this case, it clearly doesn’t work that well. The plots below show chains for two BayesB parameters, with potential scale reduction factors of 1.4 and 1.5. So, even if the model gives us the same accuracy as ridge regression (0.77), we can’t know if this reflects what BayesB could do.

On to the trees! Let’s try Random forest and Bayesian additive regression trees (BART). Regression trees make models as bifurcating trees. Something like the regression variant of: ”If the animal has a beak, check if it has a venomous spur. If it does, say that it’s a platypus. If it doesn’t, check whether it quacks like a duck …” The random forest makes a lot of trees on random subsets of the data, and combines the inferences from them. BART makes a sum of trees. Both a random forest (randomForest package) and a BART model on this dataset (fit with bartMachine package), gives a lower accuracy — 0.66 for random forest and 0.72 for BART. This is not so unexpected, because the strength of tree models seems to lie in capturing non-additive effects. And this dataset, by construction, has purely additive inheritance. Both BART and random forest have hyperparameters that one needs to set. I used package defaults for random forest, values that worked well for Waldmann (2016), but one probably should choose them by cross validation.

Finally, we can use classical quantitative genetics to estimate breeding values from the pedigree and relatives’ trait values. Fitting the so called animal model in two ways (pedigree package and MCMCglmm) give accuracies of 0.59 and 0.60.

So, in summary, we recover the common wisdom that the linear mixed model does the job well. It was more accurate than just pedigree, and a bit better than BART. Of course, the point of this post is not to make a fair comparison of methods. Also, the real magic of genomic selection, presumably, happens on every step of the way. How do you get to that neat individual-by-marker matrix in the first place, how do you deal with missing data and data from different sources, what and when do you measure, what do you do with the predictions … But you knew that already.

Journal club of one: ”An expanded view of complex traits: from polygenic to omnigenic”

An expanded view of complex traits: from polygenic to omnigenic” by Boyle, Yang & Pritchard (2017) came out recently in Cell. It has been all over Twitter, and I’m sure it will influence a lot of people’s thinking — rightfully so. It is a good read, pulls in a lot of threads, and has a nice blend of data analysis and reasoning. It’s good. Go read it!

The paper argues that for a lot of quantitative traits — specifically human diseases and height — almost every gene will be associated with every trait. More than that, almost every gene will be causally involved in every trait, most in indirect ways.

It continues with the kind of analysis used in Pickrell (2014), Finucane & al (2015) among many others, that break genome-wide association down down by genome annotation. How much variability can we attribute to variants in open chromatin regions? How much to genes annotated as ”protein bindning”? And so on.

These analyses point towards gene regulation being important, but not that strongly towards particular annotation terms or pathways. The authors take this to mean that, while genetic mapping, including GWAS, finds causally involved genes, it will not necessarily find ”relevant” genes. That is, not necessarily genes that are the central regulators of the trait. That may be a problem if you want to use genetic mapping to find drug targets, pathways to engineer, or similar.

This observation must speak to anyone who has looked at a list of genes from some mapping effort and thought: ”well, that is mostly genes we know nothing about … and something related to cancer”.

They write:

In summary, for a variety of traits, the largest-effect variants are modestly enriched in specific genes or pathways that may play direct roles in disease. However, the SNPs that contribute the bulk of the heritability tend to be spread across the genome and are not near genes with disease-specific functions. The clearest pattern is that the association signal is broadly enriched in regions that are transcriptionally active or involved in transcriptional regulation in disease-relevant cell types but absent from regions that are transcriptionally inactive in those cell types. For typical traits, huge numbers of variants contribute to heritability, in striking consistency with Fisher’s century-old infinitesimal model.

I summary: it’s universal pleiotropy. I don’t think there is any reason to settle on ”cellular” networks exclusively. After all, cells in a multicellular organism share a common pool of energy and nutrients, and exchange all kinds of signalling molecules. This agrees with classical models and the thinking in evolutionary genetics (see Rockman & Paaby 2013). Or look at this expression QTL and gene network study in aspen (Mähler & al 2017): the genes with eQTL tend to be peripheral, not network hub genes.

It’s a bit like in behaviour genetics, where people are fond of making up these elaborate hypothetical causal stories: if eyesight is heritable, and children with bad eyesight get glasses, and the way you treat a child who wears glasses somehow reinforces certain behaviours, so that children who wear glasses grow up to score a bit better on certain tests — are the eyesight variants also ”intelligence variants”? This is supposed to be a reductio ad absurdum of the idea of calling anything an ”intelligence variant” … But I suspect that this is what genetic causation, when fully laid out, will sometimes look like. It can be messy. It can involve elements that we don’t think of as ”relevant” to the trait.

There are caveats, of course:

One reason that there is a clearer enrichment of variant-level annotation such as open chromatin than in gene-level annotation may be that the resolution is higher. We don’t really know that much about how molecular variation translates to higher level trait variation. And let’s not forget that for most GWAS hits, we don’t know the causative gene.

They suggest defining ”core genes” like this: ”conditional on the genotype and expres-
sion levels of all core genes, the genotypes and expression levels of peripheral genes no longer matter”. Core genes are genes that d-separate the peripheral genes from a trait. That makes sense. Some small number of genes may be necessary molecular intermediates for a trait. But as far as I can tell, it doesn’t follow that useful biological information only comes from studying core genes, nor does it follow that we can easily tell if we’ve hit a core or a peripheral gene.

Also, there are quantitative genetics applications of GWAS data that are agnostic of pathways and genes. If we want to use genetics for prediction, for precision medicine etc, we do not really need to know the functions of the causative genes. We need big cohorts, well defined trait measurements, good coverage of genetic variants, and a good idea of environmental risk factors to feed into prediction models.

It’s pretty entertaining to see the popular articles about this paper, and the juxtaposition of quotes like ”that all those big, expensive genome-wide association studies may wind up being little more than a waste of time” (Gizmodo) with researchers taking the opportunity to bring up up their favourite hypotheses about missing heritability — even if it’s not the same people saying both things. Because if we want to study rare variants, or complex epistatic interactions, or epigenomics, or what have you, the studies will have to be just as big and expensive, probably even more so.

Just please don’t call it ”omnigenetics”.


Boyle, Evan A., Yang I. Li, and Jonathan K. Pritchard. ”An Expanded View of Complex Traits: From Polygenic to Omnigenic.” Cell 169.7 (2017): 1177-1186.

Using R: When using do in dplyr, don’t forget the dot

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

results <- ddply(some_data, "key", function(x) {
  ## do something; return data.frame()

Or maybe, if I felt serious and thought the function would ever be used again, I’d write:

calculate <- function(x) {
  ## do something; return data.frame()
result <- ddply(some_data, "key", calculate)

Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

grouped <- group_by(some_data, key)
result <- do(grouped, calculate(.))

Or once again with magrittr:

some_data %>%
  group_by(key) %>%
  do(calculate(.)) -> result

(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

grouped <- group_by(some_data, key)
## will not work: Error: Results are not data frames at positions ...
try(result <- do(grouped, calculate))

Don’t forget the dot!

Mutation, selection, and drift (with Shiny)

Imagine a gene that comes in two variants, where one of them is deleterious to the carrier. This is not so hard to imagine, and it is often the case. Most mutations don’t matter at all. Of those that matter, most are damaging.

Next, imagine that the mutation happens over and over again with some mutation rate. This is also not so hard. After all, given enough time, every possible DNA sequence should occur, as if by monkeys and typewriters. In this case, since we’re talking about the deleterious mutation rate, we don’t even need exactly the same DNA sequence to occur; rather, what is important is how often a class of mutations with the same consequences happen.

Let’s illustrate this with a Shiny app! I made this little thing that draws graphs like this:

This is supposed to show the trajectory of a deleterious genetic variant, with sliders to decide the population size, mutation rate, selection, dominance, and starting frequency. The lines are ten replicate populations, followed for 200 generations. The red line is the estimated equilibrium frequency — where the population would end up if it was infinitely large and not subject to random chance.

The app runs here:
And the code is here:

(Note: I don’t know how well this will work if every blog reader clicks on that link. Maybe it all crashes or the bandwidth runs out or whatnot. If so, you can always download the code and run in RStudio.)

We assume diploid genetics, random mating, and mutation only in one direction (broken genes never restore themselves). As in typical population genetics texts, we call the working variant ”A” and the working variant ”a”, and their frequencies p and q. The genotypes AA, Aa and aa will have frequencies p^2 , 2 p q and q^2 before selection.

Damaging variants tend to be recessive, that is, they hurt only when you have two of them. Imagine an enzyme that makes some crucial biochemical product, that you need some but not a lot of. If you have one working copy of the enzyme, you may be perfectly fine, but if you are left without any working copy, you will have a deficit. We can describe this by a dominance coefficient called h. If the dominance coefficient is one, the variant is completely dominant, so that it damages you even if you only have one copy. If the dominance coefficient is zero, the variant is completely recessive, and having one copy of it does not affect you at all.

The average reproductive success (”fitness”) of each genotype is described in terms of selection coefficients, which tells us how much selection there is against a genotype. Selection coefficients range from 0, which means that you’re winning, to 1 which means that you’ve been completely out-competed. For a recessive damaging variant, the AA homozygotes and Aa heterozygotes are unaffected, but the aa homozygotes suffers selection coefficient s.

In the general case, fitness values for each genotype are 1 for AA, 1 - hs for Aa and 1 - s for aa. We can think of this as the probability of contributing to the next generation.

What about the red line in the graphs? If natural selection keeps removing a mutation from the gene pool, and mutation keeps adding it back in again there may be some equilibrium frequency where they cancel out, and the frequency of the damaging variant is more or less constant. This is called mutation–selection balance.

Haldane (1937) came up with an expression for the equilibrium variant frequency:

q_{eq} = \frac {h s + \mu - \sqrt{ (hs - \mu)^2 + 4 s \mu } } {2 h s - 2 s}

I’ve changed his notation a bit to use h and s for dominance and selection coefficient. \mu is the mutation rate. It’s not easy to see what is going on here, but we can draw it in the graph, and see that it’s usually very small. In these small populations, where drift is a major player, the variants are often completely lost, or drift to higher frequency by chance.

(I don’t know if I can recommend learning by playing with an app, but I definitely learned things while making it. For instance that C++11 won’t work on unless you send the compiler a flag, and that it’s important to remember that both variants in a diploid organism can mutate. So I guess what I’m saying is: don’t use my app, but make your own. Or something.)


Haldane, J. B. S. ”The effect of variation of fitness.” The American Naturalist 71.735 (1937): 337-349.

Using R: a function that adds multiple ggplot2 layers

Another interesting thing that an R course participant identified: Sometimes one wants to make a function that returns multiple layers to be added to a ggplot2 plot. One could think that just adding them and returning would work, but it doesn’t. I think it has to do with how + is evaluated. There are a few workarounds that achieve similar results and may save typing.

First, some data to play with: this is a built-in dataset of chickens growing:


diet1 <- subset(ChickWeight, Diet == 1)
diet2 <- subset(ChickWeight, Diet == 2)

This is just an example that shows the phenomenon. The first two functions will work, but combining them won’t.

add_line <- function(df) {
  geom_line(aes(x = Time, y = weight, group = Chick), data = df)

add_points <- function(df) {
  geom_point(aes(x = Time, y = weight), data = df)

add_line_points <- function(df) {
  add_line(df) + add_points(df)

## works
(plot1 <- ggplot() + add_line(diet1) + add_points(diet1))

## won't work: non-numeric argument to binary operator
try((plot2 <- ggplot() + add_line_points(diet1)))

Update: In the comments, Eric Pedersen gave a neat solution: stick the layers in a list and add the list. Like so:

(plot2.5 <- ggplot() + list(add_line(diet1), add_points(diet1)))

Nice! I did not know that one.

Also, you can get the same result by putting mappings and data in the ggplot function. This will work if all layers are going to plot the same data, but that does it for some cases:

## bypasses the issue by putting mappings in ggplot()
(plot3 <- ggplot(aes(x = Time, y = weight, group = Chick), data = diet1) +
    geom_line() + geom_point())

One way is to write a function that takes the plot object as input, and returns a modified version of it. If we use the pipe operator %>%, found in the magrittr package, it even gets a ggplot2 like feel:

## bypasses the issue and gives a similar feel with pipes


add_line_points2 <- function(plot, df, ...) {
  plot +
    geom_line(aes(x = Time, y = weight, group = Chick), ..., data = df) +
    geom_point(aes(x = Time, y = weight), ..., data = df)

(plot4 <- ggplot() %>% add_line_points2(diet1) %>%
   add_line_points2(diet2, colour = "red"))

Finally, in many cases, one can stick all the data in a combined data frame, and avoid building up the plot from different data frames altogether.

## plot the whole dataset at once
(plot5 <- ggplot(aes(x = Time, y = weight, group = Chick, colour = Diet),
                 data = ChickWeight) +
   geom_line() + geom_point())

Okay, maybe that plot is a bit too busy to be good. But note how the difference between plotting a single diet and all diets at the same time is just one more mapping in aes(). No looping or custom functions required.

I hope that was of some use.

Using R: Don’t save your workspace

To everyone learning R: Don’t save your workspace.

When you exit an R session, you’re faced with the question of whether or not to save your workspace. You should almost never answer yes. Saving your workspace creates an image of your current variables and functions, and saves them to a file called ”.RData”. When you re-open R from that working directory, the workspace will be loaded, and all these things will be available to you again. But you don’t want that, so don’t save your workspace.

Loading a saved workspace turns your R script from a program, where everything happens logically according to the plan that is the code, to something akin to a cardboard box taken down from the attic, full of assorted pages and notebooks that may or may not be what they seem to be. You end up having to put an inordinate trust in your old self. I don’t know about your old selves, dear reader, but if they are anything like mine, don’t save your workspace.

What should one do instead? One should source the script often, ideally from freshly minted R sessions, to make sure to always be working with a script that runs and does what it’s supposed to. Storing a data frame in the workspace can seem comforting, but what happens the day I overwrite it by mistake? Don’t save your workspace.

Yes, I’m exaggerating. When using any modern computer system, we rely on saved information and saved state all the time. And yes, every time a computation takes too much time to reproduce, one should write it to a file to load every time. But I that should be a deliberate choice, worthy of its own save() and load() calls, and certainly not something one does with simple stuff that can be reproduced a the blink of an eye. Put more trust in your script than in your memory, and don’t save your workspace.