Using R: reshape2 to tidyr

Tidy data — it’s one of those terms that tend to confuse people, and certainly confused me. It’s Codd’s third normal form, but you can’t go around telling that to people and expect to be understood. One form is ”long”, the other is ”wide”. One form is ”melted”, another ”cast”. One form is ”gathered”, the other ”spread”. To make matters worse, I often botch the explanation and mix up at least two of the terms.

The word is also associated with the tidyverse suite of R packages in a somewhat loose way. But you don’t need to write in a tidyverse-style (including the %>%s and all) to enjoy tidy data.

But Hadley Wickham’s definition is straightforward:

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In practice, I don’t think people always take their data frames all the way to tidy. For example, to make a scatterplot, it is convenient to keep a couple of variables as different columns. The key is that we need to move between different forms rapidly (brain time-rapidly, more than computer time-rapidly, I might add).

And not everything should be organized this way. If you’re a geneticist, genotypes are notoriously inconvenient in normalized form. Better keep that individual by marker matrix.

The first serious piece of R code I wrote for someone else was a function to turn data into long form for plotting. I suspect plotting is often the gateway to tidy data. The function was like you’d expect from R code written by a beginner who comes from C-style languages: It reinvented the wheel, and I bet it had nested for loops, a bunch of hard bracket indices, and so on. Then I discovered reshape2.

library(reshape2)
fake_data <- data.frame(id = 1:20,
                        variable1 = runif(20, 0, 1),
                        variable2 = rnorm(20))
melted <- melt(fake_data, id.vars = "id")

The id.vars argument is to tell the function that the id column is the key, a column that tells us which individual each observation comes from. As the name suggests, id.vars can name multiple columns in a vector.
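To illustrate the point about several key columns, here is a small sketch. The group column and the fake_data_grouped name are made up for this example and are not part of the data above.

```r
library(reshape2)

## Hypothetical data with two key columns: id and group
fake_data_grouped <- data.frame(id = 1:20,
                                group = rep(c("a", "b"), each = 10),
                                variable1 = runif(20, 0, 1),
                                variable2 = rnorm(20))

## Both keys are kept next to each observation in the long form
melted_grouped <- melt(fake_data_grouped, id.vars = c("id", "group"))
```

The result should have one row per combination of individual and measured variable, with id and group repeated on each row.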

So this is the data before:

  id   variable1    variable2
1  1 0.938173781  0.852098580
2  2 0.408216233  0.261269134
3  3 0.341325188  1.796235963
4  4 0.958889279 -0.356218000

And this is after. This time the data frame doesn’t become wider. There are still three columns. But we go from 20 rows to 40: two variables times 20 individuals.

  id  variable       value
1  1 variable1 0.938173781
2  2 variable1 0.408216233
3  3 variable1 0.341325188
4  4 variable1 0.958889279

And now: tidyr. tidyr is the new tidyverse package for rearranging data like this.

The tidyr equivalent of the melt function is called gather. There are two important differences that messed with my mind at first.

The melt and gather functions take the opposite default assumption about what columns should be treated as keys and what columns should be treated as containing values. In melt, as we saw above, we need to list the keys to keep them with each observation. In gather, we need to list the value columns, and the rest will be treated as keys.

Also, the second and third arguments (and they would be the first and second if you piped something into it), are the variable names that will be used in the long form data. In this case, to get a data frame that looks exactly the same as the first, we will stick with ”variable” and ”value”.

Here are five different ways to get the same long form data frame as above:

library(tidyr)
melted <- gather(fake_data, variable, value, 2:3)

## Column names instead of indices
melted <- gather(fake_data, variable, value, variable1, variable2)

## Excluding instead of including
melted <- gather(fake_data, variable, value, -1)

## Excluding using column name
melted <- gather(fake_data, variable, value, -id)

## With pipe
melted <- fake_data %>% gather(variable, value, -id)

Usually, this is the transformation we need: wide to long. If we need to go the other way, we can use reshape2’s cast functions, or tidyr’s spread. This code recovers the original data frame:

## reshape2
dcast(melted, id ~  variable)

## tidyr
spread(melted, variable, value)
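
As a sanity check of the round trip, one could compare the recovered data frame with the original. This is only a sketch: the names long and wide are made up here, and the comparison is column by column to stay clear of any differences in attributes.

```r
library(tidyr)

fake_data <- data.frame(id = 1:20,
                        variable1 = runif(20, 0, 1),
                        variable2 = rnorm(20))

## Wide to long and back again
long <- gather(fake_data, variable, value, -id)
wide <- spread(long, variable, value)

## Same column names, and the same values in each column
identical(names(wide), names(fake_data))
all.equal(wide$variable1, fake_data$variable1)
all.equal(wide$variable2, fake_data$variable2)
```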

Peerage of Science Reviewer Prize 2017

I won a prize! Hurrah! I’m obviously very happy.

If you want to hear me answer a couple of questions and see the Peerage of Science crew engaged in some amusing video editing, look at the interview.

How did that happen? After being told, about a year ago, to check out the peer review platform Peerage of Science, I decided to keep reviewing manuscripts that showed up and were relevant to my interests. Reading and commenting on unpublished manuscripts is stimulating, and I thought it would help improve my reviewing and, maybe, writing.

Maybe this is a testament to the power of gamification. I admit that I’ve occasionally been checking my profile to see what the score is even without thinking of any reviewer prize.

Griffin & Nesseth ”The science of Orphan Black: the official companion”

I didn’t know that the science fiction series Orphan Black actually had a real Cosima: Cosima Herter, science consultant. After reading this interview and finishing season 5, I realised that there is also a new book I needed to read: The science of Orphan Black: The official companion by PhD candidate in development, stem cells and regenerative medicine Casey Griffin and science communicator Nina Nesseth with a foreword by Cosima Herter.

(Warning: This post contains serious spoilers for Orphan Black, and a conceptual spoiler for GATTACA.)

One thing about science fiction struck me when I was watching the last episodes of Orphan Black: Sometimes it makes a lot more sense if we don’t believe everything the fictional scientists tell us. Like real scientists, they may be wrong, or they may be exaggerating. The genetically segregated future of GATTACA becomes no less chilling when you realise that the silly high predictive accuracies claimed are likely just propaganda from an oppressive society. And as you realise that the dying P.T. Westmorland is an imposter, you can break your suspension of disbelief about LIN28A as a fountain of youth gene … Of course, genetics is a little more complicated than that, and he is just another rich dude who wants science to make him live forever.

However, it wouldn’t be Orphan Black if there weren’t a basis in reality: there are several single gene mutations in model animals (e.g. Kenyon & al 1993) that can make them live a lot longer than normal, and LIN28A is involved in ageing (reviewed by Jun-Hao & al 2016). It’s not out of the question that an engineered single gene disruption that substantially increases longevity in humans could be possible. Not practical, and not necessarily without unpleasant side effects, but not out of the question.

Orphan Black was part slightly scary adventure, part festival of ideas about science and society, part character-driven web of relationships, and part, sadly, bricolage of clichés. I found when watching season five that I’d forgotten most of the plots of seasons two through four, and I will probably never make the effort to sit through them again. The first and last seasons make up for it, though.

The series seems to have been set on squeezing as many different biological concepts as possible in there, so the book has to try to do the same. It has not just clones and transgenes, but also gene therapy, stem cells, prion disease, telomeres, dopamine, ancient DNA, stem cells in cosmetics and so on. Two chapters try valiantly to make sense of the clone disease and the cure. It shows that the authors have encyclopedic knowledge of life science, with a special interest in development and stem cells.

But I think they slightly oversell how accurate the show is. Like when Cosima tells Scott to ”run a PCR on these samples, see if there are any genetic markers” and ”can you sequence for cytochrome c?”, and Scott replies ”the barcode gene? that’s the one we use for species differentiation” … That’s what screen science is like. The right words, but not always in the right order.

Cosima and Scott sciencing at university, before everything went pear-shaped. One of the good things about Orphan Black was the scientist characters. There was a ton of them! The good ones, geniuses with sparse resources and self experimentation, the evil ones, well funded and deeply unethical, and Delphine. This scene is an exception in that it plays the cringe-inducing nerd angle. Cosima and Scott grew after this.

There are some scientific oddities. They must be impossible to avoid. For example, the section on epigenetics treats it as a completely new field, sort of missing the history of the subfield. DNA methylation research was going on already in the 1970s (Gitschier 2009). Genomic imprinting, arguably the only solid example of transgenerational epigenetic effects in humans, and X inactivation were both being discovered during the 70s and 80s (reviewed by Ferguson-Smith 2011). The book also makes a hash of genome sequencing, which is a shame but understandable. It would have taken a lot of effort to disentangle how sequencing worked when the fictional clone experiment started and how it got to how it works in season five, when Cosima runs Nanopore sequencing.

The idea of human cloning is evocative. Orphan Black flipped it on its head by making the main clone characters strikingly different. It also cleverly acknowledged that human cloning is a somewhat dated 20th century idea, and that the cutting edge of life science has moved on. But I wish the book had been harder on the premise of the clone experiment:

By cloning the human genome and fostering a set of experimental subjects from birth, the scientists behind the project would gain many insights into the inner workings of the human body, from the relay of genetic code into observable traits (called phenotypes), to the viability of manipulated DNA as a potential therapeutic tool, to the effects of environmental factors on genetics. It’s a scientifically beautiful setup to learn myriad things about ourselves as humans, and the doctors at Dyad were quick to jump at that opportunity. (Chapter 1)

This is the very problem. Of course, sometimes ethically atrocious fictional science would, in principle, generate useful knowledge. But when fictional science is near useless, let’s not pretend that it would produce a lot of valuable knowledge. When it comes to genetics and complex traits like human health, small sample studies of this kind (even if they were using clones) would be utterly useless. Worse than useless, they would likely be biased and misleading.

Researchers still float the idea of a ”baseline”, though, but in the form of a cell line, where it makes more sense. See the (Human) Genome Project-write (Boeke & al 2016), suggesting the construction of an ideal baseline cell line for understanding human genome function:

Additional pilot projects being considered include … developing a homozygous reference genome bearing the most common pan-human allele (or allele ancestral to a given human population) at each position to develop cells powered by ”baseline” human genomes. Comparison with this baseline will aid in dissecting complex phenotypes, such as disease susceptibility.

In the end, the most important part of science in science fiction isn’t to be factually correct, nor to be a coherent prediction about the future. If Orphan Black has raised interest in science, and I’m sure it has, that is great. And if it has stimulated discussions about the relationship between biological science, culture and ethics, that is even better.

The timeline of when relevant scientific discoveries happened in the real world and in Orphan Black is great. The book has a partial bibliography. The ”Clone Club Q&A” boxes range from silly fun to great open questions.

Orphan Black was probably the best genetics TV show around, and this book is a wonderful companion piece.

Plaque at the Roslin Institute to the sheep that haunts Orphan Black. ”Baa.”

Literature

Boeke, JD et al (2016) The genome project-write. Science.

Ferguson-Smith, AC (2011) Genomic imprinting: the emergence of an epigenetic paradigm. Nature reviews Genetics.

Gitschier, J. (2009). On the track of DNA methylation: An interview with Adrian Bird. PLOS Genetics.

Jun-Hao, E. T., Gupta, R. R., & Shyh-Chang, N. (2016). Lin28 and let-7 in the Metabolic Physiology of Aging. Trends in Endocrinology & Metabolism.

Kenyon, C., Chang, J., Gensch, E., Rudner, A., & Tabtiang, R. (1993). A C. elegans mutant that lives twice as long as wild type. Nature, 366(6454), 461-464.

European Society for Evolutionary Biology congress, Groningen, 2017

The European Society for Evolutionary Biology meeting this year took place August 20–25 in Groningen, Netherlands. As usual, the meeting was great, with lots of good talks and posters. I was also happy to meet colleagues, including people from Linköping who I’ve missed a lot since moving.

Here are some of my subjective highlights:

There were several interesting talks in the recombination symposium, spanning from theory to molecular biology and from within-population variation to phylogenetic distances. For example: Irene Tiemann-Boege talked about recombination hotspot evolution from the molecular perspective with mutation bias and GC-biased gene conversion (Arbeithuber & al 2015), while Francisco Úbeda de Torres presented a population genetic model of recombination hotspots. I would need to pore over the paper to understand what was going on and if the model solves the hotspot paradox (as the title said), and how it is different from his previous model (Úbeda & Wilkins 2011).

There were also talks about young sex chromosomes. Alison Wright talked about recombination suppression on the evolving guppy sex chromosomes (Wright & al 2017), and Bengt Hansson about the autosome–sex chromosome fusion in Sylvioidea birds (Pala & al 2012).

Piter Bijma gave two (!) talks on social genetic effects. That is when your trait value depends not just on your genotype, but on the genotypes of others around you, a situation that is probably not at all uncommon. After all, animals often live in groups, and plants have to stay put where they are. One can model this, which leads to a slightly whacky quantitative genetics where heritable variance can be greater than the trait variance, and where the individual and social effects can cancel each other out and prevent response to selection.

I first heard about this at ICQG in Edinburgh a few years ago (if memory serves, it was Bruce Walsh presenting Bijma’s slides?), but have only made a couple of fairly idle and unsuccessful attempts to understand it since. I got the feeling that social genetic effects should have some bearing on debates about kin selection versus multilevel selection, but I’m not sure how it all fits together. It is nice that it comes with a way to estimate effects (given that we know which individuals are in groups together and their relatedness), and there are some compelling case studies (Wade & al 2010). On the other hand, separating social genetic effects from other social effects must be tricky; for example, early social environment effects can look like indirect genetic effects (Canario, Lundeheim & Bijma 2017).
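
To see how heritable variance can exceed trait variance, here is a toy calculation. It is a sketch under assumptions that are not from the talks: groups of n unrelated individuals, a trait value made up of one's own direct effects plus the social effects of the n - 1 group mates, and a total breeding value defined as the direct effect plus n - 1 times the social effect. The function name and all parameter values are made up.

```r
## Toy variance calculation for social (indirect) genetic effects.
## All parameter values are invented for illustration.
social_variances <- function(n, var_direct, var_social, cov_ds,
                             var_env_direct, var_env_social) {
  ## Total heritable variance: variance of A_D + (n - 1) * A_S
  heritable <- var_direct + 2 * (n - 1) * cov_ds + (n - 1)^2 * var_social
  ## Trait variance with unrelated group mates: own direct effects plus
  ## the social effects of the n - 1 others
  trait <- var_direct + var_env_direct +
    (n - 1) * (var_social + var_env_social)
  c(heritable = heritable, trait = trait)
}

## Uncorrelated direct and social effects in groups of eight
larger <- social_variances(n = 8, var_direct = 1, var_social = 0.25,
                           cov_ds = 0, var_env_direct = 2,
                           var_env_social = 0.25)

## Direct and social effects perfectly negatively correlated
cancelled <- social_variances(n = 8, var_direct = 1, var_social = 1 / 49,
                              cov_ds = -1 / 7, var_env_direct = 2,
                              var_env_social = 0.25)
```

The social genetic variance enters the heritable variance multiplied by (n - 1) squared, but the trait variance only multiplied by n - 1, which is why the first case gives more heritable variance than trait variance. In the second case, the negative covariance removes all the heritable variance even though both kinds of effects are there.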

Philipp Gienapp talked about using realised relatedness (i.e. genomic relationships a.k.a. throw all the markers into the model and let partial pooling sort them out) to estimate quantitative genetic parameters in the wild. There is a lot of relevant information in the animal breeding and human genetics literature, but applying these things in the wild comes with challenges that deserve some new research to sort things out. Evolutionary genetics, similar to human genetics, is more interested in parameter estimation than prediction of phenotypes or breeding values. On the other hand, human genetics methods often work on GWAS summary statistics. In this way, evolutionary genetics is probably more similar to breeding. Also, the relatedness structure of the populations may matter. Evolution happens in all kinds of populations, large and small, structured and well-mixed. Therefore, evolutionary geneticists may work with populations that are different from those in breeding and human genetics.

For example, someone asked about estimating genetic correlations with genomic relationships. There are certainly animal breeding and human genetics papers about realised relatedness and genetic correlation (Jia & Jannink 2012, Visscher & al 2014 etc), because of course, breeders need to deal a lot with correlated traits and human geneticists really like finding genetic correlations between different GWAS traits.
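
For readers who haven't seen one, this is roughly what a realised relationship matrix looks like in code. It is a sketch of one common construction (centre the allele counts by twice the allele frequency and scale by the summed heterozygosity), applied to simulated genotypes; it is an assumption on my part that this is representative, and not necessarily what was used in the talk.

```r
set.seed(1)

## Simulated allele counts (0, 1, 2) for unrelated individuals;
## sizes and allele frequencies are arbitrary
n_ind <- 50
n_markers <- 1000
freq <- runif(n_markers, 0.1, 0.9)
genotypes <- sapply(freq, function(f) rbinom(n_ind, size = 2, prob = f))

## Centre each marker by twice its observed allele frequency
p_hat <- colMeans(genotypes) / 2
centred <- sweep(genotypes, 2, 2 * p_hat)

## Scale so that the diagonal is around one for non-inbred individuals
G <- tcrossprod(centred) / (2 * sum(p_hat * (1 - p_hat)))
```

The matrix G would then take the place of the pedigree relationship matrix in the mixed model.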

Speaking of population structure, Fst scans are still all the rage. There was a lot of discussion about trying to find regions of the genome that stand out as more differentiated in closely related populations (”genomic islands of speciation/divergence/differentiation”), and as less differentiated in mostly separated populations (introgression, possibly adaptive). But it’s not just Fst outliers. It’s encouraging to see different kinds of quantitative and population genomic methods applied in the same systems. On the hybrid and introgression side of things, Leslie Turner (Turner & Harr 2014) and Jun Kitano (Ravinet & al 2017) gave interesting talks on mice and sticklebacks, respectively. Danièle Filiault showed a super impressive integrative GWAS and selection mapping study of local adaptation in Swedish Arabidopsis thaliana (Kerdaffrec & al 2016).

Susan Johnston spoke about recombination mapping in Soay sheep and Rum deer (Johnston & al 2016, 2017). Given how few large long term genetic studies like this there are, it’s marvelous to see the same kind of analysis in two parallel systems. Jason Munshi-South gave what seemed like a fascinating talk about rodent evolution in New York City (Harris & Munshi-South 2017). Unfortunately, too many other people thought so too, and I mostly failed to eavesdrop from the corridor.

Finally, Nina Wedell gave a wonderful presidential address about Evolution in the 21st century. ”Because I can. I’m the president now.” Yes!

The talk was about threats to evolutionary biology, examples of its usefulness and a series of calls to action. I liked the part about celebrating science much more than the common call to explain science to people. You know, like you hear at seminars and the march for science: We need to ”get out there” (where?) and ”explain what we’re doing” (to whom?). Because if it is true that science and scientists are being questioned, then scientists should speak in a way that works even if they’re not starting by default from a position of authority. Scientists need not just explain the science, but justify why the science is worth listening to in the first place.

”As your current president, I encourage you to celebrate evolution!”

I think this is precisely right, and it made me so happy. Of course, it leaves questions like ”What does that mean?”, ”How do we do it?”, but as a two word slogan, I think it is perfect.

Celebration aligns with sound rhetorical strategy in two ways. First, explanation is fine when someone asks for it, or is otherwise already disposed to listen to an explanation. But otherwise, it is more important to awaken interest and a positive state of mind before laying out the facts. (I can’t claim to be any kind of rhetorics expert. But see Rhetoric: for Herennius, Book I, V-VII for ancient wisdom on the topic.) By the way, I’m sure this is what people who are good at science communication actually do. Second, celebration means concentrating on the excitement and wonder, and the good things science can do. In that way, it prevents the trap of listing all the bad things that will happen if Trumpists, creationists and anti-vaccine activists get their way.

Nina Wedell also gave examples of the usefulness of evolution: biomimicry, directed evolution of enzymes, the power of evolutionary algorithms, plant and animal breeding, and prevention of resistance to herbicides and antibiotics. These are all good, worthy things, but also quite a limited subset of evolutionary biology? Maybe the idea is that evolutionary biology should be a basic science supporting applications like these. In line with that, she brought up how serendipitous useful things can come from studying strange diverse organisms and figuring out how they do things. The example in the talk was the CRISPR–Cas system. Similar stories apply to other proteins used as biomedical and biotechnology tools, such as Taq polymerase and green fluorescent protein.

I have to question a remark about reproducibility, though. The list of threats included ”critique of the scientific method” and concerns over reproducibility, as if this was something that came from outside of science. I may have misunderstood. It was a very brief comment. But if problems with reproducibility are a threat to science, and I think they can be, then it’s not just a problem of image but a problem with how scientists perform, analyse, and report their science.

Evolutionary biology hasn’t been in the reproducibility crisis news the same way as psychology or behavioural genetics, but I don’t know if that is because of better quality, or just that no one has looked that carefully for the problems. There are certainly contradictory results here too, and the same overly flexible data analysis and selective reporting practices that cause problems elsewhere must be common in evolution too. I can think of some reasons why evolutionary biology may be better off. Parts of the field default to analysing data with multilevel or mixed models. Mixed models are not perfect, but they help with some multiple testing problems by fitting and partially pooling a lot of coefficients in the same model. Also, studies that use classical model organisms may be able to get a lot of replication, low variance, and large sample sizes in a way that is impossible for example with human experiments.

So I don’t know if there is a desperate need for large initiatives for replication of key results, preregistration of studies, and improvement of data analysis practice in evolution; there may or there may not. But wouldn’t it still be wonderful if we had them?

Bingo! I don’t have a ton of photos from Groningen, but here is my conference bingo card. Note what conspicuously isn’t filled in: the poster sessions took place in a nice big room, and were not that loud. In retrospect, I probably didn’t go to enough of the non-genetic inheritance talks, and I should’ve put Fisher 1930 instead of 1918.

”These are all fairly obvious” (says Sewall Wright)

I was checking a quote from Sewall Wright, and it turned out that the whole passage was delightful. Here it is, from volume 1 of Genetics and the Evolution of Populations (pages 59-60):

There are a number of broad generalizations that follow from this netlike relationship between genome and complex characters. These are all fairly obvious but it may be well to state them explicitly.

1) The variations of most characters are affected by a great many loci (the multiple factor hypothesis).

2) In general, each gene replacement has effects on many characters (the principle of universal pleiotropy).

3) Each of the innumerable possible alleles at any locus has a unique array of differential effects on taking account of pleiotropy (uniqueness of alleles).

4) The dominance relation of two alleles is not an attribute of them but of the whole genome and of the environment. Dominance may differ for each pleiotropic effect and is in general easily modifiable (relativity of dominance).

5) The effects of multiple loci on a character in general involve much nonadditive interaction (universality of interaction effects).

6) Both ontogenetic and phylogenetic homology depend on calling into play similar chains of gene-controlled reactions under similar developmental conditions (homology).

7) The contributions of measurable characters to overall selective value usually involve interaction effects of the most extreme sort because of the usually intermediate position of the optimum grade, a situation that implies the existence of innumerable different selective peaks (multiple selective peaks).

What can we say about this?

It seems point one is true. People may argue about whether the variants behind complex traits are many, relatively common, with tiny individual effects or many, relatively rare, and with larger effects that average out to tiny effects when measured in the whole population. In any case, there are many causative variants, alright.

Point two, now also known as the omnigenic model, hinges on how you read ”in general”, I guess. In some sense, universal pleiotropy follows from genome crowding. If there are enough causative variants and a limited number of genes, eventually every gene will be associated with every trait.

I don’t think that point three is true. I would assume that many loss of function mutations to protein coding genes, for example, would be interchangeable.

I don’t really understand points six and seven, about homology and fitness landscapes, that well. The later section about homology reads to me as if it could be part of a debate going on at the time. Number seven describes Wright’s view of natural selection as a kind of fitness whack-a-mole, where if a genotype is fit in one dimension, it probably loses in some other. The hypothesis and the metaphor have been extremely influential — I think largely because many people thought that it was wrong in many different ways.

Points four and five are related and, I imagine, the most controversial of the list. Why does Wright say that there is universal epistasis? Because of physiological genetics. Or, in modern parlance, maybe because of gene networks and systems biology. On page 71, he puts it like this:

Interaction effects necessarily occur with respect to the ultimate products of chains of metabolic processes in which each step is controlled by a different locus. This carries with it the implication that interaction effects are universal in the more complex characters that trace such processes.

The argument seems to persist to this day, and I think it is true. On the other hand, there is the question how much this matters to the variants that actually segregate in a given population and affect a given trait.

This is often framed as a question of variance. It turns out that even with epistatic gene action, in many cases, most of the genetic variance is still additive (Mäki-Tanila & Hill 2014, Huang & Mackay 2016). But something similar must apply to the effects that you will see from a locus. They also depend on the allele frequencies at other loci. An interaction does nothing when one of the interaction partners is fixed. If it is nearly fixed, the interaction will do nearly nothing. If they’re all at intermediate frequency, things become more interesting.
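
A toy calculation may make this concrete. This sketch assumes two biallelic loci in Hardy-Weinberg proportions and linkage equilibrium, and a made-up genotypic value that is simply the product of the two allele counts; the function name variance_split is invented for the example. It enumerates the nine two-locus genotypes, fits the additive model by frequency-weighted regression, and calls whatever variance is left over interaction variance.

```r
## p and q are the allele frequencies at the two loci,
## and should be strictly between 0 and 1
variance_split <- function(p, q) {
  geno <- expand.grid(a = 0:2, b = 0:2)
  ## Genotype frequencies: Hardy-Weinberg within loci, independence between
  geno$freq <- dbinom(geno$a, 2, p) * dbinom(geno$b, 2, q)
  ## Assumed toy gene action: pure product of allele counts
  geno$value <- geno$a * geno$b
  ## Additive part: frequency-weighted regression on allele counts
  fit <- lm(value ~ a + b, data = geno, weights = freq)
  weighted_var <- function(x) sum(geno$freq * (x - sum(geno$freq * x))^2)
  total <- weighted_var(geno$value)
  additive <- weighted_var(fitted(fit))
  c(effect_a = unname(coef(fit)["a"]),
    additive = additive,
    interaction = total - additive)
}

## Both loci at intermediate frequency
intermediate <- variance_split(p = 0.5, q = 0.5)

## Second locus nearly fixed
nearly_fixed <- variance_split(p = 0.5, q = 0.999)
```

With both loci at frequency 0.5, the additive variance is 1 and the interaction variance 0.25, so most of the variance is additive even though the gene action is entirely a product. With the second locus nearly fixed, the interaction variance all but disappears, and the first locus behaves like an ordinary additive locus whose effect is set by the frequency at the second.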

Wright’s principle of universal interaction is also grounded in his empirical work. A lot of space in this book is devoted to results from pigmentation genetics in guinea pigs, which includes lots of dominance and interaction. It could be that Wright was too quick to generalise from guinea pig coat colours to other traits. It could be that working in a system consisting of inbred lines draws your attention to nonlinearities that are rare and marginal in the source populations. On the other hand, it’s in these systems we can get a good handle on the dominance and interaction that may be missed elsewhere.

Study of effects in combination indicates a complicated network of interacting processes with numerous pleiotropic effects. There is no reason to suppose that a similar analysis of any character as complicated as melanin pigmentation would reveal a simpler genetic system. The inadequacy of any evolutionary theory that treats genes as if they had constant effects, favourable or unfavourable, irrespective of the rest of the genome, seems clear. (p. 88)

I’m not that well versed in pigmentation genetics, but I hope that someone is working on this. In an era where we can identify the molecular basis of classical genetic variants, I hope that someone keeps track of all these A, C, P, Q etc, and to what extent they’ve been mapped.

Literature

Wright, Sewall. ”Genetics and the Evolution of Populations” Volume 1 (1968).

Mäki-Tanila, Asko, and William G. Hill. ”Influence of gene interaction on complex trait variation with multilocus models.” Genetics 198.1 (2014): 355-367.

Huang, Wen, and Trudy FC Mackay. ”The genetic architecture of quantitative traits cannot be inferred from variance component analysis.” PLoS genetics 12.11 (2016): e1006421.

Yours truly outside the library on Thomas Bayes’ road, incredibly happy with having found the book.

See you at #eseb2017

I’m going to Groningen for the European Society for Evolutionary Biology meeting on the 20th to 25th of August.

Given what I’m currently working on, I’m especially excited about the symposium on applications of evolutionary biology in agriculture and industry and also the sprawling three-day genomics of adaptation symposium, but I assume that there will be an abundance of interesting talks and posters all over the place.

If you are there, say hello!

Scripting for data analysis (with R)

Course materials (GitHub)

This was a PhD course given in the spring of 2017 at Linköping University. The course was organised by the graduate school Forum scientium and was aimed at people who might be interested in using R for data analysis. The materials developed from a part of a previous PhD course from a couple of years ago, an R tutorial given as part of the Behaviour genetics Masters course, and the Wright lab computation lunches.

Around twenty people attended the seminars, and a couple of handfuls of people completed the homeworks. I don’t know how much one should read into the course evaluation form, but the feedback was mostly positive. Some people had previous exposure to R, and did the first homework in an hour. Others had never programmed in any language, and had a hard time getting started.

There is certainly scope for improvement. For example, some of the packages used could be substituted for more contemporary tools. One could say that the course is slouching towards the tidyverse. But I worry a bit about making the participants feel too boxed in. I don’t want them to feel that they’re taught a way that will solve some anticipated type of problems very neatly, but that may not generalize. Once I’ve made the switch to dplyr and tidyr (and maybe even purrr … though I hesitate) fully myself, I would probably use them in teaching too. Another nice plus would be to be able to use R for data science as course literature. The readings now are scattered; maybe a monolithic book would be good.

I’ve tried, in every iteration, to emphasize the importance of writing scripts, even when working interactively with R. I still think I need to emphasize it even more. There is also a kind of ”do as I say, not as I do” issue, since in the seminars, I demo some things by just typing them into the console. I’ll force myself to write them into a script instead.

Possible alternate flavours for the course include: A longer version expanding on the same topics. I don’t think one should cram more contents in. I’d like to have actual projects where the participants can analyze, visualize and present data and simulations.

This is the course plan we sent out:

1. A crash course in R

Why do data analysis with a scripting language
The RStudio interface
Using R as a calculator
Working interactively and writing code
Getting help
Reading and looking at data
Installing useful packages
A first graph with ggplot2

Homework for next time: The Unicorn Dataset, exercises in reading data, descriptive statistics, linear models and a few statistical graphs.

2. Programming for data analysis

Programming languages one may encounter in science
Common concepts and code examples
Data structures in R
Vectors
Data frames
Functions
Control flow

Homework for next time: The Unicorn Expression Dataset, exercises in data wrangling and more interesting graphs.

3. Working with moderately large data

Exercise followup
More about functions
Lists
Objects
Functional and imperative programming
Doing things many times, loops and plyr
Simulating data
Working on a cluster

Final homework: Design analysis by simulation: pick a data analysis project that you care about; simulate data based on a model and reasonable effect size; implement the data analysis; and apply it to simulated data with and without effects to estimate power and other design characteristics. This ties together skills from all seminars.
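
To give an idea of what such a design analysis might look like, here is a minimal sketch. It is not taken from the course materials: the two-group design, the effect size of one standard deviation, the sample size and the function names are all made up for illustration.

```r
set.seed(20170601)

## Simulate one dataset from the model and run the planned analysis
simulate_p_value <- function(n_per_group, effect) {
  control <- rnorm(n_per_group, mean = 0, sd = 1)
  treated <- rnorm(n_per_group, mean = effect, sd = 1)
  t.test(treated, control)$p.value
}

## Repeat many times and count how often the test comes out significant
estimate_power <- function(n_per_group, effect, n_rep = 1000, alpha = 0.05) {
  p_values <- replicate(n_rep, simulate_p_value(n_per_group, effect))
  mean(p_values < alpha)
}

## With an effect: estimated power
power_with_effect <- estimate_power(n_per_group = 20, effect = 1)

## Without an effect: estimated false positive rate
false_positive_rate <- estimate_power(n_per_group = 20, effect = 0)
```

The same skeleton carries over to more realistic projects by swapping in another simulation function and another analysis.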