I’ve written before about importing tabular text files into R, and here comes some more. This is because I believe (firmly) that importing data is the major challenge for beginners who want to analyse their data in R. What is the most important thing about using any statistics software? To get your data into it in the first place! Unfortunately, no two datasets are the same and many frustrations await the beginner. Therefore, let me share a few things that I’ve picked up while trying to read data into R.
General tip: keep it as simple as possible. Open your favourite spreadsheet software, remove all the colours and formatting, make sure to have only one header row with short and simple column names (avoid any special characters; R will turn them to full stops), remove empty rows above and empty columns to the left of the table, and export the sheet to plain text, tab separated or comma separated. Then use either read.table, read.csv or the Import dataset button in RStudio to read your table, and in case of doubt, begin with the default settings, which are often sensible. If you see any annoying data import errors, below are a few possible causes.
If you don’t get the number of columns you expect
Then it’s often a separator problem. read.table allows you to set the field separator character (with sep=) and the decimal separator (with dec=). To get a tab character, use ”\t”:
data <- read.table(file="data.txt", sep="\t", dec=",")
In many countries this is not an issue, but the Swedish standard is to use a comma as decimal separator, while R uses a decimal point. If you export a comma separated value file on a Swedish computer, you are likely to get numbers with decimal commas and semicolons as field separators. Then what you want is read.csv2( ), which is a read.csv made for semicolon separated files with decimal commas.
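To make the difference concrete, here is a minimal sketch. The file contents are made up and typed in as a string (through the text= argument) so that the example runs on its own; with a real file you would pass the file name instead.

```r
# Made-up file contents, as a semicolon separated export with decimal commas
swedish_export <- "id;weight;height
a;1,5;10,2
b;2,25;11,0"

# read.csv2 assumes sep=";" and dec=","
data_ok <- read.csv2(text = swedish_export)
str(data_ok)

# The same thing spelled out with read.table
data_same <- read.table(text = swedish_export, header = TRUE,
                        sep = ";", dec = ",")

# With the right field separator but the default decimal point,
# the numbers are not recognised as numbers
data_dot <- read.table(text = swedish_export, header = TRUE, sep = ";",
                       stringsAsFactors = FALSE)
sapply(data_dot, class)
```

The last call is the kind of thing to look for when a column that should be numeric comes out as something else.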
An additional benefit of using RStudio is that it solves those issues for you with the Import dataset button. The dialog box that appears when you click it lets you choose separators, header line (yes/no) and whether there are quotes around fields, and shows a preview of what the table will look like. Then it builds the read.table command for you, so that you can copy it into your script. If you’re not an RStudio user, just look at the actual file in a text editor, and add dec= and sep= instructions to your read.table as needed.
If you see a lot more columns than you expect, usually after the ones you expect, filled with NAs, it’s probably because you’ve happened to enter some character (very likely a whitespace …) somewhere to the right in the spreadsheet. When exported, the software will fill in all the empty cells, which R will interpret as missing values. If this happens, make sure to delete the contents of the spreadsheet after the columns you’re interested in.
Whitespaces can also be a problem if you’ve specified a custom separator, like above. Normally, read.table is very good about skipping over whitespace, should you have happened to put an extra whitespace in before or after a factor level. So, for instance, ” male” and ”female ” are interpreted as male/female, which is probably what one wants. That functionality is turned off when you specify a separator, though, so you might need to switch it on again by setting strip.white=TRUE in the function call. This cost me several grey hairs and string operations before I came to my senses and read the data import manual.
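A small sketch of the whitespace problem, using a made-up tab separated table given as a string. stringsAsFactors=FALSE is set only so that the result looks the same regardless of R version.

```r
# Made-up tab separated data with stray spaces around some values
tabbed <- "id\tsex\n1\t male\n2\tfemale \n3\tmale"

# With an explicit separator the stray spaces are kept
raw <- read.table(text = tabbed, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)
unique(raw$sex)

# strip.white=TRUE removes leading and trailing whitespace from the fields
clean <- read.table(text = tabbed, header = TRUE, sep = "\t",
                    strip.white = TRUE, stringsAsFactors = FALSE)
unique(clean$sex)
```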
If columns are not the type you expect
If something is a character when it should be numeric you might see messages such as ”‘x’ must be numeric” or ”non-numeric argument to binary operator”. If something is a factor when it should be character, some character operations might fail. Regardless, it could still be a separator problem. If R expects to see decimal points and sees a comma, it will make that column a character vector rather than a numeric column. The principle is that read.table will try to turn the column into numeric or integer (or complex or logical, but they are rarer). Most often, this means that if you feed it numbers it will make them numeric. But if there is anything else in any of the elements, it will assume character, and in R versions before 4.0.0 it will by default convert character vectors into factors.
There are several reasons why you might have text among the numbers, making R not interpret the column as numeric. We’ve mentioned wrong separators, but sometimes if you paste spreadsheets from different sources together you end up with inconsistent field separators. I’ve written a post about a way to deal with that in some cases, but you can also search and replace the file with a text editor, or use a command-line tool like sed. Missing values can be another problem. read.table understands that empty fields are missing, but maybe you or your collaborator has used another character to mean missing, like ”-” or ”?”. This is dealt with by specifying your own na.strings:
data <- read.table("data.txt", na.strings=c("NA", "-", "?"))
R’s default of interpreting characters as factors is sometimes a good choice, but often what you really want is characters that you can work with and then, if needed, turn into factors. In particular, if your column of sample ids is turned into a factor, comparing and matching ids will sometimes hurt you. To avoid that you can specify stringsAsFactors=FALSE (which is the default from R 4.0.0 onwards). This works when you make data frames with the data.frame function as well.
data <- read.table("data.txt", stringsAsFactors=F)
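One way factors can hurt, as a minimal sketch with made-up ids that happen to look like numbers. This is just one example of the kind of surprise meant above.

```r
# Made-up sample ids that happen to look like numbers
ids_factor <- factor(c("10", "20", "5"))
ids_character <- c("10", "20", "5")

# Converting a factor to numeric gives the internal level codes, not the ids
as.numeric(ids_factor)

# Going through character first gives the values one probably wanted
as.numeric(as.character(ids_factor))

# Character ids can be compared and matched directly
ids_character %in% c("5", "10")
```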
Diagnosing problems with your data frame
Look at the dimensions of your data frame and the classes of the columns. dim gives dimensions; the class function gives the type of a column and the str function will give you a summary of the structure of the object.
dim(data)
lapply(data, class)
str(data)
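Once the classes reveal a column that should have been numeric, it helps to find the entries that caused the trouble. A minimal sketch with a made-up table, where a ”-” has been used for a missing value:

```r
# Made-up whitespace separated data with "-" used for a missing value
messy_text <- "id weight
s1 1.5
s2 -
s3 2.0"

messy <- read.table(text = messy_text, header = TRUE,
                    stringsAsFactors = FALSE)
sapply(messy, class)

# Which entries cannot be converted to numbers?
not_a_number <- is.na(suppressWarnings(as.numeric(messy$weight)))
messy[not_a_number, ]

# Reading again with na.strings turns the column numeric
fixed <- read.table(text = messy_text, header = TRUE,
                    na.strings = c("NA", "-"), stringsAsFactors = FALSE)
sapply(fixed, class)
```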
This recent paper, Pandey & al (2014), made me interested because I’m in the business of finding genes for traits, and have spent quite some time looking at lists of gene names and annotation database output. One is tempted to look for the ”outstanding candidates” that ”make biological sense” (quotes intended as scare quotes), but the truth is probably that no-one knows what genes and functions we should expect to be affected by genetic variation in, for instance, behaviour. This paper tries to make the case for the unknown parts of the brain transcriptome; they use data about gene expression, protein domains, paralogs and literature to argue that the unknown genes are unknown for no good reason and that they might be just as important as genes that happen to be well-known.
They found genes that had a high ratio of expression in brain to average expression in other tissues of C57BL/6J and DBA/2J mice and searched PubMed for these genes in combination with neuroscience-related keywords. Some of them have few citations and these are their selectively expressed but little studied genes. They then make a series of comparisons between these and well-studied genes. It turns out the only major difference is that well-studied genes were discovered (entered into GenBank) earlier.
I don’t know to what extent these results are surprising. I was not surprised by their main conclusion, but then again, maybe my opinion was mostly prejudice. There is a literature on biases in the functional genomics literature, but I don’t know much about it. And apparently neither did the authors, initially, as Robert Williams writes in a comment on the PLOS ONE website:
We did not rediscover the lovely work of Robert Hoffmann (now head of WikiGene) until the paper had been submitted in succession to six higher profile journals … Hoffmann and colleagues showed that social factors account for much of the annotation imbalance for genes.
I love the idea of authors writing an informal comment about the background of the paper like this.
The coexpression network results show some of the little known genes are just as connected as known important genes. This suggests some of the unknown genes might be important too, if we can trust that coexpression hub genes are likely to be important (for various values of ”important”). Maybe this is a scientific opportunity for some neuroscientist. Several people I’ve talked with have imagined future Big Science initiatives to describe the function of unknown genes (”divide them up between labs and characterise them!”), and some initiatives exist, such as the IMPC. On the other hand, how do we know that we really find the most important and interesting functions of a gene? The skeptic in me thinks that going bottom up, from gene to phenotype, will miss the most interesting surprising phenotypes.
I think ”ignorome” is one of those unnecessary bad omics words, which is why I’ve avoided using it.
Their PubMed query was restricted to mouse, human and rat. I wonder why. Maybe there could be something useful from fruit flies or roundworms?
Overall, a fun paper that I recommend reading over a few cups of coffee!
Pandey AK, Lu L, Wang X, Homayouni R, Williams RW (2014) Functionally Enigmatic Genes: A Case Study of the Brain Ignorome. PLoS ONE 9(2): e88889. doi:10.1371/journal.pone.0088889
Apparently, this turned out to be my most popular post ever. Of course there are lots of things to say about the heatmap (or quilt, tile, guilt plot etc), but what I wrote was literally just a quick celebratory post to commemorate that I’d finally grasped how to combine reshape2 and ggplot2 to quickly make this colourful picture of a correlation matrix.
However, I realised there is one more thing that is really needed, even if just for the first quick plot one makes for oneself: a better scale. The default scale is not the best for correlations, which range from -1 to 1, because it’s hard to tell where zero is. We use the airquality dataset for illustration as it actually has some negative correlations. In ggplot2, it’s very easy to get a scale that has a midpoint and a different colour in each direction. It’s called scale_fill_gradient2 (the fill version, since the tiles are coloured through the fill aesthetic), and we just need to add it. I also set the limits to -1 and 1, which doesn’t change the colour but fills out the legend for completeness. Done!
data <- airquality[,1:4]
library(ggplot2)
library(reshape2)
qplot(x=Var1, y=Var2, data=melt(cor(data, use="p")), fill=value, geom="tile") +
  scale_fill_gradient2(limits=c(-1, 1))
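For comparison, here is a sketch of the same plot built step by step with the ggplot function instead of qplot. The intermediate names (cors, molten, p) are only for illustration, and midpoint=0 is written out even though it is the default, to make the point about zero explicit.

```r
library(ggplot2)
library(reshape2)

# Correlation matrix of the first four columns, using pairwise complete cases
cors <- cor(airquality[, 1:4], use = "pairwise.complete.obs")

# melt turns the matrix into a long data frame with columns Var1, Var2, value
molten <- melt(cors)

p <- ggplot(molten, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1), midpoint = 0)
print(p)
```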
I recently got an email from a person at Packt publishing, who suggested I write a book for them about ggplot2. My answer, which is perfectly true, is that I don’t have the time, nor the expertise to do that. What I didn’t say is that 1) a quick web search suggests that Packt doesn’t have the best reputation and 2) there are already two books about ggplot2 that I think cover the entire field: the indispensable ggplot2 book, written by Hadley Wickham, the author of the package, and the R Graphics Cookbook by Winston Chang. There are too many decent but not great R books on the market already and there is no reason for me to spend time to create another one.
However, there are a few things I’d like to tell the novice, in general, about ggplot2:
1. ggplot2 is not necessarily superior to lattice or base graphics. It’s largely a matter of taste: you can make very nice plots with any of the plotting systems, and in some situations everybody will turn away from their favourite system and use another. For instance, a lot of built-in diagnostic plots come preprogrammed in base graphics. The great thing about ggplot2 is how it allows the user to think about plots as layers of mappings between variables and geometric objects. Base graphics and lattice are organised in different ways, which is not necessarily worse; it depends on the way you’re accustomed to thinking about statistical graphics.
2. There are two ways to start a plot in ggplot2: qplot( ) or ggplot( ). qplot is the quicker way and probably the one you should learn first. But don’t be afraid of the ggplot function! All it does is set up an empty plot and connect a data frame to it. After that you can add your layers and map your variables almost the same way you would do with qplot. After you’ve become comfortable with qplot, just try building plots with ggplot a few times, and you’ll see how similar it is.
3. The magic is not in plotting the data but in tidying and rearranging the data for plotting. Make sure to put all your labels, indicators and multiple series of data into the same data frame (most of the time: just cbind or merge them together), melt the data frame and pass it to ggplot2. If you want to layer predictions on top of your data, put output from your model in another data frame. It is perfectly possible, often advisable, to write functions that generate ggplot2 plots, but make sure to always create the data frame to be plotted first and then pass it on. I suggest not trying to create the mappings, that is x= and y= and the like, on the fly. There is always the risk of messing up the order of the vectors, and also, because ggplot2 uses metaprogramming techniques for the aesthetics, you might see unexpected behaviours when putting function calls into the mapping.
4. Worry about mapping variables and faceting first and then change the formatting. Because of how plots as well as settings in ggplot2 are objects that you can store and pass around, you can first create a raw version of the plot, and then just add (yes add, with the ”+” operator) the formatting options you want. So taken together, the workflow for making any ggplot2 plot goes something like this: 1) put your data frame in order; 2) set up a basic plot with the qplot or ggplot function; 3) add one extra geom at a time (optional if you make a simple plot with qplot, since qplot sets up the first geometry for you); 4) add the settings needed to make the plot look good.
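The workflow in the list above could look something like this. It is a sketch using the built-in airquality data; the linear model is only there to have some predictions to layer on top, and all the object names are made up for the example.

```r
library(ggplot2)

# 1) Put the data frame in order: pick columns, drop missing values,
#    make the faceting variable a factor
plot_data <- na.omit(airquality[, c("Month", "Temp", "Ozone")])
plot_data$Month <- factor(plot_data$Month)

# Model output goes into its own data frame
fit <- lm(Ozone ~ Temp, data = plot_data)
pred <- data.frame(Temp = seq(min(plot_data$Temp), max(plot_data$Temp),
                              length.out = 50))
pred$Ozone <- predict(fit, newdata = pred)

# 2) Set up a basic plot with the first geom
p <- ggplot(plot_data, aes(x = Temp, y = Ozone)) + geom_point()

# 3) Add one extra layer at a time, then the faceting
p <- p + geom_line(data = pred, colour = "blue")
p <- p + facet_wrap(~ Month)

# 4) Add the formatting last
p <- p + theme_bw() + xlab("Temperature (degrees F)") + ylab("Ozone (ppb)")
print(p)
```

Since pred has no Month column, the prediction line is drawn in every panel, which here is the point: the same overall fit is shown against each month’s data.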
A couple of weeks ago I attended the Evolution in Sweden meeting in Uppsala, as expected a very nice meeting with lots of interesting things. My last conference was ESEB last summer, which was great because it was a huge conference with so much to see and so many people. Evolution in Sweden was great because it wasn’t huge, so that it was very possible to see everything, recognise familiar faces and talk with people. I had a poster on the behaviour genetics of chicken domestication (of course!).
Here are some of my personal highlights, in no particular order:
Kerstin Johannesson’s talk, an ”advertisement for marine organisms”, was probably the most fun and engaging. I was very convinced that evolutionary research in the Baltic Sea is a great idea! Among other things she mentioned salinity gradients, the sexual and asexual reproduction of Fucus brown algae, Littorina saxatilis of course and the IMAGO project to sequence and assemble reference genomes for eight different species from the Baltic.
We have a great infrastructure for evolutionary research: the Baltic Sea. [quoted from memory]
Claudia Köhler spoke about why triploids in Arabidopsis thaliana fail, which is an interesting story involving the endosperm, which in a triploid seed turns out tetraploid, and genomic imprinting. They screened for mutants able to form triploid seeds and found a paternally imprinted gene that is dosage-sensitive and causes the failure of triploid seeds (Kradolfer & al 2013).
Anna Qvarnström and Hans Ellegren talked about different flycatcher projects. I don’t have much clever to say about this right now, except that both projects are really fascinating and impressive. Everyone who cares about genomics in the wild should keep an eye on them.
There were two talks from Umeå Plant Science Centre: Stefan Jansson’s about association mapping in aspen (SwAsp), which sounds fun but difficult with tons of genetic variation, and Pär K. Ingvarsson’s about the Norway spruce genome (Nystedt & al 2013). An interesting observation from the latter was that its gigantic genome size (~20 Gb) apparently isn’t due to whole-genome duplications, but to unchecked transposable element activity. A nice nugget to remember: about half of the sequence, or three to four human genomes, consists of LTR-type repeats.
I’m afraid you will never read very much from me about theory talks. I am an engineer after all, so I don’t fear the equations that much, but most of the time I don’t have the necessary context to have any clue where this particular model fits into the grand scheme of things. However, Jessica Abbott gave a fun talk presenting a model for sexual conflict in hermaphrodites that deserves a special mention.
I did see quite a few genomic plots of Fst outliers and I believe the question that needs answering about them is: What do they really mean? One can do comparisons of comparisons (like in Roger Butlin’s talk and their paper on parallel evolution of morphs in Littorina; Butlin & al 2013), but when it comes to picking out the most differentiated loci on a genome-wide level, are they really the most interesting loci? Are the loci of highest differentiation the loci of adaptation; are they the loci of speciation? (Ellegren’s talk and the flycatcher genome paper; Ellegren & al 2012). It’s a bit like the problem faced by QTL mappers (”now that we’ve got a few genomic regions, what do we do with them?”), with the added complication that we don’t have a phenotype associated with them.
Genetic architecture wasn’t an explicit theme of the meeting, but it always comes up, doesn’t it? Will traits be massively polygenic, dooming researchers to a lifetime search for missing heritability, or relatively simple with a handful of loci? And under what circumstances will either architecture occur? Jon Ågren talked about the fantastic Arabidopsis thaliana in situ QTL mapping experiment. I think it is best illustrated with the video he showed last time I heard him talk about this, Lost in transplantation.
Folmer Bokma used Lego dinosaurs to great effect to illustrate developmental constraints. Also a large part of the talk was quotes from different famous evolutionary biologists. Very memorable, but I’m not sure I understood where he was heading. I was expecting him to start talking about the need for G matrix methods any moment. My lack of understanding is of course my fault as well, not just the speaker’s, and there were a few graphs of gene duplications and gene expression data in primates, but I don’t feel that he showed ”how phylogenetic analyses of genomic data can shed new light on these ideas”, as promised in the abstract.
Possibly the best expression of the meeting: Erik Svensson’s ”next generation fieldwork”. I’m not a fan of the inflation of words ending in -omics (and I sometimes feel ”genomics” should just be ”genetics”), but if we have genomics and proteomics, phenomics is also justified, I guess. As a tongue-in-cheek version ”next generation fieldwork” is spot on. And very true: clever phenotyping strategies in natural populations and natural settings are even more important than rapid sequencing and genotyping. By the way, Erik Svensson, Jessica Abbott, Maren Wellenreuther and their groups have a lab blog which seems nice and active.
And finally, the thing that wasn’t so great, coincidentally the same thing that wasn’t so great at ESEB: the gender balance. Only 7 out of 28 speakers were women. I don’t know to what extent that ratio reflects the gender ratio of Swedish evolutionary biology, but regardless it is too low.
It’s been a while since mid-January, but I’ve been busy (with some fun things — will tell you more later). And maybe we’ll see each other at the next Evolution in Sweden in Lund.
Andrew Gelman sometimes writes that in genetics it might make sense to have a null hypothesis of zero effect, but in social science nothing is ever exactly zero (and interactions abound). I wonder whether that is actually true even for genetics. Think about pleiotropy. Be it universal or modular, I think the evidence still points in the direction that we should expect any genetic variant to affect lots of traits, albeit with often very small effects. And think of gene expression where genes always show lots of correlation structure: do we expect transcripts from the same cells to ever be independent of each other? It doesn’t seem to me that the null can be strictly true here. Most of these differences have to be too small for us to practically be able to model them, though — and maybe the small effects are so far below the detection limit that we can pretend that they could be zero. (Note: not trying to criticise anybody’s statistical method or view of effect sizes here, just thinking aloud about the ”no true null effect” argument.)
A while ago I wrote a bit about the recent paper on epigenetic inheritance of acetophenone sensitivity and odorant receptor expression. I spent most of the post talking about potential problems, but actually I’m not that negative. There is quite a literature building up about these transgenerational effects, which is quite inspiring if a little overhyped. I for one do not think epigenetic inheritance is particularly outrageous or disruptive to genetics and evolution as we know it. Take this paper: even if it means inheritance of an acquired trait, it is probably not very stable over the generations, and it is nothing like a general Lamarckian transmission mechanism that can work for any trait. It is probably very specific for odorant receptors. It might allow for genetic assimilation of fear of odours though, which would be cool, but probably not at all easy to demonstrate. But no-one knows how it works, if it does; there are even multiple unknown steps. How does fear conditioning translate to DNA methylation differences in sperm that translate to olfactory receptor expression in the brain of the offspring?
A while after the transgenerational effects paper I saw this one in PNAS: Rare event of histone demethylation can initiate singular gene expression of olfactory receptors (Tan, Zong & Xie 2013). I had no idea olfactory receptor expression was that fascinating! (As is often the case when you scratch the surface of another problem in biology, there turns out to be interesting stuff there …) Mice have lots and lots of odorant receptor genes, but each olfactory neuron only expresses one of them. Apparently the expression is regulated by histone 3 lysine 9 methylation. The genes start out methylated and suppressed, but once one of them is expressed it will keep all the others down by downregulating a histone demethylase. This is a modeling paper that shows that if random demethylation happens slowly enough and the feedback to shut down further demethylation is fast enough, these steps are sufficient to explain the specificity of expression. There are some connections between histone methylation and DNA methylation: it seems that DNA methylation binds proteins that bring histone methylases to the gene (review Cedar & Bergman 2009). Dias & Ressler saw hypomethylation near the olfactory receptor gene in question, Olfr151. Maybe that difference, if it survives through to the developing brain of the offspring, can make demethylation of the locus more likely and give Olfr151 a head start in the race to become the first expressed receptor gene.
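To get a feeling for the ”slow demethylation, fast feedback” argument, here is a toy simulation. To be clear, this is not the model from the paper, just a sketch of the race idea under strong assumptions: time is in discrete steps, every gene demethylates independently with the same probability per step, and the feedback shuts everything else down a fixed number of steps after the first gene switches on. The number of genes, the probabilities and the delay are arbitrary numbers picked for illustration.

```r
set.seed(42)

# Simulate one neuron; returns TRUE if exactly one gene ends up switched on.
one_neuron <- function(n_genes, p_demethylate, feedback_delay) {
  # Number of steps each gene would wait before demethylating on its own
  waiting <- rgeom(n_genes, p_demethylate)
  first <- min(waiting)
  # Genes that switch on before the feedback from the first one kicks in
  sum(waiting <= first + feedback_delay) == 1
}

# Fraction of simulated neurons that express a single receptor gene
singular_fraction <- function(p_demethylate, n_genes = 1000,
                              feedback_delay = 10, n_neurons = 2000) {
  mean(replicate(n_neurons,
                 one_neuron(n_genes, p_demethylate, feedback_delay)))
}

slow <- singular_fraction(p_demethylate = 1e-5)  # rare demethylation
fast <- singular_fraction(p_demethylate = 1e-3)  # frequent demethylation
c(slow = slow, fast = fast)
```

In this toy version, singular expression is common when demethylation is rare relative to the feedback delay, and breaks down when it is frequent. A head start for one gene, as speculated above for Olfr151, would correspond to giving that gene a higher probability than the rest.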
Brian G Dias & Kerry J Ressler (2013) Parental olfactory experience influences behavior and neural structure in subsequent generations. Nature neuroscience doi:10.1038/nn.3594
Longzhi Tan, Chenghang Zong, X. Sunney Xie (2013) Rare event of histone demethylation can initiate singular gene expression of olfactory receptors. PNAS doi:10.1073/pnas.1321511111
Howard Cedar, Yehudit Bergman (2009) Linking DNA methylation and histone modification: patterns and paradigms. Nature reviews genetics doi:10.1038/nrg2540
Journal club of one: ”Parental olfactory experience influences behavior and neural structure in subsequent generations”
Okay, neither chickens nor genetics, really, but a little epigenetic inheritance. Dias & Ressler in Nature neuroscience:
When an odor (acetophenone) that activates a known odorant receptor (Olfr151) was used to condition F0 mice, the behavioral sensitivity of the F1 and F2 generations to acetophenone was complemented by an enhanced neuroanatomical representation of the Olfr151 pathway.
Meaning that the offspring of conditioned mice score higher in an odour potentiated startle test (more about that below), avoid the odour at a lower concentration in an aversion test and have more neurons expressing that odorant receptor in their olfactory epithelium and bulb, counted by beta-galactosidase staining in transgenic mice expressing M71, the product of Olfr151, coupled to LacZ.
Bisulfite sequencing of sperm DNA from conditioned F0 males and F1 naive offspring revealed CpG hypomethylation in the Olfr151 gene. In addition, in vitro fertilization, F2 inheritance and cross-fostering revealed that these transgenerational effects are inherited via parental gametes.
That is, they detect a difference in methylation in one CpG dinucleotide in the 3′ region of the gene.
First, I love how the journal does exactly the thing I like to see with figures: below each figure is a link that leads to a data file with the underlying data!
Olfactory behaviour is not my thing, so the tests are new to me, but I’m a bit puzzled by the way they calculate the results from the odour potentiated startle tests. The point is to test whether the presence of the odour makes the mice react more strongly to a noise. After buzzing the sound 15 times without odour, they perform ten trials with odour plus sound and ten trials with sound only. But in calculating the score, they use only the difference between the first trial with odour and the last trial with sound only, divided by how much the mouse reacted to the last of the first 15 sounds. Maybe this is standard, but why throw away the trials in between?
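Written out as code, my reading of the score looks like the first function below. This is a sketch of the calculation as described in the paragraph above, not taken from the authors’ methods; any scaling (for instance to per cent) is left out, the function and argument names are mine, and the numbers are made up. The second function is one hypothetical alternative that would use all the trials.

```r
# Score as described above: first odour trial minus last sound-only trial,
# divided by the response to the last of the 15 initial sounds
startle_score <- function(odour_trials, sound_only_trials, initial_trials) {
  first_odour <- odour_trials[1]
  last_sound_only <- sound_only_trials[length(sound_only_trials)]
  last_initial <- initial_trials[length(initial_trials)]
  (first_odour - last_sound_only) / last_initial
}

# A hypothetical alternative that uses every trial
startle_score_all <- function(odour_trials, sound_only_trials, initial_trials) {
  last_initial <- initial_trials[length(initial_trials)]
  (mean(odour_trials) - mean(sound_only_trials)) / last_initial
}

# Made-up startle amplitudes for one mouse
initial <- rep(100, 15)
with_odour <- c(150, 140, 145, 130, 135, 150, 140, 145, 130, 135)
sound_only <- c(110, 105, 100, 110, 105, 100, 110, 105, 100, 100)

startle_score(with_odour, sound_only, initial)
startle_score_all(with_odour, sound_only, initial)
```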
It is not only the odour potentiated startle and the sensitivity tests, but also the staining results. Again, this is not my area, but the results all seem to point to increased sensitivity in the offspring of the treated animals. They react more strongly in the startle test, react at lower concentration in the avoidance test and they (in this case, the transgenic mice) have more neurons expressing M71. The cross fostering and the fact that the males were treated but not the females point to genuine inheritance. So, how does the treatment get into the germline? It has to cross that boundary and enter the sperm somehow. Unless there is some mysterious way for information from the central nervous system to travel to the testis, acetophenone must affect the spermatogenesis as well as the olfactory neurons.
All this is very hypothetical, so a little skepticism is not surprising. Gonzalo Otazu wrote in a comment on the Nature news webpage:
The statistical tests in the paper, both for the behavioral measurements as well as for the size of the M71 glomeruli , use as n, number of samples, the number of F1 and F2 individuals. This would be fine if the individuals were actually independent samples. However, they arise from a presumably small number of FO males. The numbers of FO males are not given in the paper. This is a major concern given that there is a lot of variability in the levels of expression of olfactory receptors in these mice that might be inheritable …
I think this is a good point but it will not be solved, as the comment later suggests, by adjusting the degrees of freedom of the test. From the F1 generation and on, genetic differences between the treatment groups, if they do exist, will amplify into a bias issue. That is, it is a systematic difference that might be bigger or smaller than the treatment effect and go in the same or opposite direction; we don’t know. However, the bias should not be there all the time, and not in the same direction, so it strengthens the authors’ case that they’ve done the treatment at least twice (with C57BL/6J and with M71-LacZ mice, if not more times).
Maybe my preference for genetics is showing, but I feel the big unaddressed alternative hypothesis in most transgenerational effects experiments is cryptic heritability. If you divide individuals into two groups, treat one of them and look for treatment effects in the offspring, you need to be sure that there are not genetic differences between the founders of the two groups. In the subsequent generations, genetic and non-genetic inheritance will be confounded by design.
Again, randomisation and replication will help, but to be really sure, maybe one can use founders of known relatedness to create a mixed population: say, take founders from full-sibships and split them equally between treatment groups, allowing segregation to randomise the genotypes of the next generation. It doesn’t say in the methods, so the authors might even have done something like this. One could even use a genetic mixed model that includes relatedness to estimate treatment effects in the presence of a genetic effect. I have a suspicion this experiment would require a much larger sample size, which means more time, work and animals, but I also believe that many would find confounding genetic variation more plausible than transgenerational epigenetic effects of unknown mechanism.
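A quick simulation shows why a small number of founders matters. This is a sketch with made-up numbers and nothing to do with the actual data: each group descends from a handful of fathers, every father contributes a family effect (heritable or otherwise) shared by his offspring, there is no treatment effect at all, and the offspring are compared with a t-test as if they were independent.

```r
set.seed(1)

# One simulated experiment without any treatment effect; returns the p-value
# from a t-test that treats all offspring as independent observations
simulate_once <- function(n_fathers = 4, n_offspring = 10,
                          family_sd = 1, residual_sd = 1) {
  simulate_group <- function() {
    family_effect <- rnorm(n_fathers, sd = family_sd)
    rep(family_effect, each = n_offspring) +
      rnorm(n_fathers * n_offspring, sd = residual_sd)
  }
  control <- simulate_group()
  treated <- simulate_group()
  t.test(control, treated)$p.value
}

# Share of simulated experiments with p < 0.05 when families differ ...
clustered <- mean(replicate(1000, simulate_once()) < 0.05)

# ... and when there are no family effects
independent <- mean(replicate(1000, simulate_once(family_sd = 0)) < 0.05)

c(clustered = clustered, independent = independent)
```

With family effects, the share of false positives ends up far above the nominal five per cent, which is the bias issue in a nutshell: the families are the real units of replication, and a few families can easily differ between groups by chance.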
Brian G Dias & Kerry J Ressler (2013) Parental olfactory experience influences behavior and neural structure in subsequent generations. Nature neuroscience doi:10.1038/nn.3594
Who still uses gene expression microarrays? I do and lots of other people do. And even though it’s pretty clear that RNA-seq is better, as long as it’s more expensive — and it probably still is for many combinations of microarray and sequencing platforms — the trade-off between the technical variability and sample size should still favour microarrays. But the breaking point probably occurs about right now, and I’m looking forward to seeing lots of sequencing based genetical genomics with splice-eQTL, antisense RNA-eQTL and what not! But then again, the same might happen for RNA-seq in a few years: I hope people stick with current generation massively parallel sequencing long enough to get decent sample sizes instead of jumping to small-N studies with the next technology.
Journal club of one: ”Short copy number variations potentially associated with tonic immobility response in newly hatched chicks”
(‘Journal club of one’ will be quick notes on papers, probably mostly about my favourite topics — genetics and the noble chicken.)
Abe, Nagao & Inoue-Murayama (2013) recently published this paper in PLOS ONE about copy number variants and tonic immobility in two kinds of domestic chicken. This obviously interests me for several reasons: I’m working on the genetic basis of some traits in the chicken; tonic immobility is a fun and strange behaviour (how it works and if it has any adaptive importance is pretty much unknown, but it is a classic from the chicken literature); and the authors use QTL regions derived directly from the F2 generation of the cross that I’m working on (we’ve published one paper so far on the F8 generation).
Results: They use arrays and qPCR to search for copy number variants in three regions on chromosome one in two breeds (White Leghorn and Nagoya, a Japanese breed). After quite a bit of filtering they end up with a few variants that differ between the breeds. The breeds also differ in their tonic immobility behaviour with Leghorns going into tonic immobility after three attempts on average and lying still for 75 s and Nagoya taking 4.5 attempts and lying for 100 s on average. But the copy number variants were not associated with tonic immobility attempts or duration within breeds, so there is not really any evidence that they affect tonic immobility behaviour.
Apart from the issue that the regions (more than 60 Mb) will contain lots of other variants, we do not know whether these regions affect tonic immobility behaviour in these breeds in the first place. The intercross that the QTL come from is a wild by domestic Red Junglefowl x White Leghorn cross, and while Nagoya seem a very interesting breed that is distant from White Leghorn they are not junglefowl. When it comes to the Leghorn side of the experiments, I wouldn’t be surprised if White Leghorns bred at a Swedish research institute and a Japanese research institute differed quite a bit. The breed difference in tonic immobility is not necessarily due to the genetic variants identified in this particular cross, especially since behaviour is probably very polygenic, and an F2 QTL study by necessity only scratches the surface.
In the discussion the authors bring up power: There were 71 Nagoya and 39 White Leghorn individuals and the experiment might be unable to reliably detect associations within the breeds. That does seem likely, but making a good informed guess about the expected effect is not so easy. A hint could come from looking at the effect sizes in the QTL study, but there is no guarantee that genetic background will not affect them. I don’t really know where this calculation comes from: ”Sample sizes would need to be increased more than 20-fold over the current study design”. Maybe 11 tested copy number variants times two breeds? To me, that seems both overly optimistic, because it assumes that the entire breed difference would be due to these three QTL on chromosome 1, and overly pessimistic, since it assumes that the three QTL would fractionate into 11 variants.
Finally, with all the diversity in the chicken, there’s certainly a place both for within and between population studies of various chickens with all kinds of genomics! Comparing breeds with different selection histories should be very interesting for distinguishing early ‘domestication QTL’ from ‘productivity QTL’ selected under modern chicken breeding. And I wish somebody would figure out a little more about how tonic immobility works.
Abe H, Nagao K, Inoue-Murayama M (2013) Short Copy Number Variations Potentially Associated with Tonic Immobility Responses in Newly Hatched Chicks. PLoS ONE 8(11): e80205. doi:10.1371/journal.pone.0080205