I’m attending a course about scientific publishing, and the other day there was a lecture about bibliometrics by Lovisa Österlund and David Lawrence from the Linköping University library. I don’t think I know anyone who particularly likes bibliometrics, but I guess it makes sense that if one needs to evaluate research without trying to understand what it is about, there are only citations, the reputation of the publication channel and the CV of the researcher to look at. I imagine it’s a bit like reviewing a novel in a language one doesn’t know. A couple of things occurred to me, though.
What to do when different instruments of evaluation give different results? Take the two papers (so far) published during my PhD: they both deal with the genetics of chicken comb size; one is published in PLOS Genetics and one in Molecular Ecology. If we look at journal impact factors (and we shouldn’t, but say that we do), PLOS Genetics comes out ahead with an impact factor of 8.5 against 6.3. For those who are not familiar with it, the journal impact factor is the mean number of citations in a given year to papers the journal published in the previous two years, calculated by Thomson Reuters in their own secret way. However, Linköping University has for some reason decided to use the Norwegian index for evaluating publication channels. I don’t know why, and I don’t think it matters that much for me personally, since the system will change soon and I will finish in about a year and a half. In the Norwegian system journals are ranked as level one or two, where two is better and is supposed to represent the top 20% of that subject area. According to their database, Molecular Ecology is level 2, while PLOS Genetics is level 1. The source of the discrepancy is probably that PLOS Genetics is counted as biomedicine, while Molecular Ecology is biology, according to the Norwegian database.
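As a rough illustration of the arithmetic behind a two-year impact factor, here is a minimal R sketch. It follows the usual description: citations received in one year to items from the two preceding years, divided by the number of citable items from those two years. The function name and the numbers are made up for illustration; they are not the actual figures for any journal, and the details of what counts as a citable item are exactly the part I don’t know.

```r
## Two-year impact factor as it is usually described: citations received in a
## given year to items published in the two preceding years, divided by the
## number of citable items published in those two years.
## Hypothetical helper and made-up numbers, for illustration only.
impact_factor <- function(citations_to_previous_two_years, citable_items) {
  citations_to_previous_two_years / citable_items
}

example_if <- impact_factor(citations_to_previous_two_years = 1700,
                            citable_items = 200)
print(example_if)
```

Since it is a mean over a very skewed distribution, a handful of highly cited papers can move the number a lot, which is one reason not to read too much into it.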
They also mentioned Altmetrics, and I don’t know what to make of it. On one hand, I guess it’s good to keep tabs on social media. On the other hand, what do numbers of tweets really tell you, except that one of the authors has a Twitter account? One of the examples in the lecture was the metrics page for this paper that I happen to be a contributor to. It is actually pretty strange. It shows three tweets or 11 tweets, depending on where on the page you look. Also, when I accessed this page earlier today it linked a blog. Now it doesn’t. That says something about the ephemeral nature of internet media. Regardless, when I first saw the page I thought perhaps the metrics page had picked up on my post about the paper, but that was not the case. I don’t know how altmetric.com defines a ”science blog”; maybe the blog has to be listed on some aggregation site or another, and I’m not pretending my post is particularly insightful or important. Still, it’s a little strange that the altmetrics page doesn’t list a post by one of the authors about the paper, but listed a post that referred to the paper with only two sentences and was mistaken about the conclusion.
I believe there is an analogy between the dual meaning of the word ”gene” and two senses of epigenetics, that this distinction is easy to get wrong and that it contributes to the confusion about the meaning of epigenetics. Gene can mean a sequence that has a name and a function, or it can mean a genetic variant. I sometimes, half-jokingly, call this genetics(1) and genetics(2). The order is wrong from a historical perspective, since the study of heritable variation predates the discovery of molecular genes. The first deals with the function of sequences and their products. The second deals with differences between individuals carrying different variants.
The same can be said about epigenetics. On one hand there is epigenetics(1), aiming to understand the normal function of certain molecular features, i.e. gene regulatory states that can be passed on through cell division. On the other hand, epigenetics(2) aims to explain variation between individuals that differ not in their DNA sequence but in other types of heritable states. And the recurring reader knows that I think that, since a lot of genetics(2) makes no assumptions about the molecular nature of the variation it studies, it will mostly work even if some of these states turn out to be epigenetic. In that sense, epigenetics(2) is a part of genetics.
E.O. Wilson: This is going to be a conversation that I will have with B.F. Skinner. This is Ed Wilson. He invited me to talk about sociobiology. Our relations have always been very friendly and I look forward to it. This should be an interesting talk this Thursday morning.
B.F. Skinner: We will start with a basic statement. I assume that you are what I call a behaviorist. You would accept that an organism is a biophysical and biochemical system, a product of evolution.
E.O. Wilson: I am.
B.F. Skinner: That would include not only genetic behavior, but also the kinds of behavior that can be learned because of genetic processes. Of course it (behavior) always goes back to genetics.
Naour, P (2009) E. O. Wilson and B. F. Skinner. A dialogue between sociobiology and radical behaviorism. New York: Springer.
Journal club of one: ”Maternal and additive genetic effects contribute to variation in offspring traits in a lizard”
The posts this week have been about epigenetics. However, let’s step back from the molecular mechanisms and what not to look at the bigger picture. This recent paper by Noble, McFarlane, Keogh and Whiting (2014) looks at maternal effects and additive genetic effects on fitness-related traits in a lizard. Now we are in quantitative genetics territory where one uses pedigrees and phenotypes to look at the determinants of a trait while abstracting away the mechanistic details. Nowadays, quantitative genetics is also equipped with Bayesian animal models and the ability to do parentage assignment with molecular methods.
The authors measured size, body mass and growth, as well as speed and endurance when running. The fun part is that while only endurance had a substantial heritability (0.4), the other traits had maternal components in the 0.2-0.5 range. So for most of the traits there’s little heritability while a big chunk of the trait variance is explained by maternal effects.
I like the idea of including maternal traits to look at what causes the maternal effect. Clutch size, maternal size and condition seem to matter for some trait or another. In two cases the maternal effect is entirely explained away: the effect on growth by birth date and clutch size, and sprint speed by birth date.
The inferences come from an animal model that includes a maternal effect. Something I’m curious about is how much heritability would be overestimated if the maternal component was not accounted for. That is beside the point of the paper, though.
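One way to get a feeling for the size of such a bias, as a back-of-the-envelope sketch rather than anything taken from the paper: in a simple full-sib design, siblings share half of the additive genetic variance and all of the maternal variance, so doubling the sibling intraclass correlation gives (VA + 2 VM) / VP instead of VA / VP. The variance values below are invented, and the assumptions (full sibs only, no dominance, no shared environment other than the maternal effect) are mine. The bias in a real animal model will depend on the pedigree structure.

```r
## Back-of-the-envelope sketch with made-up variance components.
## Assumptions: full-sib families, no dominance or epistasis, and the only
## shared environment is the maternal effect.
true_h2 <- function(VA, VM, VE) {
  VA / (VA + VM + VE)
}

naive_full_sib_h2 <- function(VA, VM, VE) {
  VP <- VA + VM + VE
  ## Covariance between full sibs: half the additive variance plus
  ## the maternal variance that they share
  sib_covariance <- VA / 2 + VM
  ## Doubling the intraclass correlation is the naive heritability estimate
  2 * sib_covariance / VP
}

## Hypothetical values: little additive variance, sizeable maternal variance
VA <- 0.1
VM <- 0.3
VE <- 0.6

print(true_h2(VA, VM, VE))
print(naive_full_sib_h2(VA, VM, VE))
```

With these made-up numbers the naive estimate is several times the true heritability, because every unit of maternal variance is counted twice.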
Another interesting point: I think everyone who deals with animals in some type of controlled environment wonders about how much our measurements differ from what would’ve been measured in a more natural environment. In this case, the authors measured offspring growth both in the test environment and in an enclosure. They find a maternal effect in the test environment, while the interval for the heritability goes from almost zero to 0.5. In the wilder environment they estimate very little genetic and maternal variance, as well as a larger residual variance. I don’t know if this is just because of increased noise, or because maternal effects actually interact with condition.
Also, I love figure 1 (the one figure). If more papers had caterpillar plots of the most important estimated quantities, the world would be a better place.
Paper: ”Heritable genome-wide variation of gene expression and promoter methylation between wild and domesticated chickens”
Since I love author blog posts about papers, I thought I’d write a little about papers I’ve contributed to. So far, they’re not that many, but maybe it can be a habit.
”Heritable genome-wide variation of gene expression and promoter methylation between wild and domesticated chickens” was published in BMC Genomics in 2012. The title says it very well: the paper looks at differential expression and DNA methylation of a subset of genes in the hypothalamus of Red Junglefowl and domestic White Leghorn chickens. My contribution was during my MSc project in the group. Previously (Lindqvist & al 2007; Nätt & al 2009) Daniel Nätt, Pelle Jensen and others found a transgenerational effect of unpredictable light stress on domestic chickens. After that, and given the group’s interest in chicken domestication, a DNA methylation comparison of wild and domestic chickens seemed like a natural thing to do. And it turns out Red Junglefowl and White Leghorns differ in expression of a bunch of genes and in methylation of certain promoters (where promoter is operationally defined as a region around the start of the gene model). And when looking at two generations, the contrasts are correlated between parent and offspring. There is some heritable basis of the differences in gene expression and DNA methylation.
In Red Junglefowl, ancestor of domestic chickens, gene expression and methylation profiles in thalamus/hypothalamus differed substantially from that of a domesticated egg laying breed. Expression as well as methylation differences were largely maintained in the offspring, demonstrating reliable inheritance of epigenetic variation.
What I did was methylation sensitive high resolution melting. HRM is a typing method based on real time PCR. After PCR you often make a melting curve by ramping up the temperature, denaturing the PCR product. The melting characteristics depend on the sequence, so you can use melting to check that you get the expected PCR product, and it turns out that the difference can be big enough to type SNPs. And if you can type SNPs, you can analyse DNA methylation. So we treat the DNA with bisulfite, which deaminates cytosines to uracil unless they are protected by methylation, and get a converted sequence where an unmethylated C is like a C>T SNP. We set up standard curves with a mixture of whole-genome amplified and in vitro methylated DNA and measured the degree of methylation.
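To make the ”unmethylated C is like a C>T SNP” point concrete, here is a toy R sketch of what bisulfite treatment followed by PCR does to the sequence of one strand. The function name, the example sequence and the methylated position are all made up, and the sketch ignores the other strand and incomplete conversion.

```r
## Toy illustration, not part of the paper's analysis: after bisulfite
## treatment and PCR, unmethylated C reads as T, methylated C stays C.
## Only one strand is considered and conversion is assumed to be complete.
bisulfite_convert <- function(sequence, methylated_positions = integer(0)) {
  bases <- strsplit(sequence, "")[[1]]
  is_c <- bases == "C"
  protected <- seq_along(bases) %in% methylated_positions
  bases[is_c & !protected] <- "T"
  paste(bases, collapse = "")
}

## Hypothetical amplicon with a CpG at positions 2-3
amplicon <- "ACGTCCGA"
unmethylated <- bisulfite_convert(amplicon)
methylated_cpg <- bisulfite_convert(amplicon, methylated_positions = 2)
print(unmethylated)
print(methylated_cpg)
```

The two converted products differ at a single base, which is the kind of difference the melting curve has to pick up.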
That is averaging over the population of DNA molecules in the sample; I’ve been wondering how HRM performs when the CpGs in the amplicon have heterogeneous methylation differences. We’ve used HRM for genotyping as well, and it works, but we’ve switched to pyrosequencing, which gives cleaner results and where the assay design is much easier to get right the first time. I don’t know whether the same applies for methylation analysis with pyro.
My favourite part of the paper is figure 4b (licence: cc:by 2.0) which shows methylation analysis in the advanced intercross of Red Junglefowl and White Leghorns, which immediately leads to, as mentioned in the paper, the thought of DNA methylation QTL mapping.
Nätt D, Rubin CJ, Wright D, Johnsson M, Beltéky J, Andersson L, Jensen P (2012) Heritable genome-wide variation of gene expression and promoter methylation between wild and domesticated chickens. BMC Genomics 13: 59.
Lindqvist C, Janczak AM, Nätt D, Baranowska I, Lindqvist N, et al. (2007) Transmission of Stress-Induced Learning Impairment and Associated Brain Gene Expression from Parents to Offspring in Chickens. PLoS ONE 2(4): e364. doi:10.1371/journal.pone.0000364
Nätt D, Lindqvist N, Stranneheim H, Lundeberg J, Torjesen PA, et al. (2009) Inheritance of Acquired Behaviour Adaptations and Brain Gene Expression in Chickens. PLoS ONE 4(7): e6405. doi:10.1371/journal.pone.0006405
What is it that is so scandalous about epigenetic inheritance? Not much, in my opinion. Some of the points on the spectrum clearly happen in the wild: stable and fluctuating epigenetic inheritance in plants, parental effects in animals and genomic imprinting in both. Widespread epigenetic inheritance in animals would change a lot of things, of course, but even if epigenetic inheritance turns out to be really important and common, genetics and evolution as we know them will not break. The tools to study and understand them are there.
Looking back at the post from yesterday, there are different flavours of epigenetic inheritance. At the most heritable end of the spectrum, epigenetic variants behave pretty much like genetic variants. Because quantitative genetics is agnostic to the molecular nature of the variants, as long as they behave like an inheritance system, most high-level genetic analysis will work the same. It’s just that on the molecular level, one would have to look to epigenetic marks, not to sequence changes, for the causal variant. Even if a substantial proportion of the genetic variance is caused by epigenetic variants rather than DNA sequence variants, this would not be a revolution that changes genetics or evolution into something incommensurable with previous thought.
The most revolutionary potential lies somewhere in the middle of the scale, in parental effects with really high fidelity of transmission that are potentially responsive to the environment, but in principle these things can still be dealt with by the same theoretical tools. Most people just didn’t think they were that important. How about soft inheritance? It seems dramatic, but all examples deal with specific programmed mechanisms: soft inheritance of the sensitivity to a particular odour or of the DNA methylation and expression state of a particular locus. No-one has yet suggested a generalised Lamarckian mechanism; that is still out of the question. DNA mutations are still unable to pass from somatic cells to gametes. Whatever tricks transgenerational mechanisms use to skip over the soma–germline distinction, they must be pretty exceptional. Discoveries of widespread soft inheritance in nature would be surprising, a cause for rethinking certain things and great fun. But conceptually, it is parental effects writ large. We can understand that. We have the technology.
Let us think aloud about the different possible meanings of epigenetic inheritance. I don’t want to contribute to unnecessary proliferation of terminology — people have already coined molar/molecular epigenetics (Crews 2009), intergenerational/transgenerational effects (Heard & Martienssen 2014), and probably several more dichotomies. But I thought it could be instructive to try to think about epigenetic inheritance in terms of the contribution it could make to variance components of a quantitative genetic model. After all, quantitative genetics is mostly agnostic about the molecular nature of the heritable variation.
At one end of the spectrum we find molecular epigenetic marks such as DNA methylation, as they feature in the normal development of the organism. Regardless of how faithfully they are transmitted through mitosis, or even if they pass through meiosis, they only contribute to individual variation if they are perturbed in different ways between individuals. If they do vary between individuals, though, in a fashion that is not passed on to the offspring, they will end up in the environmental variance component.
What about transmissible variation? There are multiple non-genetic ways for information to be passed on for a single generation: maternal or paternal effects need not be epigenetic in the molecular sense. They could be, like genomic imprinting, but they could also be caused by some biomolecule in the sperm, something that passes the blood–placenta barrier or something deposited by the mother into the egg. Transgenerational effects of this kind make related individuals more similar, so they will affect the genetic variance component unless they are controlled. And in the best possible world of experimental design, parental effects can be controlled and modelled, and we can in principle separate out the maternal, paternal and genetic component. Think of effects like in Weaver & al (2004) that are perpetuated by maternal behaviour. If the behavioural transmission is strong enough they might form a pretty stable heritable effect that would appear in the genetic variance component if it’s not broken up by cross-fostering.
However, if the variation behaves like germ-line variation it will be irreversible by cross-fostering, inseparable from the genetic variance component, and it will have the potential to form a genuine parallel inheritance system. The question is: how stable will it be? Animals seem to be very good at resetting the epigenetic germline each generation. The most provocative suggestion is probably some type of variation that is both faithfully transmitted and sometimes responsive to the environment. Responsiveness means less fidelity of transmission, though, and it seems (Slatkin 2009) like epigenetic variants need to be stable for many generations to make any lasting impact on heritability. Then, at the heritable end of the spectrum, we find epigenetic variants that arise from some type of random mutation event and are transmitted faithfully through the germline. If they exist, they will behave just like any genetic variants and even have a genomic locus.
I’ve had this blog since 2010, but it was not until last year that I started writing anything other than popular science in Swedish. There is lots of discussion on academic blogs about whether PhD students, or any academics, should write on blogs or not and also quite a bit of fear, uncertainty and doubt going around. This is what I think: I don’t think my blog is such a big deal. It’s just a small hobby project that makes me happy. And while I hope it doesn’t hurt my research or my chances to continue doing science, I don’t think it helps them much either.
Do I have a target audience? There was recently a small survey to find what academics blog about and why; they found that most blogs were directed at peers, not for outreach. I’m not surprised. As I’ve already mentioned, my posts in Swedish are more popular science, less technical and sometimes deal with things published in Swedish media. I think the target audience is still geeks of some kind, but not necessarily genetics geeks. My posts in English are more directed at academic things, either related to my research and work as a PhD student or about the R language. So my posts are a mix of languages and themes. Is that a problem? From a popularity or readership perspective, probably yes. I can see little reason not to split the posts into two blogs, each concentrated on one theme, except that I don’t feel like running two blogs.
Does blogging hurt me because it hurts my work? I hardly think so. First, blogging is not part of my duties at the university, and I don’t do it instead of writing, working in the lab or analysing data. I do it in the evening after work, or in the case of some posts in the morning before. I’m not convinced blogging makes me in any way a better scientist, but it can hardly make me worse. Thinking about science or how to explain it for another hour now and then can’t hurt. And yes, the time spent blogging could theoretically be spent writing papers or something, but so could theoretically the time spent at the gym, with family or friends. If we grant that academics do other things, blogging could be one of those activities. My blog is not completely disconnected from my work, but I think it’s disconnected enough to be regarded as a fun pastime.
Does blogging hurt my reputation because people might read my blog and disapprove? I don’t think that many people read my blog; actually, I know that not many people do. Still, it is certainly possible that some of the readers might be important to my career and that they don’t like what they see. It will be found when people look me up with a search engine. Maybe someone thinks that I’m wasting my time, or maybe I’ve written something controversial — or more likely, something stupid. I think and say things that are mistaken all the time, and some of those mistakes might end up in a blog post. The point is, though, that expressing my opinion about things I care about is not something I do because I think it’ll further my career. I do it because I want to. If my writing is successful, the things on my blog will be the kinds of things I honestly know, think and believe about science.
The %.% operator in dplyr allows one to put functions together without lots of nested parentheses. The flanking percent signs are R’s way of denoting infix operators; you might have used %in%, which corresponds to the match function, or %*%, which is matrix multiplication. The %.% operator is also called chain, and what it does is rearrange the call to pass its left hand side on as a parameter to the right hand side function. As noted in the documentation this makes function calls read from left to right instead of inside and out. Yesterday we took a simulated data frame, called data, and calculated some summary statistics. We could put the entire script together with %.%:
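library(reshape2) ## melt comes from reshape2
library(dplyr)
## %.% is the chain operator of early dplyr; later versions use %>% instead
data %.%
  melt(id.vars=c("treatment", "sex")) %.%
  group_by(sex, treatment, variable) %.%
  summarise(mean(value))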
I haven’t figured out what would be the best indentation here, but I think this looks pretty okay. Of course it works for non-dplyr functions as well, but they need to take the input data as their first argument.
data %.% lm(formula=response1 ~ factor(sex)) %.% summary()
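To see what ”rearranging the call” amounts to, here is a minimal sketch of a chain-like operator in base R. It is not dplyr’s actual implementation; the name %then% and the internal name .lhs are made up for illustration. It captures the call on the right hand side unevaluated, inserts the left hand side as its first argument, and evaluates the result.

```r
## A hypothetical chain operator, not dplyr's implementation: it inserts the
## left-hand side as the first argument of the call on the right-hand side.
`%then%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)
  if (!is.call(rhs_call)) {
    stop("the right-hand side must be a function call, such as sum()")
  }
  ## Rebuild the call as f(.lhs, <original arguments>)
  new_call <- as.call(c(list(rhs_call[[1]], quote(.lhs)),
                        as.list(rhs_call)[-1]))
  ## Evaluate with .lhs bound to the left-hand side value; everything else
  ## is looked up where the operator was called
  eval(new_call, list(.lhs = lhs), parent.frame())
}

## Reads left to right ...
chained <- c(1, 4, 9) %then% sqrt() %then% sum()
## ... instead of inside and out
nested <- sum(sqrt(c(1, 4, 9)))
print(chained)
print(nested)
```

Because the left hand side always lands in the first unnamed position, naming the formula argument, as in the lm example, is what lets the data end up in the data parameter.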
As mentioned, dplyr is not the only package that has something like this, and according to a comment from Hadley Wickham, future dplyr will use the magrittr package instead, a package that adds piping to R. So let’s look at magrittr! The magrittr %>% operator works much the same way, except it allows one to put ”.” where the data is supposed to go. This means that the data doesn’t have to be the first argument to the function. For example, we can do this, which would give an error with dplyr:
library(magrittr)
data %>% lm(response1 ~ factor(sex), .) %>% summary()
Moreover, Konrad Rudolph has used the operators %.%, %|>% and %|% in his own package for functional composition, chaining and piping. And I’m sure he is not the only one; there are several more packages that bring more new ways to define and combine functions into R. I hope I will revisit this topic when I’ve gotten used to it and decided what I like and don’t like. This might be confusing for a while with similar and rather cryptic operators that do slightly different things, but I’m sure it will turn out to be a useful development.
I know I’m on about Hadley Wickham’s packages a lot. I’m not the president of his fanclub, but if there is one I’d certainly like to be a member. dplyr is going to be a new and improved ddply: a package that applies functions to, and does other things to, data frames. It is also faster and will work with other ways of storing data, such as R’s relational database connectors. I use plyr all the time, and obviously I want to start playing with dplyr, so I’m going to repeat yesterday’s little exercise with dplyr. Readers should be warned: this is really just me playing with dplyr, so the example will not be particularly profound. The post at the RStudio blog that I just linked contains much more information.
So, here comes the code to do the thing we did yesterday but with dplyr:
## The code for the toy data is exactly the same
data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

## reshape2 still does its thing:
library(reshape2)
melted <- melt(data, id.vars=c("sex", "treatment"))

## This part is new; variable is included in the grouping so that
## each response gets its own summary
library(dplyr)
grouped <- group_by(melted, sex, treatment, variable)
summarise(grouped, mean=mean(value), sd=sd(value))
When we used plyr yesterday all was done with one function call. Today it is two: dplyr has a separate function for splitting the data frame into groups. It is called group_by and returns the grouped data. Note that no quotation marks or concatenation were used when passing the column names. This is what it looks like if we print it:
Source: local data frame [4,000 x 4]
Groups: sex, treatment, variable

   sex treatment  variable       value
1    1         1 response1 -0.15668214
2    1         2 response1 -0.40934759
3    1         1 response1  0.07103731
4    1         2 response1  0.15113270
5    1         1 response1  0.30836910
6    1         2 response1 -1.41891407
7    1         1 response1 -0.07390246
8    1         2 response1 -1.34509686
9    1         1 response1  1.97215697
10   1         2 response1 -0.08145883
The grouped data is still a data frame, but it contains a bunch of attributes that contain information about grouping.
The next step is a call to the summarise function, a new version of the function of the same name in plyr. It will summarise the grouped data in columns given by the expressions you feed it. Here, we calculate mean and standard deviation of the values.
Source: local data frame [8 x 5]
Groups: sex, treatment

  sex treatment  variable         mean        sd
1   1         1 response1  0.021856280 1.0124371
2   1         1 response2  0.045928150 1.0151670
3   1         2 response1 -0.065017971 0.9825428
4   1         2 response2  0.011512867 0.9463053
5   2         1 response1 -0.005374208 1.0095468
6   2         1 response2 -0.051699624 1.0154782
7   2         2 response1  0.046622111 0.9848043
8   2         2 response2 -0.055257295 1.0134786
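For comparison, the split-apply-combine idea behind group_by followed by summarise can be sketched in base R. This is only meant to show the logic, not how dplyr does it internally; the tiny data frame called toy is invented so that the group means are easy to check by eye.

```r
## Base R sketch of split-apply-combine on a small made-up data frame
toy <- data.frame(sex = rep(c(1, 2), each = 4),
                  treatment = rep(c(1, 2), 4),
                  value = c(1, 2, 3, 4, 5, 6, 7, 8))

## Split the values by every combination of sex and treatment;
## the groups are named "sex.treatment", for example "1.2"
groups <- split(toy$value, list(toy$sex, toy$treatment))

## Apply a summary function to each group and combine into named vectors
group_means <- sapply(groups, mean)
group_sds <- sapply(groups, sd)

print(group_means)
print(group_sds)
```

The grouped data frame in dplyr plays the role of the split step, and summarise does the apply and combine steps while keeping the grouping columns in the result.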
Maybe the new syntax is slightly clearer. Of course, there are alternative ways of expressing it, one of which is pretty interesting. Here are two equivalent versions of the dplyr calls:
summarise(group_by(melted, sex, treatment, variable), mean=mean(value), sd=sd(value))

melted %.% group_by(sex, treatment, variable) %.% summarise(mean=mean(value), sd=sd(value))
The first one is nothing special: we’ve just put the group_by call into summarise. The second version, though, is a strange creature. dplyr uses the operator %.% to denote taking what is on the left and putting it into the function on the right. Reading from the beginning of the expression we take the data (melted), push it through group_by and pass it to summarise. The other arguments to the functions are given as usual. This may seem very alien if you’re used to R syntax, or you might recognize it from shell pipes. This is not the only attempt to make R code less nested and full of parentheses. There doesn’t seem to be any consensus yet, but I’m looking forward to a future where we can write point-free R.