Posts Tagged ‘Paper reviews’

Lots of links about commenting...

Thursday, May 28th, 2009

A Blog Around the Clock has a list of links to blog posts about commenting on (scientific) papers.

There have been quite a few posts over the last few days about commenting, in particular about posting comments, notes and ratings on scientific papers. But this is also related to commenting on blogs and social networks, commenting on newspaper online articles, the question of moderation vs. non-moderation, and the question of anonymity vs. pseudonymity vs. RL identity.

Read the post to get all the links.

I must admit that I have never left a comment on an online paper. If I blog about a paper, I leave a trackback, but that is as far as it goes.

Since putting a review of a paper on my blog, just to add a comment, is a lot of work, I guess I should just get used to leaving comments instead.

Still, I am reluctant to comment on papers. I don't mind firing off a half-thought-through comment at a blog, but I feel that for a scientific paper I should make sure I understand all the details of the paper before I start commenting on it. I guess I just have to overcome that feeling.

Simultaneous analysis of all SNPs in a genome-wide association study

Monday, September 15th, 2008

In our association mapping journal club a few weeks back, we discussed this paper (I just never got around to writing down my thoughts on it until now):

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies

Hoggart, Whittaker, De Iorio and Balding, PLoS Genetics 2008

Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.

I had already heard about the method when I was visiting Imperial College to give a seminar last year, so I am happy that I can finally talk about it.

It is a pretty neat idea.

Regression analysis in association mapping

If you want to figure out which variables are important for predicting some property, a good old statistical approach is regression analysis.

For a binary property, such as case or control in an association study, you could use logistic regression, but in general you construct some linear function of your predictors and transform it into the "property space" through a link function.

This setup gives you a "model", and depending on the link and the setup you have different ways of interpreting this as a statistical model with a corresponding likelihood function.

The coefficients in the linear combination of predictors are the parameters of the model, and you typically maximize the likelihood with respect to them to get your estimates.

In some cases you can directly interpret the coefficients, but more often than not you are only interested in knowing whether there is strong evidence in the data that they should be non-zero, i.e. that the predictor in question actually has an effect on the property.

In an association study, you would use your SNPs as the predictors and consider those SNPs with a non-zero coefficient to be associated with the disease.
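Written out for the case-control setting, the setup described above looks roughly like this. The symbols are my own shorthand rather than the paper's notation: y_i is the case/control status of individual i, x_ij the genotype at SNP j, and beta_j the coefficient for SNP j.

```latex
\operatorname{logit} P(y_i = 1 \mid x_i)
  = \log\frac{p_i}{1 - p_i}
  = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij},
\qquad
L(\beta) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}
```

Here the logit is the link function, and a SNP is declared associated when there is evidence that its beta_j differs from zero.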

Of course, it is never as simple as that.

Two things complicate matters. First, your best estimate of a coefficient will never actually be zero, so you want to test whether it is significantly different from zero. Second, you have many more parameters (SNPs) than you have outcomes (individuals), so you will overfit like hell.

Strong "zero" priors

What they do in this paper is both simple and very clever.

They consider the problem in a Bayesian setting and put strong priors on the coefficients. These priors tend to keep the coefficients at zero unless the signal in the data is strong enough to pull them away from there.

They then test for association by checking whether the posterior modes of these parameters have moved away from zero.

A very nice consequence of this is that you can analyse the entire data set at the same time, rather than testing markers individually. If several markers are in LD with a causal marker, you will tend to pick only one of them and recognize that the signal in the others is essentially the same signal.

It also seems quite computationally feasible.  A few hours on a desktop computer to analyse a GWA data set.
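To make the "pick only one of the markers in LD" point concrete, here is a minimal sketch of a posterior-mode fit with a sharp-at-zero prior. It is not the method from the paper: I am assuming a Laplace prior (which turns the posterior mode into an L1-penalised maximum likelihood estimate) instead of the normal-exponential-gamma prior, and plain proximal gradient descent instead of the authors' search. The function names, the penalty `lam`, and the eight-individual toy data set are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, threshold):
    """Shrink value towards zero; anything inside the threshold becomes exactly 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_sparse_logistic(genotypes, status, lam, step=1.0, iterations=5000):
    """Posterior mode of a logistic model with a Laplace prior (assumed stand-in prior).

    genotypes: one row per individual, one 0/1/2 count per SNP.
    status:    1 for cases, 0 for controls.
    lam:       penalty strength; larger values pin more coefficients at zero.
    """
    n = len(status)
    p = len(genotypes[0])
    # Centre each SNP column so the (unpenalised) intercept absorbs the baseline risk.
    means = [sum(row[j] for row in genotypes) / n for j in range(p)]
    x = [[row[j] - means[j] for j in range(p)] for row in genotypes]
    intercept = 0.0
    beta = [0.0] * p
    for _ in range(iterations):
        # Residual = predicted risk minus observed status, for the current coefficients.
        residual = [
            sigmoid(intercept + sum(b * v for b, v in zip(beta, x[i]))) - status[i]
            for i in range(n)
        ]
        grad = [sum(residual[i] * x[i][j] for i in range(n)) / n for j in range(p)]
        intercept -= step * sum(residual) / n
        # Gradient step on the likelihood, then shrink towards zero (the prior's pull).
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam) for j in range(p)]
    return intercept, beta

# Toy data: SNP 0 tracks disease status; SNP 1 is a noisier copy of SNP 0 (in LD).
status = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = [[2, 2], [2, 1], [1, 1], [1, 1], [0, 0], [0, 1], [1, 1], [1, 1]]
intercept, beta = fit_sparse_logistic(genotypes, status, lam=0.05)
print(beta)
```

On this toy data the joint fit leaves the second SNP at exactly zero, even though the same SNP gets a non-zero coefficient when it is fitted on its own. That is the behaviour described above: once the better marker has explained the signal, the marker in LD with it has nothing left to add.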


Clive J. Hoggart, John C. Whittaker, Maria De Iorio, David J. Balding (2008). Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies. PLoS Genetics, 4(7), e1000130. DOI: 10.1371/journal.pgen.1000130

Investigating Selection on Viruses: A Statistical Alignment Approach

Wednesday, June 18th, 2008

Woohoo, we just got a paper accepted. Although it is at BMC Bioinformatics, it isn't one of the papers I've been bitching about -- this one we got very helpful reviews on.

It is work from when I was in Oxford. Saskia de Groot did analysis of virus genomes for her PhD (see papers here and here), but for viruses that are relatively divergent, getting good alignments is a bit of a problem, so I suggested we take a statistical alignment approach to integrate over the uncertainty. So we got together with Gerton Lunter -- who does work with this -- and came up with this:

Investigating Selection on Viruses: A Statistical Alignment Approach

S. de Groot, T. Mailund, G.A. Lunter and J. Hein

To appear in BMC Bioinformatics

Abstract

Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.

I'll add a link to the paper as soon as it is up at the journal.

What's the problem?

We were trying to figure out selection in viruses where genes can have overlapping reading frames. In such cases, figuring out the neutral substitution rate is a bit of a problem, 'cause a synonymous substitution in one gene can be a non-synonymous substitution in an overlapping gene. Using dN/dS to figure out selection won't work.

Instead we took and extended a method by Hein and Støvlbæk to explicitly model substitutions with selection in overlapping reading frames. We ought to consider the neighbour dependent substitutions you get when you are modelling codon changes (which again is complicated by overlapping genes), but methods for that can be very slow and won't scale to whole genomes. Even virus genomes. Pedersen and Jensen tried that in an MCMC approach. Hobolth's recent approach might have worked -- it is the paper I blogged about a little while back -- but we didn't know about it at the time.

Anyway, we essentially have a method for modelling the evolution of overlapping genes, but we cannot trust the alignment of viruses because they are too divergent, and if we infer an optimal alignment it is almost certainly wrong. An optimal alignment will often have too few substitutions compared to the real alignment.

What did we do?

Since we cannot trust a single alignment, we instead sum over all possible alignments. Using hidden Markov models, we can do that, and at the same time calculate the probability of any single one of them.

We can then consider the substitutions in each of the alignments and weight the observed substitutions with the probability of the alignment. That way, the more likely alignments weigh in more when we consider substitutions than the less likely ones.
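The idea of summing over all alignments and weighting substitutions by alignment probability can be sketched with a toy dynamic programme. This is only an illustration and not the model from the paper: I am assuming a single-state alignment model with made-up weights (`GAP`, and the two values in `pair_weight`), no affine gaps, and no selection or overlapping reading frames.

```python
GAP = 0.05  # hypothetical weight of a column pairing a residue with a gap

def pair_weight(x, y):
    """Hypothetical weight of a column pairing residues x and y."""
    return 0.20 if x == y else 0.01

def prefix_table(a, b, combine):
    """table[i][j] combines the weights of all alignments of a[:i] with b[:j].

    combine=sum gives the forward table (sum over all alignments);
    combine=max gives the weight of the single best alignment.
    """
    n, m = len(a), len(b)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:
                terms.append(table[i - 1][j - 1] * pair_weight(a[i - 1], b[j - 1]))
            if i > 0:
                terms.append(table[i - 1][j] * GAP)
            if j > 0:
                terms.append(table[i][j - 1] * GAP)
            table[i][j] = combine(terms)
    return table

def expected_substitutions(a, b):
    """Expected number of mismatched columns, averaged over all alignments."""
    n, m = len(a), len(b)
    forward = prefix_table(a, b, sum)
    # The weights are symmetric under reversal, so the backward table is the
    # forward table of the reversed sequences.
    backward = prefix_table(a[::-1], b[::-1], sum)
    total = forward[n][m]
    expected = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] != b[j - 1]:
                # Posterior probability that a[i-1] is aligned to b[j-1].
                weight = pair_weight(a[i - 1], b[j - 1])
                expected += forward[i - 1][j - 1] * weight * backward[n - i][m - j] / total
    return expected

a, b = "ACGTTA", "AGTCA"
total = prefix_table(a, b, sum)[len(a)][len(b)]
best = prefix_table(a, b, max)[len(a)][len(b)]
print(best / total, expected_substitutions(a, b))
```

The ratio `best / total` is the posterior probability of the single best alignment under this toy model; when it is well below one, basing everything on that one alignment ignores most of the probability mass.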

It is similar to what Rahul Satija, Lior Pachter and Jotun Hein were doing for phylogenetic footprinting in the neighbouring office at the same time...

Using this approach, we show that we alleviate a systematic bias in using optimal alignments and get better estimates of selection factors.

We only handle pair-wise alignments, but we hack our way around that to use more sequences and get better estimates still. It isn't really the best approach and we should probably try a Gibbs sampler to handle multiple sequence alignments, but that is left for future work...


de Groot, S., Mailund, T., Lunter, G.A., Hein, J. (2008). Investigating Selection on Viruses: A Statistical Alignment Approach. BMC Bioinformatics

Recombination and substitution rates

Thursday, May 22nd, 2008

In a paper from PLoS Genetics earlier this month, Laurent Duret and Peter F. Arndt did a genome wide analysis of the correlation between recombination rate and substitution rate (and bias).

The Impact of Recombination on Nucleotide Substitutions in the Human Genome

Duret, L., Arndt, P.F. PLoS Genetics, 4(5) 2008

Abstract

Unraveling the evolutionary forces responsible for variations of neutral substitution patterns among taxa or along genomes is a major issue for detecting selection within sequences. Mammalian genomes show large-scale regional variations of GC-content (the isochores), but the substitution processes at the origin of this structure are poorly understood. We analyzed the pattern of neutral substitutions in 1 Gb of primate non-coding regions. We show that the GC-content toward which sequences are evolving is strongly negatively correlated to the distance to telomeres and positively correlated to the rate of crossovers (R2 = 47%). This demonstrates that recombination has a major impact on substitution patterns in human, driving the evolution of GC-content. The evolution of GC-content correlates much more strongly with male than with female crossover rate, which rules out selectionist models for the evolution of isochores. This effect of recombination is most probably a consequence of the neutral process of biased gene conversion (BGC) occurring within recombination hotspots. We show that the predictions of this model fit very well with the observed substitution patterns in the human genome. This model notably explains the positive correlation between substitution rate and recombination rate. Theoretical calculations indicate that variations in population size or density in recombination hotspots can have a very strong impact on the evolution of base composition. Furthermore, recombination hotspots can create strong substitution hotspots. This molecular drive affects both coding and non-coding regions. We therefore conclude that along with mutation, selection and drift, BGC is one of the major factors driving genome evolution. Our results also shed light on variations in the rate of crossover relative to non-crossover events, along chromosomes and according to sex, and also on the conservation of hotspot density between human and chimp.

The main point of this paper is the evolution of the GC content of the human genome, which varies significantly between regions of the genome -- the so-called isochore structure.

The evolution of isochores

The content of GC nucleotides varies along the genome, with some regions having very high fractions of GC and some having very low, and this variation is not what we would expect the sequence to look like if the entire genome was evolving under the same neutral process.

Why the genome has this structure has been debated (at times heatedly) over the last two decades. Different explanations have been suggested, including:

  1. The mutation rate is biased and varies along the genome.
  2. Selection prefers high GC content in some regions and not in others.
  3. Gene conversion is biased, preferring to replace AT alleles with GC alleles.

where the last is a theory developed, among others, by the authors of this new paper.

A biased mutation rate is of course a possibility, but it doesn't explain the correlation with the recombination rate, unless recombination is mutagenic or causes this bias.

Selection is the explanation of Bernardi, the discoverer of the isochore structure.

Biased gene conversion is a neutral process that looks a lot like selection. The idea is as follows: there is no particular need for a bias in the mutation process -- the AT to GC and GC to AT substitutions are not necessarily occurring at different rates in GC rich and GC poor regions -- but once a polymorphism exists, gene-conversion between a GC allele and an AT allele will replace the AT allele with the GC allele more often than the other way around.

A consequence of this is that although the mutation rate might not vary along the genome, the substitution rate will, and this substitution rate will be correlated with the recombination rate.

Eyre-Walker and Hurst (2001) give more details on the three theories above.

The case for biased gene conversion

In the PLoS Genetics paper they argue for the biased gene conversion explanation (not surprisingly), and reasonably convincingly, in my opinion, but I am not an expert...

First, they construct a model of sequence evolution that assumes neither time-reversibility nor that the current sequences are at stationarity (both are usually assumed, but might not be true).

From this model, they estimate the substitution rate of the various types of substitutions, and they estimate the equilibrium GC content (called GC* in the paper). In the model, the equilibrium GC content can be different from the current GC content, as stationarity is not assumed, and in general GC* < GC, meaning that the GC content in our genome -- especially in GC rich areas -- is decreasing. Very slowly, though.

This could suggest that whatever mechanism created the GC rich areas of our genome is either no longer in effect, or at least is weaker than it was when the GC rich areas were created.

They then consider the correlation between recombination rates and GC / GC* and notice a significant correlation, with a stronger correlation between recombination rate and GC* than between recombination rate and GC.

This is taken as evidence that it is recombination that drives the direction of mutations toward GC content, rather than base pair composition that determines recombination rate; if the recombination rate were determined by the base pair composition, then the present day GC content should be more correlated with the rate than some far future stationary GC content.

The biased gene conversion model suggests a preference for AT to GC substitutions in regions with high recombination rates, but the strength of this preference depends on the effective population size.

The positive correlation between GC* and the recombination rate supports this, and the present day effective population size (or the present day recombination rate) can explain why the GC structure in the genome is eroding towards a higher AT content in the present day GC rich regions. The GC rich regions of today could have appeared in an ancestor with either a larger effective population size or regionally higher recombination rates; after the reduction since then, the effective population size of present day humans is just not large enough for the biased gene conversion mechanism to keep the GC content at a high level.
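The dependence of the equilibrium GC content on effective population size can be illustrated with a textbook two-state approximation. This is a sketch under my own assumptions, not the model fitted in the paper: there are only two kinds of sites (AT and GC), mutation rates `u` (AT to GC) and `v` (GC to AT) are constant, and biased gene conversion with coefficient `b` acts like genic selection, so fixation probabilities scale with S = 4 * Ne * b. All names and numbers below are hypothetical.

```python
import math

def relative_fixation(S):
    """Fixation probability relative to a neutral mutation, for scaled advantage S."""
    if abs(S) < 1e-12:
        return 1.0  # neutral limit of S / (1 - exp(-S))
    return S / -math.expm1(-S)

def gc_star(u, v, ne, b):
    """Equilibrium GC content in a two-state model with biased gene conversion.

    u:  AT -> GC mutation rate (assumed constant along the genome)
    v:  GC -> AT mutation rate
    ne: effective population size
    b:  conversion bias in favour of GC alleles (0 means no bias)
    """
    S = 4.0 * ne * b
    gain = u * relative_fixation(S)    # AT -> GC substitution rate
    loss = v * relative_fixation(-S)   # GC -> AT substitution rate
    return gain / (gain + loss)

# Same mutation rates and same bias, three effective population sizes.
for ne in (10_000, 50_000, 200_000):
    print(ne, round(gc_star(u=1e-8, v=2e-8, ne=ne, b=1e-6), 3))
```

With the mutation rates and the bias held fixed, GC* in this sketch rises with the effective population size, which is the sense in which a larger ancestral population (or a locally higher recombination rate, entering through `b`) could have built GC rich regions that now slowly erode.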

The case against biased mutation and against selection

The biased mutation explanation is argued against based on the frequency patterns of polymorphisms. If the mutations are biased, but the resulting polymorphisms are selectively neutral, then the frequency of GC and AT derived polymorphisms should be the same. However, GC alleles segregate at higher frequencies than AT alleles.

The first argument against selection is less convincing, I feel, but essentially says: it is hard to imagine why selection should prefer the occasional GC in Mbp long regions with plenty of genes under selection, and even if it did, it probably wouldn't be strong enough to drive the changes in GC content. Well...

The second argument is that selection does not explain why GC content, and especially GC*, should be correlated with the recombination rate. One possible explanation is the Hill-Robertson effect, but then the correlation should be with the population recombination rate; instead, GC* is more strongly correlated with the male recombination rate than with the female one, something Hill-Robertson does not explain.

Conclusion

I read this paper because I was reading up on the correlation between effective population size and recombination rate for a project I'm working on.  I knew about the debate about isochores -- I've chatted with some of the biased gene conversion proponents who have visited BiRC -- but I never really read up on it.

It turns out that several of my colleagues at BiRC are interested in this, so we've discussed the paper over the last two days, and I've had a lot of fun reading my way through some of the references in the paper.

I would recommend it as an introduction to this, but of course it is not a neutral discussion of the three theories.


Duret, L., Arndt, P.F. (2008). The Impact of Recombination on Nucleotide Substitutions in the Human Genome. PLoS Genetics, 4(5), e1000071. DOI: 10.1371/journal.pgen.1000071

Eyre-Walker, A., Hurst, L.D. (2001). The evolution of isochores. Nature Reviews Genetics, 2(7), 549-555. DOI: 10.1038/35080577

I hadn't noticed that...

Wednesday, May 21st, 2008

At Genomicron, Ryan Gregory refuses to participate in ResearchBlogging. Why? Because their slogan is Discussing and Creating Peer-reviewed Research. Discussing is fine, but we are not creating peer-reviewed research by blogging about it.

I hadn't noticed this slogan -- it is only on the large icon and I only use the small icons when I use it in posts about published research -- but it is not something I worry too much about. I like to read discussions about published papers in blogs, but I am not kidding myself that much research is being created there.

I'll still use the icon to highlight when I am discussing a paper -- and not some more general issue.

Another, older complaint is that blogging on peer-reviewed research can be confused with the actual peer-review process:

As a scientist, I take the peer review system very seriously (its several problems notwithstanding) and I do not wish to see blogs perceived as even an approximation of that system. That said, blogs are a useful way to discuss research, and I am happy to see this new development in science communication.

Again, I love reading about paper discussions -- it feels like a global journal club -- but I agree that the actual peer-review process has very little to do with blog discussions of papers!