Woohoo, we just got a paper accepted. Although it is at BMC Bioinformatics, it isn’t one of the papers I’ve been bitching about — this one we got very helpful reviews on.
It is work from when I was in Oxford. Saskia de Groot analysed virus genomes for her PhD (see papers here and here), but for viruses that have diverged relatively far, getting good alignments is a bit of a problem, so I suggested we take a statistical alignment approach to integrate over the uncertainty. So we got together with Gerton Lunter, who works on this, and came up with this:
Investigating Selection on Viruses: A Statistical Alignment Approach
To appear in BMC Bioinformatics
Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.
I’ll add a link to the paper as soon as it is up at the journal.
What’s the problem?
We were trying to figure out selection in viruses where genes can have overlapping reading frames. In such cases, figuring out the neutral substitution rate is a bit of a problem, ’cause a synonymous substitution in one gene can be a non-synonymous substitution in an overlapping gene. Using dN/dS to figure out selection won’t work.
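To make the overlap problem concrete, here is a minimal Python sketch. It is not code from the paper; the sequence, the function name effect_in_frame and the way frames are numbered are all made up for illustration. It just looks up the codon a substitution hits in each of two same-strand reading frames, using the standard genetic code.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def effect_in_frame(seq, pos, new_base, frame):
    """Classify a single substitution as seen from one reading frame.

    frame is the offset (0, 1 or 2) of the first codon on the same strand.
    This is a hypothetical helper for illustration only.
    """
    start = frame + 3 * ((pos - frame) // 3)  # start of the codon covering pos
    if pos < frame or start + 3 > len(seq):
        return "outside frame"
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    old_aa = CODON_TABLE[seq[start:start + 3]]
    new_aa = CODON_TABLE[mutated[start:start + 3]]
    return "synonymous" if old_aa == new_aa else "nonsynonymous"


# Toy sequence (an assumption, not real viral data).
# Frame 0 reads ATG CTT GAA, frame 1 reads TGC TTG AAC.
seq = "ATGCTTGAAC"
for frame in (0, 1):
    print(frame, effect_in_frame(seq, 5, "C", frame))
```

In this toy case the T to C change at position 5 turns CTT into CTC in frame 0 (leucine both times, so synonymous), but turns TTG into TCG in frame 1 (leucine to serine, so non-synonymous). The same substitution cannot be counted as neutral for both genes.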
Instead we took and extended a method by Hein and Støvlbæk to explicitly model substitutions with selection in overlapping reading frames. We ought to consider the neighbour-dependent substitutions you get when you are modelling codon changes (which again is complicated by overlapping genes), but methods for that can be very slow and won’t scale to whole genomes. Even virus genomes. Pedersen and Jensen tried that in an MCMC approach. Hobolth’s recent approach might have worked (it is the paper I blogged about a little while back), but we didn’t know about it at the time.
Anyway, we essentially have a method for modelling evolution in overlapping genes, but we cannot trust the alignment of viruses because they are too divergent, and if we infer an optimal alignment it is almost certainly wrong. An optimal alignment will often have too few substitutions compared to the real alignment.
What did we do?
Since we cannot trust a single alignment, we instead sum over all possible alignments. Using hidden Markov models, we can do that, and at the same time calculate the probability of any single one of them.
We can then consider the substitutions in each of the alignments and weight the observed substitutions by the probability of the alignment. That way, the more likely alignments weigh in more than the less likely ones when we consider substitutions.
It is similar to what Rahul Satija, Lior Pachter and Jotun Hein were doing for phylogenetic footprinting in the neighbouring office at the same time…
Using this approach, we show that we alleviate a systematic bias in using optimal alignments and get better estimates of selection factors.
We only handle pairwise alignments, but we hack our way around that to use more sequences and get better estimates still. It isn’t really the best approach and we should probably try a Gibbs sampler to handle multiple sequence alignments, but that is left for future work…
de Groot, S., Mailund, T., Lunter, G.A., Hein, J. (2008). Investigating Selection on Viruses: A Statistical Alignment Approach. BMC Bioinformatics.