There’s an interesting paper in the current issue of Bioinformatics that I’ve just finished reading:
Webb et al. Bioinformatics 25(2) 197-203
Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths.
Results: We demonstrate our approach on simulated data and also to the genome of a suspected HIV recombinant strain as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.
The paper presents a new method for analysing sequences that have undergone recombination.
When sequences have not undergone recombination, a nice methodology for analysing them is the PhyloHMM (PDF). With this method, you have a hidden Markov model where the emission probability is determined by a phylogeny, and usually computed using Felsenstein’s pruning algorithm.
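To make the emission probability concrete, here is a minimal sketch of Felsenstein's pruning algorithm for a single alignment column. It is my own illustration, not code from the paper or from any PhyloHMM implementation: I assume the Jukes-Cantor substitution model, uniform base frequencies at the root, and a made-up tree encoding where a leaf is a taxon name and an internal node is a tuple of (child, branch length) pairs.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, t):
    """Jukes-Cantor probability of going from base a to base b along a
    branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial_likelihoods(node, column):
    """Return {base: P(observed leaves below node | node carries base)}.

    node is either a taxon name (leaf) or a tuple of (child, branch_length)
    pairs (internal node); this encoding is an assumption of the sketch.
    column maps taxon names to the observed base in this alignment column.
    """
    if isinstance(node, str):
        # Leaf: probability 1 for the observed base, 0 for the others.
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_l = partial_likelihoods(child, column)
        for b in BASES:
            # Sum over the unknown base at the child end of the branch.
            result[b] *= sum(jc69_prob(b, c, t) * child_l[c] for c in BASES)
    return result


def column_likelihood(tree, column):
    """Likelihood of one column, assuming uniform base frequencies at the root."""
    root = partial_likelihoods(tree, column)
    return sum(0.25 * root[b] for b in BASES)


# Hypothetical three-taxon tree: ((x:0.1, y:0.2):0.05, z:0.3)
tree = ((((("x", 0.1), ("y", 0.2))), 0.05), ("z", 0.3))
print(column_likelihood(tree, {"x": "A", "y": "A", "z": "G"}))
```

In a PhyloHMM each hidden state would have its own tree (or its own rates), and `column_likelihood` would play the role of the emission probability for a column in that state.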
When there is recombination, the problem is that there is more than one topology for the underlying phylogeny, and if you do not know the topologies you cannot immediately calculate the emission probabilities.
The obvious fix is to give the HMM one hidden state per possible topology. This approach doesn’t scale, however, since the number of possible topologies grows super-exponentially with the number of species.
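To get a feel for how fast the number of topologies grows, here is a small sketch that counts unrooted, fully resolved tree topologies on n labelled taxa, which is the double factorial (2n-5)!!. The function name is my own, and counting unrooted binary trees specifically is an assumption on my part; rooted trees give (2n-3)!! instead.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies on n_taxa labelled leaves."""
    if n_taxa < 3:
        return 1
    count = 1
    # The k-th taxon can be attached to any of the 2k-5 edges of an
    # unrooted binary tree on k-1 taxa, so the counts multiply.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


for n in (4, 5, 10, 15):
    print(n, num_unrooted_topologies(n))
```

With 4 taxa there are only 3 topologies, but with 15 taxa (the number of mouse strains in the paper) there are already 7,905,853,580,625.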
In this paper they solve the problem by using only a few topologies as states in the HMM, but sampling over all possible topologies to be used, in an MCMC approach. Ideally the number of topologies should be variable, but that requires a reversible jump MCMC and they haven’t implemented that. Still, it seems to work very well.
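The HMM half of this idea can be sketched with the standard forward algorithm: once a small set of topologies is fixed, the likelihood of the whole alignment, summed over every possible placement of changepoints, comes out of one pass along the columns. This is only my illustration under simplifying assumptions, not the authors' sampler. I assume a uniform initial distribution, a single hypothetical `stay_prob` parameter with uniform switching to the other states, and per-column emission likelihoods that have already been computed for each topology. An MCMC step would propose swapping one of the topologies and compare this likelihood before and after.

```python
def forward_likelihood(emissions, stay_prob):
    """Alignment likelihood summed over all hidden state paths.

    emissions[i][k] is the likelihood of column i under topology k.
    stay_prob is the probability of keeping the same topology from one
    column to the next; otherwise the chain jumps uniformly to one of
    the other topologies (an assumption of this sketch).
    """
    n_states = len(emissions[0])
    if n_states == 1:
        stay_prob = 1.0  # nowhere else to go
        switch = 0.0
    else:
        switch = (1.0 - stay_prob) / (n_states - 1)
    # Uniform initial distribution over topologies.
    forward = [e / n_states for e in emissions[0]]
    for col in emissions[1:]:
        total = sum(forward)
        forward = [
            col[k] * (stay_prob * forward[k] + switch * (total - forward[k]))
            for k in range(n_states)
        ]
    return sum(forward)


# Two hypothetical topologies, two columns.
print(forward_likelihood([[0.5, 0.2], [0.1, 0.4]], stay_prob=0.9))
```

For a real alignment the products underflow quickly, so you would rescale the forward vector at each column or work in log space; I have left that out to keep the sketch short.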
I remember discussing the problem with both Alex and Chris when I was last in Oxford, but back then it didn’t work so well, so I am happy to read that they’ve solved the problems. Properly handling recombination and changing topologies is important for accurate sequence analysis.
A. Webb, J. M. Hancock, C. C. Holmes (2008). Phylogenetic inference under recombination using Bayesian stochastic topology selection. Bioinformatics, 25 (2), 197-203. DOI: 10.1093/bioinformatics/btn607