There’s an interesting paper in the current issue of Bioinformatics that I’ve just finished reading:

Phylogenetic inference under recombination using Bayesian stochastic topology selection

Webb et al.

Bioinformatics 25(2) 197-203

Abstract

Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths.

Results: We demonstrate our approach on simulated data and also to the genome of a suspected HIV recombinant strain as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.

The paper presents a new method for analysing sequences that have undergone recombination.

When sequences have not undergone recombination, a nice methodology for analysing them is the PhyloHMM (PDF). With this method, you have a hidden Markov model where the emission probability of each alignment column is determined by a phylogeny and is usually computed using Felsenstein’s pruning algorithm.
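To give a feel for what that emission computation involves, here is a minimal sketch of the pruning recursion for a single alignment column. It assumes the Jukes-Cantor substitution model and a nested-tuple tree encoding that I made up for illustration; the names `jc69_prob`, `partial_likelihoods` and `site_likelihood`, the taxa and the branch lengths are all hypothetical and are not taken from PhyloHMM or from the paper.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """Jukes-Cantor probability of base i becoming base j along a branch
    of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node, column):
    """Return [L(A), L(C), L(G), L(T)] for the subtree rooted at `node`.

    A leaf is a taxon name (str); an internal node is a tuple of
    (child, branch_length) pairs. `column` maps taxon name -> observed base.
    """
    if isinstance(node, str):
        return [1.0 if b == column[node] else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial_likelihoods(child, column)
        for i, parent_base in enumerate(BASES):
            # Sum over the child's possible bases, then multiply across children.
            result[i] *= sum(jc69_prob(parent_base, child_base, t) * child_l[j]
                             for j, child_base in enumerate(BASES))
    return result

def site_likelihood(tree, column):
    """Likelihood of one column, with a uniform base distribution at the root."""
    return sum(0.25 * l for l in partial_likelihoods(tree, column))

# Hypothetical three-taxon tree: ((human, chimp), gorilla).
inner = (("human", 0.1), ("chimp", 0.1))
tree = ((inner, 0.05), ("gorilla", 0.15))
print(site_likelihood(tree, {"human": "A", "chimp": "A", "gorilla": "G"}))
```

In a PhyloHMM-style model, a value like `site_likelihood(tree, column)` would play the role of the emission probability of one column in a given hidden state.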

When there *is* recombination, the problem is that there is more than one topology for the underlying phylogeny, and if you do not know the topologies you cannot immediately calculate the emission probabilities.

You *can* instead model the unknown topologies as hidden states. This approach was taken by Husmeier and McGuire (2003) and is also the approach we take in our CoalHMM method (Hobolth *et al.* 2007).
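A minimal sketch of the idea, assuming the per-column likelihood under each candidate topology has already been computed: the forward algorithm then sums over all ways the topology can change along the alignment. The function name `forward_log_likelihood`, the single `stay_prob` parameter and the uniform switching between topologies are simplifications of mine, not the transition structure used in any of the cited methods.

```python
import math

def forward_log_likelihood(emissions, stay_prob):
    """Log-likelihood of an alignment under an HMM whose hidden states
    are tree topologies.

    emissions[k][s] is the likelihood of column k under topology s.
    stay_prob is the probability that adjacent columns share a topology;
    the remaining mass is spread evenly over the other topologies
    (an assumption made for this sketch).
    """
    n_states = len(emissions[0])
    if n_states == 1:
        stay, switch = 1.0, 0.0  # a single topology can never switch
    else:
        stay, switch = stay_prob, (1.0 - stay_prob) / (n_states - 1)

    log_lik = 0.0
    # Uniform prior over topologies at the first column.
    f = [e / n_states for e in emissions[0]]
    for k in range(len(emissions)):
        if k > 0:
            total = sum(f)  # equals 1 after the rescaling below
            f = [(stay * f[s] + switch * (total - f[s])) * emissions[k][s]
                 for s in range(n_states)]
        # Rescale to avoid underflow on long alignments.
        scale = sum(f)
        log_lik += math.log(scale)
        f = [x / scale for x in f]
    return log_lik

# Two candidate topologies, three columns of made-up likelihoods.
example = [[0.20, 0.40], [0.50, 0.10], [0.30, 0.30]]
print(forward_log_likelihood(example, stay_prob=0.95))
```

The cost of each step grows with the number of hidden states, which is why the choice of which topologies to include matters so much.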

This approach doesn’t scale, however, since the number of possible topologies grows super-exponentially with the number of species.
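To put numbers on that growth: the count of binary tree topologies on n labelled taxa is a standard result, (2n-5)!! for unrooted trees and (2n-3)!! for rooted ones. The small sketch below just evaluates those formulas; the function names are my own.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2; 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted binary topologies on n labelled taxa: (2n - 5)!!"""
    return 1 if n_taxa < 3 else double_factorial(2 * n_taxa - 5)

def rooted_topologies(n_taxa: int) -> int:
    """Number of rooted binary topologies on n labelled taxa: (2n - 3)!!"""
    return 1 if n_taxa < 2 else double_factorial(2 * n_taxa - 3)

for n in (4, 5, 10, 15):
    print(n, unrooted_topologies(n), rooted_topologies(n))
```

For 4 taxa there are only 3 unrooted topologies, which is why 4-taxa methods can afford one hidden state per topology. For 15 taxa, the size of the mouse data set in the paper, the unrooted count is already 7,905,853,580,625.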

In this paper they solve the problem by using only a few topologies as states in the HMM, while using MCMC to sample, from all possible topologies, which ones to use as states. Ideally the number of topologies should be variable, but that requires a reversible jump MCMC and they haven’t implemented that. Still, it seems to work very well.

I remember discussing the problem with both Alex and Chris when I was last in Oxford, but back then it didn’t work so well, so I am happy to read that they’ve solved the problems. Properly handling recombination and changing topologies is important for accurate sequence analysis.

—

A. Webb, J. M. Hancock, C. C. Holmes (2008). Phylogenetic inference under recombination using Bayesian stochastic topology selection. Bioinformatics, 25 (2), 197-203. DOI: 10.1093/bioinformatics/btn607
