An MCMC-EM algorithm for statistical analysis of DNA evolution with neighbour-dependent substitution ratesFriday, May 2nd, 2008
When estimating evolutionary parameters from DNA sequences, we usually assume that each site evolves independent of all the others. This is a choice of convenience -- the mathematics is a lot simpler this way -- and doesn't actually match the data that well. In fact, we know several cases where the neighbouring sites greatly influence the evolutionary process. Just think of codons or CpG di-nucleotides.
The reason we like to consider the sites independent is that it gives us a very manageable model for the sequence evolution. Each site can be modelled as a continuous time finite state Markov model with four states and we can easily derive substitution probabilities from this. When the sites are not independent, then this idea breaks down. We cannot split the sequence into separate parts, so although we can still model the sequence evolution as a finite state continuous Markov model, the model now needs to consider the entire sequence as a whole. Instead of a 4-state Markov model we end up with a 4n-state Markov model, where n is the number of nucleotides. This is no longer all that manageable.
One of my colleagues at BiRC, Asger Hobolth, just published a paper addressing this problem:
A Markov Chain Monte Carlo Expectation Maximization Algorithm for Statistical Analysis of DNA Sequence Evolution with Neighbor-Dependent Substitution Rates
Journal of Computational & Graphical Statistics, Volume 17, Number 1, 2008 , pp. 138-162(25)
The evolution of DNA sequences can be described by discrete state continuous time Markov processes on a phylogenetic tree. We consider neighbor-dependent evolutionary models where the instantaneous rate of substitution at a site depends on the states of the neighboring sites. Neighbor-dependent substitution models are analytically intractable and must be analyzed using either approximate or simulation-based methods. We describe statistical inference of neighbor-dependent models using a Markov chain Monte Carlo expectation maximization (MCMC-EM) algorithm. In the MCMC-EM algorithm, the high-dimensional integrals required in the EM algorithm are estimated using MCMC sampling. The MCMC sampler requires simulation of sample paths from a continuous time Markov process, conditional on the beginning and ending states and the paths of the neighboring sites. An exact path sampling algorithm is developed for this purpose.
He did the work while at North Carolina State University, though, and not at BiRC, so I have no inside information and I just read the paper today and I'm in Oxford so I haven't been able to discuss it with him yet. Anyway, I'll try to briefly describe the results here, and maybe go back and correct any misunderstandings later, after I've discussed with Asger.
The problem with neighbour dependencies
The problem is actually quite easy to understand. If the substitution rate of one nucleotide depends on its neighbours -- say CpG changes to TpG or CpA faster than C changes to T or G to A in general -- then, when modelling the substitution rate of one of the nucleotides, I need to know what the second is. If the second nucleotide didn't change over time, then there wouldn't be a problem, and I could model the substitution rate of the first as a Markov chain and derive a 4x4 substitution probability matrix. But there is no reason to think that one nucleotide can evolve but the other will just stay put, so I need to model the evolution of both. At any point in time, I need to capture their joint state. So I would have to work with a 16x16 rate and probability matrix.
Of course, if there are three nucleotides, then the first and second might be dependent while the second and third are dependent, and that indirectly makes all three dependent of each other. I always need to model the full state, so I have 4n-states and need to work with a 4nx4n rate matrix.
Theoretically, I just need to exponentiate the rate matrix, but in practise you cannot exponentiate a 4nx4n matrix for any realistic sequence length n.
Sampling your way out of the problem
Asger solves this problem by calculating substitution probabilities by sampling instead of matrix exponentiation. Sampling doesn't get rid of the problem of having to model the state of the entire string over time, of course, so he needs two tricks.
First he derives a way of sampling the evolutionary history of one nucleotide, conditional on the history of its neighbours. Some of the details here are not clear to me yet, so I'll refrain from saying more -- I might add details later after I've talked to Asger. Anyway, he derives a way of sampling the evolution of a single nucleotide, conditional on all other histories.
Secondly, and this now is the easy part, he constructs an MCMC to then sample from all histories. This is easy because if you can sample from conditionals like he can, then you can just use the Gibbs sampling machinery and Bob's your uncle.
Ok, so now he can sample paths between two DNA sequences which means he can calculate expectations over the set of all paths. From these expectations he can then estimate maximum likelihood parameters using the expectation maximisation (EM) machinery.
There's a few more tricks needed to deal with multiple sequences, but essentially, that is the idea.
No more from me now, I'm off to the pub with the Oxford guys...
Hobolth, A. (2008). A Markov Chain Monte Carlo Expectation Maximization Algorithm for Statistical Analysis of DNA Sequence Evolution with Neighbor-Dependent Substitution Rates. Journal of Computational & Graphical Statistics, 17(1), 138-162. DOI: 10.1198/106186008X289010