Posts Tagged ‘comparative genomics’

An MCMC-EM algorithm for statistical analysis of DNA evolution with neighbour-dependent substitution rates

Friday, May 2nd, 2008

ResearchBlogging.org

When estimating evolutionary parameters from DNA sequences, we usually assume that each site evolves independent of all the others. This is a choice of convenience — the mathematics is a lot simpler this way — and doesn’t actually match the data that well. In fact, we know several cases where the neighbouring sites greatly influence the evolutionary process. Just think of codons or CpG di-nucleotides.

The reason we like to consider the sites independent is that it gives us a very manageable model for the sequence evolution. Each site can be modelled as a continuous time finite state Markov model with four states and we can easily derive substitution probabilities from this. When the sites are not independent, then this idea breaks down. We cannot split the sequence into separate parts, so although we can still model the sequence evolution as a finite state continuous Markov model, the model now needs to consider the entire sequence as a whole. Instead of a 4-state Markov model we end up with a 4n-state Markov model, where n is the number of nucleotides. This is no longer all that manageable.

One of my colleagues at BiRC, Asger Hobolth, just published a paper addressing this problem:

A Markov Chain Monte Carlo Expectation Maximization Algorithm for Statistical Analysis of DNA Sequence Evolution with Neighbor-Dependent Substitution Rates

Asger Hobolth

Journal of Computational & Graphical Statistics, Volume 17, Number 1, 2008 , pp. 138-162(25)

Abstract

The evolution of DNA sequences can be described by discrete state continuous time Markov processes on a phylogenetic tree. We consider neighbor-dependent evolutionary models where the instantaneous rate of substitution at a site depends on the states of the neighboring sites. Neighbor-dependent substitution models are analytically intractable and must be analyzed using either approximate or simulation-based methods. We describe statistical inference of neighbor-dependent models using a Markov chain Monte Carlo expectation maximization (MCMC-EM) algorithm. In the MCMC-EM algorithm, the high-dimensional integrals required in the EM algorithm are estimated using MCMC sampling. The MCMC sampler requires simulation of sample paths from a continuous time Markov process, conditional on the beginning and ending states and the paths of the neighboring sites. An exact path sampling algorithm is developed for this purpose.

He did the work while at North Carolina State University, though, and not at BiRC, so I have no inside information and I just read the paper today and I’m in Oxford so I haven’t been able to discuss it with him yet. Anyway, I’ll try to briefly describe the results here, and maybe go back and correct any misunderstandings later, after I’ve discussed with Asger.

The problem with neighbour dependencies

The problem is actually quite easy to understand. If the substitution rate of one nucleotide depends on its neighbours — say CpG changes to TpG or CpA faster than C changes to T or G to A in general — then, when modelling the substitution rate of one of the nucleotides, I need to know what the second is. If the second nucleotide didn’t change over time, then there wouldn’t be a problem, and I could model the substitution rate of the first as a Markov chain and derive a 4×4 substitution probability matrix. But there is no reason to think that one nucleotide can evolve but the other will just stay put, so I need to model the evolution of both. At any point in time, I need to capture their joint state. So I would have to work with a 16×16 rate and probability matrix.

Of course, if there are three nucleotides, then the first and second might be dependent while the second and third are dependent, and that indirectly makes all three dependent of each other. I always need to model the full state, so I have 4n-states and need to work with a 4nx4n rate matrix.

Theoretically, I just need to exponentiate the rate matrix, but in practise you cannot exponentiate a 4nx4n matrix for any realistic sequence length n.

Sampling your way out of the problem

Asger solves this problem by calculating substitution probabilities by sampling instead of matrix exponentiation. Sampling doesn’t get rid of the problem of having to model the state of the entire string over time, of course, so he needs two tricks.

First he derives a way of sampling the evolutionary history of one nucleotide, conditional on the history of its neighbours. Some of the details here are not clear to me yet, so I’ll refrain from saying more — I might add details later after I’ve talked to Asger. Anyway, he derives a way of sampling the evolution of a single nucleotide, conditional on all other histories.

Secondly, and this now is the easy part, he constructs an MCMC to then sample from all histories. This is easy because if you can sample from conditionals like he can, then you can just use the Gibbs sampling machinery and Bob’s your uncle.

Ok, so now he can sample paths between two DNA sequences which means he can calculate expectations over the set of all paths. From these expectations he can then estimate maximum likelihood parameters using the expectation maximisation (EM) machinery.

There’s a few more tricks needed to deal with multiple sequences, but essentially, that is the idea.

No more from me now, I’m off to the pub with the Oxford guys…


Hobolth, A. (2008). A Markov Chain Monte Carlo Expectation Maximization Algorithm for Statistical Analysis of DNA Sequence Evolution with Neighbor-Dependent Substitution Rates. Journal of Computational & Graphical Statistics, 17(1), 138-162. DOI: 10.1198/106186008X289010

What is a genome?

Thursday, May 1st, 2008

At Genomicron there is an interesting discussion about the definition of a genome.  The main point is two different definitions: the genome of a species — the uniqueness of the species compared to other species’ genomes — and the genome of an individual — the unique sequence(s) of an individual including within species variation.

Definition one, the “species genome”, is a reference or “average” genome of the individuals in the species, the genomes in definition two.

It is an interesting read, I suggest you have a look.

Alignment bias in genomics

Tuesday, January 29th, 2008

I have previously written a bit about how optimal alignment algorithms introduce an alignment bias and even done some work on it myself (currently submitted for publication, so I cannot link to it yet). Today I saw a paper in the current issue of Science addressing the same problem.

A summary can be found in

Lining Up to Avoid Bias

Antonis Rokas

Science Vol. 319. no. 5862, pp. 416 – 417

and the full paper (probably requires a subscription) is

Alignment Uncertainty and Genomic Analysis

Karen M. Wong, Marc A. Suchard, and John P. Huelsenbech

Science Vol. 319. no. 5862, pp. 473 – 476

The problem with alignments

I’ve already described the problem in the previous post, where I used the examples from Gerton Lunter’s paper

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

G. A. Lunter

Bioinformatics 2007; DOI: 10.1093/bioinformatics/btm185

although there the focus was on the problems with indels. Of course, without indels there simply isn’t any problem with alignment, so that is not as unreasonable as it might sound.

Essentially, the problem is that we use algorithms to infer optimal alignments and then treat these alignments as absolute truth, ignoring the uncertainty in the inference.

In Wong et al. they compare seven different alignment algorithms and consider typical evolutionary analysis — inference of phylogenies and detecting selection — based on the inferred alignments, and see a large variability of analysis result dependent on inference method.

The solution proposed in Wong et al. is the same as Gerton proposes: statistical alignmentet methods. Quoting Wong et al.:

The problem of alignment uncertainty in genomic studies, identified here, is not a problem of sloppy analysis. Many comparative genomics studies are carefully performed and reasonable in design. However, even carefully designed and carried out analyses can suffer from these types of problems because the methods used in the analysis of the genomic data do not properly accommodate alignment uncertainty in the first place.

In a comparative genomics study, we advocate that alignment be treated as a random variable, and inferences of parameters of interest to the genomicist, such as the amount of nonsynonymous divergence or the phylogeny, consider the different possible alignments in proportion to their probability.

Of course, this is what the statistical alignment people in Oxford have been trying for years and it is not quite as easy as it sounds.


Citations, for Research Blogging:Rokas, A. (2008). GENOMICS: Lining Up to Avoid Bias. Science, 319(5862), 416-417. DOI: 10.1126/science.1153156Wong, K.M., Suchard, M.A., Huelsenbeck, J.P. (2008). Alignment Uncertainty and Genomic Analysis. Science, 319(5862), 473-476. DOI: 10.1126/science.1151532

Probabillistic whole-genome alignments reveal high indel rates in the human and mouse genomes

Wednesday, January 9th, 2008

ResearchBlogging.org

Today, while preparing for a thesis meeting with Ricky, I read Gerton’s paper

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

G. A. Lunter

Bioinformatics 2007; DOI: 10.1093/bioinformatics/btm185

Abstract

Motivation: The two mutation processes that have the largest impact on genome evolution at small scales are substitutions, and sequence insertions and deletions (indels). While the former have been studied extensively, indels have received less attention, and in particular, the problem of inferring indel rates between pairs of divergent sequence remains unsolved. Here, I describe a novel and accurate method for estimating neutral indel rates between divergent pairs of genomes.

Results: Simulations suggest that new method for estimating indel rates is accurate to within 2%, at divergences corresponding to that of human and mouse. Applying the method to these species, I show that indel rates are up to twice higher than is apparent from alignments, and depend strongly on the local G + C content. These results indicate that at these evolutionary distances, the contribution of indels to sequence divergence is much larger than hitherto appreciated. In particular, the ratio of substitution to indel rates between human and mouse appears to be around gamma = 8, rather than the currently accepted value of about gamma = 14.

I knew the results before, from discussions with Gerton, but this is the first time I’ve actually read it.The paper concerns the biases in placing gaps in alignment algorithms (whether probabilistic or parsimony based) and how these will tend to underestimate the number of indels in the true alignment and thus the indel rate.

Gap errors

The problem with gaps is that it is almost always better to have a few extra substitutions compared to a few extra gaps, since indels are less frequent and so the occurrence of them are less likely. When maximising the likelihood of the alignment, we therefore tend to remove gaps that should be there (even unlikely events do occur from time to time) and instead adds substitutions that should be there.

Unbiased estimator

Using statistical alignment and posterior decoding Gerton derives another estimator for the indel rate and shows that this essentially removes the bias. The essential idea is that when the alignment is derived through the statistical alignment algorithm, areas where gaps are misplaced will have a lower posterior certainty. The optimal alignment that is derived is not significantly more likely than several others, so the posterior probability of that exact alignment is less than it would be if placement of the gaps was more certain.

The new estimator is the red line on the plot on the right. The blue is what you would get if you just trusted the most likely alignment. The green line you get by fitting the neutral indel model from his earlier paper Genome-Wide Identification of Human functional DNA Using a Neutral Indel Model Lunter, Ponting and Hein, Plos Computational Biology 2006.

The reason the bias only shows when the substitutation rate is rather high is, of course, that you are less likely to mistake non-homologous sequences as homologous when mis-placing a gap if you have a low sequence identity on the true alignment compared to when you have a high sequence identity, i.e. when you have a low substitution rate.


The citation, for Research Blogging:
Lunter, G. (2007). Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics, 23(13), i289-i296. DOI: 10.1093/bioinformatics/btm185