Ancestral Population Genomics: The Coalescent Hidden Markov Model Approach
We just got a new paper out – in “Advanced Access” at least – on coalescent hidden Markov models:
Ancestral population genomics: the coalescent hidden Markov model approach
J. Dutheil et al.
Genetics
Abstract
With incomplete lineage sorting (ILS), the genealogy of closely related species differs along their genomes. The amount of ILS depends on population parameters such as the ancestral effective population sizes and the recombination rate, but also on the number of generations between speciation events. We use a hidden Markov model parametrized according to coalescent theory in order to infer the genealogy along a four-species genome alignment of closely related species, and estimate population parameters. We analyze a basic, panmictic demographic model and study its properties using an extensive set of coalescent simulations. We assess the effect of the model assumptions, and demonstrate that the Markov property provides a good approximation to the ancestral recombination graph. Using a too restricted set of possible genealogies, necessary to reduce the computational load, can bias parameter estimates. We propose a simple correction for this bias, and suggest directions for future extensions of the model. We show that the patterns of ILS along a sequence alignment can be recovered efficiently together with the ancestral recombination rate. Finally, we introduce an extension of the basic model that allows for mutation rate heterogeneity, and reanalyze Human-Chimpanzee-Gorilla-Orangutan alignments using the new models. We expect that this framework will prove useful for population genomics and provide exciting insights into genome evolution.
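To make the dependence on these parameters concrete: under the textbook coalescent for a constant-size, panmictic ancestral population (not the full model from the paper), two lineages fail to coalesce in the T generations between the two speciation events with probability exp(-T / 2N). In two thirds of those cases the genealogy ends up incongruent with the species tree. A minimal sketch, where the function names and the example numbers are made up for illustration:

```python
import math

def deep_coalescence_probability(generations_between_speciations, ancestral_ne):
    """Probability that two lineages do NOT coalesce between the speciation events.

    Assumption: constant-size, panmictic diploid ancestral population, so two
    lineages coalesce at rate 1 / (2 * Ne) per generation.
    """
    return math.exp(-generations_between_speciations / (2.0 * ancestral_ne))

def incongruent_genealogy_probability(generations_between_speciations, ancestral_ne):
    """Probability that the local genealogy disagrees with the species tree.

    When all three lineages reach the older ancestral population, the three
    topologies are equally likely, and two of them are incongruent.
    """
    p_deep = deep_coalescence_probability(generations_between_speciations, ancestral_ne)
    return (2.0 / 3.0) * p_deep

# Hypothetical numbers, chosen only to show the arithmetic.
print(incongruent_genealogy_probability(100_000, 50_000))
```

With these made-up numbers T / 2N is 1, so roughly a quarter of the alignment would show an incongruent genealogy. Halving the time between speciation events or doubling the ancestral population size both push that fraction up.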
This paper has been a long time underway. Pretty much Julien’s entire postdoc, actually, but there are some upcoming application papers that still make all this work worthwhile.
There are two main results in the paper.
First, a new parameterisation of the hidden Markov model that expresses the HMM directly in terms of population genetic parameters such as effective population size and recombination rate. This is mainly the work of Ganesh and Marcy, whom we collaborated with on this paper. In our 2007 paper, we parameterised the model just like any hidden Markov model but then extracted population genetics parameters from the estimated transition and emission probabilities; in this paper we can do maximum likelihood estimation directly on the parameters of the coalescent process.
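The idea of estimating a few underlying parameters rather than a free transition matrix can be illustrated with a toy example. This is a minimal sketch and not the parameterisation from the paper: the single `switch_prob` parameter, the per-state mismatch probabilities, and all function names are hypothetical stand-ins. The point is only that the HMM matrices are computed from the parameters, and the likelihood from the forward algorithm is then maximised over those parameters.

```python
import math

def transition_matrix(switch_prob, n_states):
    """Toy transitions (hypothetical): stay with probability 1 - switch_prob,
    otherwise jump uniformly to one of the other states. Needs n_states >= 2."""
    off_diagonal = switch_prob / (n_states - 1)
    return [[1.0 - switch_prob if i == j else off_diagonal
             for j in range(n_states)] for i in range(n_states)]

def log_likelihood(observations, switch_prob, mismatch_probs):
    """Log-likelihood of a 0/1 sequence using the scaled forward algorithm.

    observations: list of 0 (sites agree) or 1 (sites differ)
    mismatch_probs: probability of observing a 1 in each hidden state
    """
    n = len(mismatch_probs)
    trans = transition_matrix(switch_prob, n)

    def emit(state, obs):
        p = mismatch_probs[state]
        return p if obs == 1 else 1.0 - p

    # Uniform initial distribution over hidden states (an assumption).
    forward = [emit(s, observations[0]) / n for s in range(n)]
    scale = sum(forward)
    loglik = math.log(scale)
    forward = [f / scale for f in forward]

    for obs in observations[1:]:
        forward = [emit(j, obs) * sum(forward[i] * trans[i][j] for i in range(n))
                   for j in range(n)]
        scale = sum(forward)          # rescale to avoid numerical underflow
        loglik += math.log(scale)
        forward = [f / scale for f in forward]
    return loglik

def grid_mle(observations, mismatch_probs, grid):
    """Crude maximum likelihood: pick the best switch_prob from a grid."""
    return max(grid, key=lambda r: log_likelihood(observations, r, mismatch_probs))
```

A real analysis would use a proper numerical optimiser over several parameters at once; the grid search is only there to keep the sketch short.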
Second, we have a much more detailed simulation validation of the model. From extensive simulations we have validated the model and discovered its strengths and weaknesses. Among the latter, the various biases in parameter estimates are especially important. We discovered some systematic biases in estimates of speciation time and, especially, recombination rate. The latter we didn’t even consider in the first paper, but the bias in the former probably means that the speciation time estimate of human and chimp in our 2007 paper was biased and somewhat more recent than the real speciation time.
Julien came up with a simulation approach to alleviate the biases, and although this approach is somewhat time consuming it does seem to improve the estimates.
While the paper was in review, we identified some of the sources of the biases, and we now have a model that looks much less biased than the one in the paper. It doesn’t completely remove the bias on the recombination rate, but is much better at estimating the other parameters. It is based on the continuous time Markov models I have described here and here, but results are still somewhat preliminary and the model can only deal with two genomes and not incomplete lineage sorting, so it is still a long way from handling data like that in this new Genetics paper.
We have a draft of a paper describing the new method, and some results for the orangutan genome project that will probably be out later this year or early next year, so I will not go into details about it here. There are still a lot of details to work out on that model before we know exactly how well it performs compared to the old one.
Anyway, the work we did on this paper told us a lot about the coalescent hidden Markov model approach. Mainly good stuff at that, the biases aside. It is a very fast method – fast enough to analyze full genomes – and is pretty good at estimating speciation times. Estimating speciation times is somewhat problematic in general when the average genomic divergence time differs significantly from the speciation time due to large effective population sizes, so the new model should be much better at it than plain old “molecular clock” estimates.
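The gap between divergence time and speciation time is easy to put numbers on. Under the textbook coalescent with a constant-size, panmictic ancestral population, two lineages that enter the ancestral population take 2N generations on average to find their common ancestor, so the expected divergence time is the speciation time plus 2N. A minimal sketch, where the function names and the example numbers are hypothetical:

```python
def expected_divergence_generations(speciation_generations, ancestral_ne):
    """Expected time (in generations) back to the common ancestor of two
    sequences sampled from sister species.

    Assumption: constant-size, panmictic diploid ancestral population, where
    the expected coalescence time of two lineages is 2 * Ne generations.
    """
    return speciation_generations + 2.0 * ancestral_ne

def relative_overestimate(speciation_generations, ancestral_ne):
    """How much a naive estimate that reads the average divergence time as the
    speciation time overshoots, as a fraction of the true speciation time."""
    return 2.0 * ancestral_ne / speciation_generations

# Made-up numbers: speciation 200,000 generations ago, ancestral Ne of 50,000.
print(expected_divergence_generations(200_000, 50_000))  # 300000.0
print(relative_overestimate(200_000, 50_000))            # 0.5
```

With these made-up numbers, reading the average divergence as the speciation time would overshoot by half, which is why separating the two matters when the ancestral population was large.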
–
- Dutheil, J., Ganapathy, G., Hobolth, A., Mailund, T., Uyenoyama, M., & Schierup, M. (2009). Ancestral Population Genomics: The Coalescent Hidden Markov Model Approach. Genetics. DOI: 10.1534/genetics.109.103010
- Hobolth, A., Christensen, O., Mailund, T., & Schierup, M. (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model. PLoS Genetics, 3(2). DOI: 10.1371/journal.pgen.0030007