Posts Tagged ‘coalHMM’

On segment lengths, going back in time, in the coalescence process. Part 1: The ancestry of a single species

Monday, February 23rd, 2009

Last week I wrote a piece on gene trees and species trees.  The message there was that incomplete lineage sorting can result in changing topologies when you consider the genealogy along the genomes of closely related species.

This week I am going to write a bit about the coalescence process, and how it distributes segments of different lengths at different coalescence times back in time.  I will only consider the cases of a single or two species, so not how the process leads to different topologies.  That will be the topic of a later post.

I will also explain how you can use a simple Markov model to get estimates for segment lengths.  This is a simplifying approximation to the coalescence process.  The coalescence process is Markov as a process in time, but not as a process along a genome.  Nevertheless, you can get reasonable estimates by assuming that it is Markov along the genomes as well, which is good news considering that Markov models, and hidden Markov models, are frequently used for genome analysis.

The "Markov along the genome" simplification is also at the heart of our CoalHMM framework (Hobolth et al. 2007).

I will link to some papers that consider the process without the "Markov along the sequence" assumption so you can check that out for yourself.

The ancestry of a single sequence

When we analyse sequences, we can never learn about what happened before their most recent common ancestor (MRCA).

Well, never is a strong word, and there is actually a secret assumption about the process being Markov and time homogeneous here, but that is far beyond the topic of this post. What I mean is that while we can use extant sequences to learn about the evolution that lead from their MRCA to the present sequences, we cannot learn about the evolution that lead to the MRCA from the more distant past.

So it might not make a lot of sense to consider the ancestry of a single extant sequence, and how evolves (devolves?) back in time.  Still, this system will be important when we later consider the system with two species, so we consider the single sequence case first.

This case was analysed in Wiuf & Hein (1997).  The setup is as follows.  We consider a single sequence of length \(L\) and track it back in time.  As we do this, we will see it break up into pieces due to recombination events, and see segments join up again due to coalescence events.

The process, running back in time, is a continuous time Markov process where the current state (the number of lineages the initial sequence is distributed on, and where the segments are) determines what can happen as the next event back in time.

For mathematical convenience we always work in time units of \(2N_e\) generations where \(N_e\) is the so called effective population size.  If we do this, the rate with which we see coalescence events, when our initial sequence is spread out on \(k\) separate lineages is \(\binom{k}{2}\) and the rate with which we see recombination events is \(k\cdot\frac{\rho}{2}\) where \(\rho=4N_e(L-1)r\) and \(r\) is the per-nucleotide per generation recombination rate (the rate with which two neighbouring nucleotides are separated by recombination events).

We will see our initial sequence be broken up into smaller and smaller segments on different lineages, for a while, until the process starts to reach an equilibrium and the mean segment length approaches an equilibrium.

A simulation of this process, with \(L=100,000bp\), \(N_e=10,000\) and \(r=10^{-8}\) is shown below (click on the figure to get a larger version):

Approximating the process with a two-state CTMC

If we just consider two nucleotides, so \(L=2\), we have a small two-state continuous time Markov chain (CTMC).  If the nucleotides are linked, they can split up \(\bullet-\bullet \to \bullet\ \ \bullet\) with rate \(2N_e r\), and if the nucleotides are unlinked they can link up \(\bullet\ \ \bullet\to\bullet-\bullet\) with rate 1.  This gives us the rate matrix

\[\mathbf{Q} = \left( \begin{array}{cc}-2N_e r & 2N_e r\\1 & -1\end{array}\right)\]

and to get the probability of moving from linked to unlinked nucleotides, or unlinked to linked nucleotides, in time \(t\) we can calculate the transition matrix \(\mathbf{P}=\exp(t\mathbf{Q})\).

From this system, we can calculate the probability of two neighbouring nucleotides being linked or unlinked as a function of time.  If we assume that they start out being linked, which will be the case if we trace an extant sequence back in time, we get a plot like this:

where we see that the probabilities approaches an equilibrium in a few \(2N_e\) generations.

The probability of the two nucleotides being linked is dropping off at the beginning, but at equilibrium it is still much higher than the probability of being unlinked, so we still expect pretty long segments at equilibrium.

The process is not Markov along the nucleotides, in the sense that the probability of three nucleotides being linked is not the product of the probability of two nucleotides being linked \(P(\bullet-\bullet-\bullet)\neq P(\bullet-\bullet)P(\bullet-\bullet)\).

To see this, we can consider the three nucleotide CTMC.  It has five states: 1) where all three nucleotides are linked, 2) where the first two nucleotides are linked but the third is not, 3) where the first is unlinked from the second and third, who are linked, 4) where all three nucleotides are unlinked, and 5) where the first and third nucleotides are on the same lineage but the second is on a different lineage.

The CTMC for this system has the rate matrix

\[\mathbf{Q}=\left(\begin{array}{ccccc}- & 2N_e r & 2N_e r & 0 & 0\\ 1 & - & 0 & 2N_e r & 0 \\ 1 & 0 & - & 2N_e r & 0 \\ 0 & 1 & 1 & - & 1\\ 1 & 0 & 0 & 2N_e r & -\end{array}\right)\]

where \(-\) in all cases should be read as minus the sum of the other entries in the row.

Below I plot, as a function of time, \(P(\bullet-\bullet)P(\bullet-\bullet) / P(\bullet-\bullet-\bullet)\)

if \(P(\bullet-\bullet-\bullet) = P(\bullet-\bullet)P(\bullet-\bullet)\) we should see a line at 1, but clearly we do not.

If we want to analyse genomes with Markov models, or hidden Markov models, we still need to assume it, though.  Lucky for us, it turns out that assuming the Markov property still allow us to reasonably accuratly calculate the mean segment length as a function of time.

If we assume the Markov property, the segment lengths will be geometrically distributed, so the probability of seeing a segment of length \(k\) is \(P(\bullet-\bullet)^k P(\bullet\ \ \bullet)\) and the mean length will be \(\frac{1}{P(\bullet\ \ \bullet)}\).  In the plot below, the red line shows the mean segment length under this assumption on top of the simulation results I showed earlier.

As you can see, the CTMC predicted mean is pretty close to the mean under the coalescence process.

Consequences of this, when comparing two species

If we compare the genomes of two species we have to keep this process in mind.  The coalescence with recombination process is going on in the individual species lineages back in time, to their ancestral species.

Even if we start out with only two sequences, one from each species, they will be distributed on a number of different segments at the time they reach the ancestral species.  So even if we assume that no further recombination is going on in the ancestral species, we would still see different coalescence times along the sequences, just because the segments that reach the ancestral species will be distributed on different lineages.

The distribution of segment lengths when the two initial sequences make it into the ancestral species will not necessarily be the same.  The coalescence rate was swept under the carpet in the calculations above, because we work in time units of \(2N_e\) generations, but if the two species have different effective population sizes, the equilibrium distribution of the CTMC will be different for the two species, and so will the mean segment length.

The larger the effective population size, the shorter segments we expect (and the longer it will take the CTMC to approach equilibrium).

The process going on in the ancestral species, when we start out with two genomes, is quite a bit more involved, but that is the topic for my next post on the coalescence process.

References

  1. Asger Hobolth, Ole F. Christensen, Thomas Mailund, Mikkel H. Schierup (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model PLoS Genetics, 3 (2)
  2. Carsten Wiuf, Jotun Hein (1997). On the number of ancestors to a DNA sequence (PDF) Genetics, 147, 1459-1468

--

54-75=-21

Can't wait to get my hands on the Neanderthal genome...

Wednesday, February 4th, 2009

This is so cool!

I can't wait to get my hands on the Neanderthal and run it through our CoalHMM analysis!

Yeah, we are pretty swamped with worn on that project already, but this will just be so cool I can't wait.

--

35-58=-23

Eureka!

Friday, January 30th, 2009

I feel both very stupid and very clever this morning.

In our CoalHMM methodology we look for patterns of incomplete lineage sorting from which we can infer population genetics parameters from ancestral species and information about speciation times (as opposed to divergence times).

For a while now, more than a year, we've been working on a different parameterisation of the model that directly lets us work on the parameters of interest.  In the former model (the PLoS Genetics paper I linked to above) we extract the parameters in post processing.  In the new parameterisation we work on them directly and can easier put constraint on them to match prior knowledge to the model, such as present day effective population sizes or speciation times from paleontological data.

One of our post docs has completely re-implemented the model plus made it computationally tractable to scan complete genomes with it.  It's been a lot of work, but we now have 3-4 papers in the pipeline; a couple of which within weeks of submission.  Which is good, since we haven't published anything on our work for this long while.

Anyway, there is an issue in the mathematical model that has bothered me the whole while.  I won't go into details here until the paper is out and you can check it yourself, but it has to do with some of the assumptions made to make the math tractable.  After all, we are reducing the rather complex process of coalescence with recombination to a Markov model along the sequence, which it isn't reall, so some simplifications are necessary.

The process is a Markov chain when running in time, but when running along the sequence it is not since the history of each nucleotide depends on all the others to some degree.  It is just that this relationship drops off with distance, so we can approximate it somewhat with a Markov chain.

One of the simplifications we have to model the process along the sequence as a Markov chain is to assume that the past histories of two nucleotides are independent after they have been unlinked by a recombination.  This isn't unreasonable for short time scales, but for longer time scales it is not quite right either; after a recombination two neighbouring nucleotides can coalesce back together again and the result when seen today is exactly as if they never recombined in the first place.

This way we get an invisible occurence, and if we ignore those we will underestimate recombination rate (because we observe fewer recombinations than really occur).

We know that this happenes, of course, but we sort of hack our way out of it.  It does mean that we will still underestimate recombination for long branches in the species phylogeny (as opposed to the coalescence genealogy where we have other problems as well).

I did some simulations that showed that it wasn't much of a problem on the parameters we use.  We do see a bias that we have to correct for, but based on the simulations we decided that it probably wasn't because of this effect and left it at that.

It still bugs me, so I've been thinking about dealing with the problem but got stuck.

Until yesterday, where I discussed it with Carsten Wiuf who has done a lot of work on coalescence theory.  After I explained the problem to him, he asked if I wasn't just modelling a continuous time Markov chain.  Doh!

Yes, that is exactly what it is, and I knew that. I have even used that approach in a similar project.  I just didn't think about it in this case.

You can easily model the process as such, and though it gets complicated with more lineages it is not too hard to do (as long as you still keep the "Markov along the sequence" assumption).

With just this one insight, I've managed last night to reformulate the problem plus make the math much simpler.  There are still a lot of details to get right, of course, but I am very optimistic about it.

It is just too bad that it is this late in the process.  We don't have time to re-do a years work and still contribute to the data analysis projects we are involved with, but I hope that the new approach can improve the CoalHMM methodology in the future.

--

30-49=-19

Keynote talk on coalescent hidden Markov models and great ape speciation

Monday, June 9th, 2008

Wednesday this week I'm giving a talk at the Danish Society for Computer Science. I was asked to give the keynote talk at this years general assembly before they hand out this year's Best Thesis Award.

I'm tired of talking about my main project -- association mapping -- so I decided to give a talk on our CoalHMM work (although I am only working on that a small percentage of my time).

The slides file is too big to upload to slideshare, but if my flash hack works you should see them below:

You can also get it as a PDF file, a PowerPoint file or a Keynote file, if you want.

The last couple of presentations I've made I have been using the Keynote program on my Mac. I'm really happy with the program. It makes it very easy to put together pretty presentations (if I say so myself) with very little work. Especially its handling of graphics impresses me. Compared to using OpenOffice, as I usually did before getting the Mac, there's essentially no work involved in including graphics. Automatic masking and alpha channels beats modifying images in Gimp any day!

I couldn't really figure out how to put presentations on my homepage, though. Keynote files are really directories, so you cannot just copy them to the homepage directory. I found out, though, that if a Mac downloads a zip'ed Keynote file it will just open it in Keynote, so I guess that is the way to do that.

How do you calibrate the molecular clock?

Thursday, May 29th, 2008

How do you calibrate the molecular clock -- where you need a few known sequence divergence times -- when you only know a few speciation times?

Yesterday at a meeting (I'm not sure I can tell you which meeting; I'm not sure how open it is supposed to be :-/) we discussed the divergence time of human-orangutan and human-macaque. We need the sequence divergence time to calibrate a CoalHMM model for figuring out some speciation and population genetics parameters of ancestral species.

No definitive answer came up at the meeting, but there was a short discussion by email after the meeting. This paper was sent around, where the divergence times were estimated to 25MYA and 13MYA, respectively, although the last of those numbers is actually the calibration point used in the analysis, so it is an assumption more than an estimate.

The problem is, the 13MYA used for the calibration is based on fossil evidence, and as far as I can see, that would make it an estimate for the speciation time between human and orangutan. We need the sequence divergence time. Speciation time and divergence time can vary with millions of years (if the effective population size is large enough).

If 13MYA is the divergence time between human and orangutan, we get a speciation time that is unrealistically recent.  If the divergence time is 18MYA instead, as we assumed in this paper, we would get a speciation time around 12MYA which would match the MBE paper.

But how do you figure out the divergence time needed to calibrate the clock?  Is there any way to get it, rather than the speciation time, from fossil evidence?

For our purposes, I suppose we can just as well work with speciation times for our calibration, but not everyone is using CoalHMMs for their analysis, so how do you deal with this problem?