## On segment lengths, going back in time, in the coalescence process. Part 1: The ancestry of a single species

Last week I wrote a piece on gene trees and species trees. The message there was that incomplete lineage sorting can result in changing topologies when you consider the genealogy along the genomes of closely related species.

This week I am going to write a bit about the coalescence process, and how it distributes segments of different lengths at different coalescence times back in time. I will only consider the cases of a single or two species, so not how the process leads to different topologies. That will be the topic of a later post.

I will also explain how you can use a simple Markov model to get estimates for segment lengths. This is a simplifying approximation to the coalescence process. The coalescence process is Markov as a process in time, but not as a process along a genome. Nevertheless, you can get reasonable estimates by assuming that it is Markov along the genomes as well, which is good news considering that Markov models, and hidden Markov models, are frequently used for genome analysis.

The "Markov along the genome" simplification is also at the heart of our CoalHMM framework (Hobolth *et al.* 2007).

I will link to some papers that consider the process without the "Markov along the sequence" assumption so you can check that out for yourself.

### The ancestry of a single sequence

When we analyse sequences, we can never learn about what happened before their most recent common ancestor (MRCA).

Well, *never* is a strong word, and there is actually a secret assumption about the process being Markov and time homogeneous here, but that is far beyond the topic of this post. What I mean is that while we can use extant sequences to learn about the evolution that lead from their MRCA to the present sequences, we cannot learn about the evolution that lead to the MRCA from the more distant past.

So it might not make a lot of sense to consider the ancestry of a single extant sequence, and how evolves (devolves?) back in time. Still, this system will be important when we later consider the system with two species, so we consider the single sequence case first.

This case was analysed in Wiuf & Hein (1997). The setup is as follows. We consider a single sequence of length \(L\) and track it back in time. As we do this, we will see it break up into pieces due to recombination events, and see segments join up again due to coalescence events.

The process, running back in time, is a continuous time Markov process where the current state (the number of lineages the initial sequence is distributed on, and where the segments are) determines what can happen as the next event back in time.

For mathematical convenience we always work in time units of \(2N_e\) generations where \(N_e\) is the so called *effective population size*. If we do this, the rate with which we see coalescence events, when our initial sequence is spread out on \(k\) separate lineages is \(\binom{k}{2}\) and the rate with which we see recombination events is \(k\cdot\frac{\rho}{2}\) where \(\rho=4N_e(L-1)r\) and \(r\) is the per-nucleotide per generation recombination rate (the rate with which two neighbouring nucleotides are separated by recombination events).

We will see our initial sequence be broken up into smaller and smaller segments on different lineages, for a while, until the process starts to reach an equilibrium and the mean segment length approaches an equilibrium.

A simulation of this process, with \(L=100,000bp\), \(N_e=10,000\) and \(r=10^{-8}\) is shown below (click on the figure to get a larger version):

### Approximating the process with a two-state CTMC

If we just consider two nucleotides, so \(L=2\), we have a small two-state continuous time Markov chain (CTMC). If the nucleotides are linked, they can split up \(\bullet-\bullet \to \bullet\ \ \bullet\) with rate \(2N_e r\), and if the nucleotides are unlinked they can link up \(\bullet\ \ \bullet\to\bullet-\bullet\) with rate 1. This gives us the rate matrix

\[\mathbf{Q} = \left( \begin{array}{cc}-2N_e r & 2N_e r\\1 & -1\end{array}\right)\]

and to get the probability of moving from linked to unlinked nucleotides, or unlinked to linked nucleotides, in time \(t\) we can calculate the transition matrix \(\mathbf{P}=\exp(t\mathbf{Q})\).

From this system, we can calculate the probability of two neighbouring nucleotides being linked or unlinked as a function of time. If we assume that they start out being linked, which will be the case if we trace an extant sequence back in time, we get a plot like this:

where we see that the probabilities approaches an equilibrium in a few \(2N_e\) generations.

The probability of the two nucleotides being linked is dropping off at the beginning, but at equilibrium it is still much higher than the probability of being unlinked, so we still expect pretty long segments at equilibrium.

The process is not Markov along the nucleotides, in the sense that the probability of three nucleotides being linked is not the product of the probability of two nucleotides being linked \(P(\bullet-\bullet-\bullet)\neq P(\bullet-\bullet)P(\bullet-\bullet)\).

To see this, we can consider the three nucleotide CTMC. It has five states: 1) where all three nucleotides are linked, 2) where the first two nucleotides are linked but the third is not, 3) where the first is unlinked from the second and third, who are linked, 4) where all three nucleotides are unlinked, and 5) where the first and third nucleotides are on the same lineage but the second is on a different lineage.

The CTMC for this system has the rate matrix

\[\mathbf{Q}=\left(\begin{array}{ccccc}- & 2N_e r & 2N_e r & 0 & 0\\ 1 & - & 0 & 2N_e r & 0 \\ 1 & 0 & - & 2N_e r & 0 \\ 0 & 1 & 1 & - & 1\\ 1 & 0 & 0 & 2N_e r & -\end{array}\right)\]

where \(-\) in all cases should be read as minus the sum of the other entries in the row.

Below I plot, as a function of time, \(P(\bullet-\bullet)P(\bullet-\bullet) / P(\bullet-\bullet-\bullet)\)

if \(P(\bullet-\bullet-\bullet) = P(\bullet-\bullet)P(\bullet-\bullet)\) we should see a line at 1, but clearly we do not.

If we want to analyse genomes with Markov models, or hidden Markov models, we still need to assume it, though. Lucky for us, it turns out that assuming the Markov property still allow us to reasonably accuratly calculate the mean segment length as a function of time.

If we assume the Markov property, the segment lengths will be geometrically distributed, so the probability of seeing a segment of length \(k\) is \(P(\bullet-\bullet)^k P(\bullet\ \ \bullet)\) and the mean length will be \(\frac{1}{P(\bullet\ \ \bullet)}\). In the plot below, the red line shows the mean segment length under this assumption on top of the simulation results I showed earlier.

As you can see, the CTMC predicted mean is pretty close to the mean under the coalescence process.

### Consequences of this, when comparing two species

If we compare the genomes of two species we have to keep this process in mind. The coalescence with recombination process is going on in the individual species lineages back in time, to their ancestral species.

Even if we start out with only two sequences, one from each species, they will be distributed on a number of different segments at the time they reach the ancestral species. So even if we assume that no further recombination is going on in the ancestral species, we would still see different coalescence times along the sequences, just because the segments that reach the ancestral species will be distributed on different lineages.

The distribution of segment lengths when the two initial sequences make it into the ancestral species will not necessarily be the same. The coalescence rate was swept under the carpet in the calculations above, because we work in time units of \(2N_e\) generations, but if the two species have different effective population sizes, the equilibrium distribution of the CTMC will be different for the two species, and so will the mean segment length.

The larger the effective population size, the shorter segments we expect (and the longer it will take the CTMC to approach equilibrium).

The process going on in the ancestral species, when we start out with two genomes, is quite a bit more involved, but that is the topic for my next post on the coalescence process.

### References

- Asger Hobolth, Ole F. Christensen, Thomas Mailund, Mikkel H. Schierup (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model PLoS Genetics, 3 (2)
- Carsten Wiuf, Jotun Hein (1997). On the number of ancestors to a DNA sequence (PDF) Genetics, 147, 1459-1468

--

54-75=-21

February 27th, 2009 at 9:25 am

[...] I wrote about the coalescence process when tracking a single lineage back in time. Today I will write a bit about the process when dealing with two sequences, and the consequences [...]