On segment lengths, going back in time, in the coalescence process. Part 2: The ancestry of two species

Previously I wrote about the coalescence process when tracking a single lineage back in time.  Today I will write a bit about the process when dealing with two sequences, and the consequences when comparing the genomes of two species.

I will start out with considering the plain old coalescence process, when we consider two sequences back in time, and at the end move to the process when we consider sequences from two different species.

The ancestry of two sequences

In Wiuf and Hein (1999) they considered the coalescence process with recombination of $$k$$ sequences, and how it distributes segments on different most recent common ancestors (MRCAs).  Here, I will just consider $$k=2$$.

If you simulate the coalescence process for two sequences, you will — just as for the single sequence — see sequences break up, due to recombination events, and merge again, due to coalescence events.

We can then ask: When do two nucleotides, one for each initial sequence, reach their MRCA?  How many MRCAs do we eventually end up with? How many nucleotides share the same MRCA?  How long are the contiguous segments on an MRCA? etc.

Below I have plotted the coalescence times, where each nucleotide pair from the two sequences, reach their MRCA.  The simulations were run with $$N_e=10000$$, sequence length $$L=100K$$ and per nucleotide recombination rate $$r=10^{-8}$$.

You cannot see this from the plot, but when you see two segments with the same $$y$$ value, they are actually sitting on the same MRCA.  That doesn’t mean that the segments on the same MRCA are contiguous.  Just to the left of the middle of the plot you see two long segments with the same MRCA, but with a short segment between them that coalesced slightly earlier.

You might notice that the smaller the coalescence time, the longer the segments.  This is generally the case; the longer you go back in time, the greater that chance that a recombination event breaks up a segment.  The broken up segments will, of course, coalesce again, but initially the coalescence events are relatively rate because there are few lineages.  The recombination events are also rare, per nucleotide, but since we are tracking a lot of nucleotides back in time, they still occur quite frequently.

Once recombination has broken the initial sequences into many fragments, the coalescence rate is higher and we see a lot of recombinations, but by now the fragments are already much shorter.

The long fragments that coalesce early are of course still part of the recombination/coalescence process, but since today we only track them up to their MRCA we don’t see that in the plot.

Wiuf and Hein considered, among other things, the number of MRCAs, the expected length between recombination breakpoints (whether they break up segments on the MRCA or not; it is still possible for two segments to re-coalesce and end up on the same contiguous MRCA segment), and the expected lengths of the contiguous segments on the MRCA.

The expected number of segments

The expected number of segments carrying ancestral, $$S_k$$ material was already derived by Griffiths and Marjoram in 1997, and is given by:

$$!E[S_k(\rho)] = 1 +\rho\left( 1 – \frac{2}{k(k+1)}\right)$$

where $$k$$ is the number of initial lineages (two in the simulation above), and $$\rho=4N_e(L-1)r$$ is the scaled recombination rate.

This clearly grows both with the number of lineages $$k$$ and the recombination rate $$\rho$$, but towards an equilibrium for $$k$$ and unbounded for $$\rho$$.

The expected number of MRCAs

Wiuf and Hein didn’t derive an expression for the expected number of MRCAs, but used simulations to compare the expected number of segments, $$E[S_k(\rho)]$$ to the expected number of MRCAs.  See the figure on the right.

Since segments can share MRCA, it is not surprising that we tend to see fewer MRCAs than segments, but notice that the ratio does not tend towards zero but a constant that is higher for small $$k$$ and lower for larger $$k$$, with the difference decreasing as $$k$$ increases.  So the number of initial lineages does not have a strong effect on the number of MRCAs, but the recombination rate clearly does.

When the scaled recombination rate increases (either because the actual recombination rate increases or because we consider longer sequences), the number of MRCAs increases proportionally with the number of segments (and with $$\rho$$).

This might trouble you a bit, since for sufficiently long sequences, you can end up with far more MRCAs than you have $$N_e$$.  Not that it is a problem that you have more individuals than $$N_e$$ — after all, the effective population size is just a parameter of our model and is usually quite a bit smaller than the actual population size — but an important assumption in the coalescence process is that you are modelling the ancestry of a sample much smaller than $$N_e$$.  With recombination, even for a small initial $$k$$ you can end up with too many lineages for this assumption to be true.

Not to worry.  As we showed in Davies et al. (2007) this doesn’t actually affect local properties like segment lengths (but does, of course, have a large effect on number of MRCAs).

The expected length between recombination points and the expected segment length

Wiuf and Hein derived expressions for the expected length between recombination points:

$$!E[r_k\,|\,B_k(0)=1] = \frac{1}{2\sum_{i=1}^{k-1}1/i}$$

and the expected segment lengths at MRCAs:

$$!E[s_k\,|\,A_k(0)=1] = \frac{1}{2-4/k(k+1)}$$

sorry for the reuse of $$r$$ here, but this is the notation from Wiuf and Hein (1999).

Don’t worry too much about the $$B_k(0)=1$$ and $$A_k(0)=1$$, that is just their way of saying that they are measuring the length to the next recombination point / break point from the a given recombination point / break point.

Notice that both lengths are independent of the recombination rate.  This can’t be true, of course, and it isn’t.  The reason that the expressions are independent of the recombination rate is just that the lengths are measured in units of $$\rho/2$$, so really the expressions should be

$$!E[r_k\,|\,B_k(0)=1] = \frac{1}{2\sum_{i=1}^{k-1}1/i}/\frac{\rho}{2}$$


$$!E[s_k\,|\,A_k(0)=1] = \frac{1}{2-4/k(k+1)}/\frac{\rho}{2}$$

Below I compare the expected segment lengths, $$E[s_k\,|\,A_k(0)=1]$$ (in red) with the mean segment lengths from simulations (in blue).

I am not sure exactly why the fit is so bad for the small recombination rates, but this could be an artefact from the simulation setup since 1) the simulations work with discrete nucleotides while the mathematics works with continuous segments and 2) there is an upper bound to the possible segment lengths in the simulations which means that really long segments are not possible.

The relative error (average-expected)/expected approaches zero as the recombination rate increases.

The ancestry of two species

If we consider two sequences from different species, the situation changes a bit from the above.

If we start out with two sequences in different species, the segments cannot coalesce into MRCAs initially, but still fragment due to the process running within the species branches, as I explained in my previous post.

This means that the two sequences enters the process described above, not as contiguous segments but as several fragments.  The number of fragments and the length of fragments depend on how far back in time the speciation event occurred.

Below I plot single simulations of the process where the speciation event is 0, 1, 5, and 10 times $$2N_e$$ generations back in time.

In general, we see that the number of fragments increases and the mean fragment length decreases as we push the speciation event back in time, with the effect reaching an equilibrium as the process in the single species approaches equilibrium.

The expression from Wiuf and Hein (1999) will thus tend to overestimate the mean segment length in this setting.

The plots below compare the predicted mean length, using the formula above, and the average segment length for different speciation times:

(the title is wrong on those, the actual speciation times are 0 generations ago, 2Ne generations ago, 6Ne generations ago, and 10Ne generations ago — but I don’t have time to fix that right now as I am late for teaching…)

The comparison is summarised below:

I do not have any expressions for the expected segment lenths in this setting (but I hope that Wiuf can help me work those out).

In my next post in the series, however, I will use a two nucleotide CTMC to approximate some properties about the system, similarly to what I did with the single species case.


  1. Carsten Wiuf, Jotun Hein (1999). The ancestry of a sample of sequences subject to recombination (PDF) Genetics, 151, 1217-1228
  2. J. L. Davies, F. Simancik, R. Lyngso, T. Mailund, J. Hein (2007). On Recombination-Induced Multiple and Simultaneous Coalescent Events Genetics, 177 (4), 2151-2160 DOI: 10.1534/genetics.107.071126



We have just published a paper on our Blossoc method in the latest issue of Genetics:

Local Phylogeny Mapping of Quantitative Traits: Higher Accuracy and Better Ranking Than Single-Marker Association in Genomewide Scans

Besenbacher, Mailund and Schierup, Genetics, Vol. 181, 747-753, February 2009, Copyright © 2009 doi:10.1534/genetics.108.092643


We present a new method, termed QBlossoc, for linkage disequilibrium (LD) mapping of genetic variants underlying a quantitative trait. The method uses principles similar to a previously published method, Blossoc, for LD mapping of case/control studies. The method builds local genealogies along the genome and looks for a significant clustering of quantitative trait values in these trees. We analyze its efficiency in terms of localization and ranking of true positives among a large number of negatives and compare the results with single-marker approaches. Simulation results of markers at densities comparable to contemporary genotype chips show that QBlossoc is more accurate in localization of true positives as expected since it uses the additional information of LD between markers simultaneously. More importantly, however, for genomewide surveys, QBlossoc places regions with true positives higher on a ranked list than single-marker approaches, again suggesting that a true signal displays itself more strongly in a set of adjacent markers than a spurious (false) signal. The method is both memory and central processing unit (CPU) efficient. It has been tested on a real data set of height data for 5000 individuals measured at ~317,000 markers and completed analysis within 5 CPU days.

The method works very similarly to our first paper on Blossoc.  Running along the genome, we infer local genealogies and then scores the region according to how well the genealogy explains the phenotype under consideration.

What is new in this paper is the way we score threes when the phenotype is quantitative rather than qualitative (case/control status).

Also just published this month is results from the QTLMAS XII workshop held last year.  As part of this workshop, a dataset with genome-wide genetic data and a quantitative phenotype was simulated, and groups could then compete in mapping the quantitative traits (Crooks et al. 2009).

One group used Blossoc.  No, it wasn’t us, but Ledur et al (2009).  I am pretty proud to learn that Blossoc was considered the best performing method on this data.

Crooks et al’s conclusion:

In this dataset, the best methods for detecting QTL were Blossoc [11] followed by a Bayesian linkage analysis [7], both of which used information from multiple markers to infer QTL genotypes. The two studies that aimed to increase the efficiency of QTL detection by reducing the amount of analysis had lowest power and were not effective in identifying the QTL with the largest effects. Estimates of QTL location were generally very good. There were bigger differences in how well the methods estimated the QTL effects. Here, two of the models that were most accurate used single markers in place of QTL genotype and simultaneously fit a polygenic effect. Although in this case estimates from a single locus model were as accurate as from a multilocus model, fitting multiple loci should allow closely linked QTL to be distinguished. A valuable approach might be to first locate QTL by a multimarker/haplotype method and then fit the closest markers in a multilocus model, to estimate QTL effects. For future such projects, we recommend that participants provide a list of their top-ranked effects, and report confidence intervals for QTL location and effect size estimates. Areas that we suggest for further work include significance thresholds, closely linked QTL and epistatic effects.

Actually, we do not try to estimate QTL effects in our method.  I simply hadn’t thought about that until I read the conclusions from QTLMAS XII.  Well, that leaves a topic for future work…

  1. S. Besenbacher, T. Mailund, M. H. Schierup (2008). Local Phylogeny Mapping of Quantitative Traits: Higher Accuracy and Better Ranking Than Single-Marker Association in Genomewide Scans Genetics, 181 (2), 747-753 DOI: 10.1534/genetics.108.092643
  2. Lucy Crooks, Goutam Sahana, Dirk-Jan de Koning, Mogens Sandø Lund, Örjan Carlborg (2009). Comparison of analyses of the QTLMAS XII common dataset. II: genome-wide association and fine mapping BMC Proceedings
  3. Mônica Corrêa Ledur, Nicolas Navarro, Miguel Pérez-Enciso (2009). Data modeling as a main source of discrepancies in single and multiple marker association methods BMC Proceedings


On segment lengths, going back in time, in the coalescence process. Part 1: The ancestry of a single species

Last week I wrote a piece on gene trees and species trees.  The message there was that incomplete lineage sorting can result in changing topologies when you consider the genealogy along the genomes of closely related species.

This week I am going to write a bit about the coalescence process, and how it distributes segments of different lengths at different coalescence times back in time.  I will only consider the cases of a single or two species, so not how the process leads to different topologies.  That will be the topic of a later post.

I will also explain how you can use a simple Markov model to get estimates for segment lengths.  This is a simplifying approximation to the coalescence process.  The coalescence process is Markov as a process in time, but not as a process along a genome.  Nevertheless, you can get reasonable estimates by assuming that it is Markov along the genomes as well, which is good news considering that Markov models, and hidden Markov models, are frequently used for genome analysis.

The “Markov along the genome” simplification is also at the heart of our CoalHMM framework (Hobolth et al. 2007).

I will link to some papers that consider the process without the “Markov along the sequence” assumption so you can check that out for yourself.

The ancestry of a single sequence

When we analyse sequences, we can never learn about what happened before their most recent common ancestor (MRCA).

Well, never is a strong word, and there is actually a secret assumption about the process being Markov and time homogeneous here, but that is far beyond the topic of this post. What I mean is that while we can use extant sequences to learn about the evolution that lead from their MRCA to the present sequences, we cannot learn about the evolution that lead to the MRCA from the more distant past.

So it might not make a lot of sense to consider the ancestry of a single extant sequence, and how evolves (devolves?) back in time.  Still, this system will be important when we later consider the system with two species, so we consider the single sequence case first.

This case was analysed in Wiuf & Hein (1997).  The setup is as follows.  We consider a single sequence of length $$L$$ and track it back in time.  As we do this, we will see it break up into pieces due to recombination events, and see segments join up again due to coalescence events.

The process, running back in time, is a continuous time Markov process where the current state (the number of lineages the initial sequence is distributed on, and where the segments are) determines what can happen as the next event back in time.

For mathematical convenience we always work in time units of $$2N_e$$ generations where $$N_e$$ is the so called effective population size.  If we do this, the rate with which we see coalescence events, when our initial sequence is spread out on $$k$$ separate lineages is $$\binom{k}{2}$$ and the rate with which we see recombination events is $$k\cdot\frac{\rho}{2}$$ where $$\rho=4N_e(L-1)r$$ and $$r$$ is the per-nucleotide per generation recombination rate (the rate with which two neighbouring nucleotides are separated by recombination events).

We will see our initial sequence be broken up into smaller and smaller segments on different lineages, for a while, until the process starts to reach an equilibrium and the mean segment length approaches an equilibrium.

A simulation of this process, with $$L=100,000bp$$, $$N_e=10,000$$ and $$r=10^{-8}$$ is shown below (click on the figure to get a larger version):

Approximating the process with a two-state CTMC

If we just consider two nucleotides, so $$L=2$$, we have a small two-state continuous time Markov chain (CTMC).  If the nucleotides are linked, they can split up $$\bullet-\bullet \to \bullet\ \ \bullet$$ with rate $$2N_e r$$, and if the nucleotides are unlinked they can link up $$\bullet\ \ \bullet\to\bullet-\bullet$$ with rate 1.  This gives us the rate matrix

$$!\mathbf{Q} = \left( \begin{array}{cc}-2N_e r & 2N_e r\\1 & -1\end{array}\right)$$

and to get the probability of moving from linked to unlinked nucleotides, or unlinked to linked nucleotides, in time $$t$$ we can calculate the transition matrix $$\mathbf{P}=\exp(t\mathbf{Q})$$.

From this system, we can calculate the probability of two neighbouring nucleotides being linked or unlinked as a function of time.  If we assume that they start out being linked, which will be the case if we trace an extant sequence back in time, we get a plot like this:

where we see that the probabilities approaches an equilibrium in a few $$2N_e$$ generations.

The probability of the two nucleotides being linked is dropping off at the beginning, but at equilibrium it is still much higher than the probability of being unlinked, so we still expect pretty long segments at equilibrium.

The process is not Markov along the nucleotides, in the sense that the probability of three nucleotides being linked is not the product of the probability of two nucleotides being linked $$P(\bullet-\bullet-\bullet)\neq P(\bullet-\bullet)P(\bullet-\bullet)$$.

To see this, we can consider the three nucleotide CTMC.  It has five states: 1) where all three nucleotides are linked, 2) where the first two nucleotides are linked but the third is not, 3) where the first is unlinked from the second and third, who are linked, 4) where all three nucleotides are unlinked, and 5) where the first and third nucleotides are on the same lineage but the second is on a different lineage.

The CTMC for this system has the rate matrix

$$!\mathbf{Q}=\left(\begin{array}{ccccc}- & 2N_e r & 2N_e r & 0 & 0\\ 1 & – & 0 & 2N_e r & 0 \\ 1 & 0 & – & 2N_e r & 0 \\ 0 & 1 & 1 & – & 1\\ 1 & 0 & 0 & 2N_e r & -\end{array}\right)$$

where $$-$$ in all cases should be read as minus the sum of the other entries in the row.

Below I plot, as a function of time, $$P(\bullet-\bullet)P(\bullet-\bullet) / P(\bullet-\bullet-\bullet)$$

if $$P(\bullet-\bullet-\bullet) = P(\bullet-\bullet)P(\bullet-\bullet)$$ we should see a line at 1, but clearly we do not.

If we want to analyse genomes with Markov models, or hidden Markov models, we still need to assume it, though.  Lucky for us, it turns out that assuming the Markov property still allow us to reasonably accuratly calculate the mean segment length as a function of time.

If we assume the Markov property, the segment lengths will be geometrically distributed, so the probability of seeing a segment of length $$k$$ is $$P(\bullet-\bullet)^k P(\bullet\ \ \bullet)$$ and the mean length will be $$\frac{1}{P(\bullet\ \ \bullet)}$$.  In the plot below, the red line shows the mean segment length under this assumption on top of the simulation results I showed earlier.

As you can see, the CTMC predicted mean is pretty close to the mean under the coalescence process.

Consequences of this, when comparing two species

If we compare the genomes of two species we have to keep this process in mind.  The coalescence with recombination process is going on in the individual species lineages back in time, to their ancestral species.

Even if we start out with only two sequences, one from each species, they will be distributed on a number of different segments at the time they reach the ancestral species.  So even if we assume that no further recombination is going on in the ancestral species, we would still see different coalescence times along the sequences, just because the segments that reach the ancestral species will be distributed on different lineages.

The distribution of segment lengths when the two initial sequences make it into the ancestral species will not necessarily be the same.  The coalescence rate was swept under the carpet in the calculations above, because we work in time units of $$2N_e$$ generations, but if the two species have different effective population sizes, the equilibrium distribution of the CTMC will be different for the two species, and so will the mean segment length.

The larger the effective population size, the shorter segments we expect (and the longer it will take the CTMC to approach equilibrium).

The process going on in the ancestral species, when we start out with two genomes, is quite a bit more involved, but that is the topic for my next post on the coalescence process.


  1. Asger Hobolth, Ole F. Christensen, Thomas Mailund, Mikkel H. Schierup (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model PLoS Genetics, 3 (2)
  2. Carsten Wiuf, Jotun Hein (1997). On the number of ancestors to a DNA sequence (PDF) Genetics, 147, 1459-1468


This week in the blogs

Here’s my weekly list of interesting blog posts in the past week.  I don’t know, I found this week a bit of a slim picking, maybe because I didn’t pay good enough attention… anyway, at least there were a few posts I enjoyed.



Math and the web


Next gen sequencing