On segment lengths, going back in time, in the coalescence process. Part 2: The ancestry of two species
Friday, February 27th, 2009Previously I wrote about the coalescence process when tracking a single lineage back in time. Today I will write a bit about the process when dealing with two sequences, and the consequences when comparing the genomes of two species.
I will start out with considering the plain old coalescence process, when we consider two sequences back in time, and at the end move to the process when we consider sequences from two different species.
The ancestry of two sequences
In Wiuf and Hein (1999) they considered the coalescence process with recombination of \(k\) sequences, and how it distributes segments on different most recent common ancestors (MRCAs). Here, I will just consider \(k=2\).
If you simulate the coalescence process for two sequences, you will -- just as for the single sequence -- see sequences break up, due to recombination events, and merge again, due to coalescence events.
We can then ask: When do two nucleotides, one for each initial sequence, reach their MRCA? How many MRCAs do we eventually end up with? How many nucleotides share the same MRCA? How long are the contiguous segments on an MRCA? etc.
Below I have plotted the coalescence times, where each nucleotide pair from the two sequences, reach their MRCA. The simulations were run with \(N_e=10000\), sequence length \(L=100K\) and per nucleotide recombination rate \(r=10^{-8}\).
You cannot see this from the plot, but when you see two segments with the same \(y\) value, they are actually sitting on the same MRCA. That doesn't mean that the segments on the same MRCA are contiguous. Just to the left of the middle of the plot you see two long segments with the same MRCA, but with a short segment between them that coalesced slightly earlier.
You might notice that the smaller the coalescence time, the longer the segments. This is generally the case; the longer you go back in time, the greater that chance that a recombination event breaks up a segment. The broken up segments will, of course, coalesce again, but initially the coalescence events are relatively rate because there are few lineages. The recombination events are also rare, per nucleotide, but since we are tracking a lot of nucleotides back in time, they still occur quite frequently.
Once recombination has broken the initial sequences into many fragments, the coalescence rate is higher and we see a lot of recombinations, but by now the fragments are already much shorter.
The long fragments that coalesce early are of course still part of the recombination/coalescence process, but since today we only track them up to their MRCA we don't see that in the plot.
Wiuf and Hein considered, among other things, the number of MRCAs, the expected length between recombination breakpoints (whether they break up segments on the MRCA or not; it is still possible for two segments to re-coalesce and end up on the same contiguous MRCA segment), and the expected lengths of the contiguous segments on the MRCA.
The expected number of segments
The expected number of segments carrying ancestral, \(S_k\) material was already derived by Griffiths and Marjoram in 1997, and is given by:
\[E[S_k(\rho)] = 1 +\rho\left( 1 - \frac{2}{k(k+1)}\right)\]
where \(k\) is the number of initial lineages (two in the simulation above), and \(\rho=4N_e(L-1)r\) is the scaled recombination rate.
This clearly grows both with the number of lineages \(k\) and the recombination rate \(\rho\), but towards an equilibrium for \(k\) and unbounded for \(\rho\).
The expected number of MRCAs
Wiuf and Hein didn't derive an expression for the expected number of MRCAs, but used simulations to compare the expected number of segments, \(E[S_k(\rho)]\) to the expected number of MRCAs. See the figure on the right.
Since segments can share MRCA, it is not surprising that we tend to see fewer MRCAs than segments, but notice that the ratio does not tend towards zero but a constant that is higher for small \(k\) and lower for larger \(k\), with the difference decreasing as \(k\) increases. So the number of initial lineages does not have a strong effect on the number of MRCAs, but the recombination rate clearly does.
When the scaled recombination rate increases (either because the actual recombination rate increases or because we consider longer sequences), the number of MRCAs increases proportionally with the number of segments (and with \(\rho\)).
This might trouble you a bit, since for sufficiently long sequences, you can end up with far more MRCAs than you have \(N_e\). Not that it is a problem that you have more individuals than \(N_e\) -- after all, the effective population size is just a parameter of our model and is usually quite a bit smaller than the actual population size -- but an important assumption in the coalescence process is that you are modelling the ancestry of a sample much smaller than \(N_e\). With recombination, even for a small initial \(k\) you can end up with too many lineages for this assumption to be true.
Not to worry. As we showed in Davies et al. (2007) this doesn't actually affect local properties like segment lengths (but does, of course, have a large effect on number of MRCAs).
The expected length between recombination points and the expected segment length
Wiuf and Hein derived expressions for the expected length between recombination points:
\[E[r_k\,|\,B_k(0)=1] = \frac{1}{2\sum_{i=1}^{k-1}1/i}\]
and the expected segment lengths at MRCAs:
\[E[s_k\,|\,A_k(0)=1] = \frac{1}{2-4/k(k+1)}\]
sorry for the reuse of \(r\) here, but this is the notation from Wiuf and Hein (1999).
Don't worry too much about the \(B_k(0)=1\) and \(A_k(0)=1\), that is just their way of saying that they are measuring the length to the next recombination point / break point from the a given recombination point / break point.
Notice that both lengths are independent of the recombination rate. This can't be true, of course, and it isn't. The reason that the expressions are independent of the recombination rate is just that the lengths are measured in units of \(\rho/2\), so really the expressions should be
\[E[r_k\,|\,B_k(0)=1] = \frac{1}{2\sum_{i=1}^{k-1}1/i}/\frac{\rho}{2}\]
and
\[E[s_k\,|\,A_k(0)=1] = \frac{1}{2-4/k(k+1)}/\frac{\rho}{2}\]
Below I compare the expected segment lengths, \(E[s_k\,|\,A_k(0)=1]\) (in red) with the mean segment lengths from simulations (in blue).
I am not sure exactly why the fit is so bad for the small recombination rates, but this could be an artefact from the simulation setup since 1) the simulations work with discrete nucleotides while the mathematics works with continuous segments and 2) there is an upper bound to the possible segment lengths in the simulations which means that really long segments are not possible.
The relative error (average-expected)/expected approaches zero as the recombination rate increases.
The ancestry of two species
If we consider two sequences from different species, the situation changes a bit from the above.
If we start out with two sequences in different species, the segments cannot coalesce into MRCAs initially, but still fragment due to the process running within the species branches, as I explained in my previous post.
This means that the two sequences enters the process described above, not as contiguous segments but as several fragments. The number of fragments and the length of fragments depend on how far back in time the speciation event occurred.
Below I plot single simulations of the process where the speciation event is 0, 1, 5, and 10 times \(2N_e\) generations back in time.
In general, we see that the number of fragments increases and the mean fragment length decreases as we push the speciation event back in time, with the effect reaching an equilibrium as the process in the single species approaches equilibrium.
The expression from Wiuf and Hein (1999) will thus tend to overestimate the mean segment length in this setting.
The plots below compare the predicted mean length, using the formula above, and the average segment length for different speciation times:
(the title is wrong on those, the actual speciation times are 0 generations ago, 2Ne generations ago, 6Ne generations ago, and 10Ne generations ago -- but I don't have time to fix that right now as I am late for teaching...)
The comparison is summarised below:
I do not have any expressions for the expected segment lenths in this setting (but I hope that Wiuf can help me work those out).
In my next post in the series, however, I will use a two nucleotide CTMC to approximate some properties about the system, similarly to what I did with the single species case.
References
- Carsten Wiuf, Jotun Hein (1999). The ancestry of a sample of sequences subject to recombination (PDF) Genetics, 151, 1217-1228
- J. L. Davies, F. Simancik, R. Lyngso, T. Mailund, J. Hein (2007). On Recombination-Induced Multiple and Simultaneous Coalescent Events Genetics, 177 (4), 2151-2160 DOI: 10.1534/genetics.107.071126
--
58-77=-19
















