On segment lengths, going back in time, in the coalescence process. Part 3: Segment lengths of MRCAs of two species

A while ago I wrote the first two posts in this “series” on the coalescent process and how genomic segments behave as a dynamic system back in time.  Read Part 1 (about a single sequence) here and Part 2 (about two sequences) here.  I wanted to write quite a bit more about it, but what happened was that I managed to translate it into a method for population genomics and I was busy working on that and writing a paper about it so I left the blogging for a while.  The paper is now under review, and I have time to write about some of the ideas here again.

As you may recall, we can think of the ancestry of a genomic sequence as a dynamic system that, when run back in time, splits up into different lineages through recombination and then merge again through coalescence events.  This is just the classical coalescence process with recombination, and the result of running it is the so-called Ancestral Recombination Graph, or ARG.

We can use this model both at the population level, but also at the inter-species level (where the ancestries just go back a lot longer in time).

The mathematics is fairly easy to deal with when simulating — it is a relatively simple continuous time Markov chain (CMTC) — but because the state space is rather large (infinite if you consider a continuous segment as the genomic segment), inference is rather difficult as is working out some of the properties of the system.  For example the different properties I wrote about in Part 2.

Markov approximations

In the work we are doing here at BiRC on “ancestral population genomics” we work around this problem by making a simplifying assumption.  We assume that the process is not only a Markov chain in time, but also along the genomic segment.  What this means is that instead of dealing with the enormous — or infinite — state space of the coalescent with recombination, we can consider only neighbouring pairs of nucleotides, work out their dynamics, and extrapolate from there using the Markov assumption.

In Part 1, I described a simple example of this, when considering the dynamics of a single genome going back in time, and described how the Markov assumption is incorrect, but still reasonable in approximating the coalescent process.

In this post I am going to approximate the ancestry of two (haploid) genomes through a CTMC modeling two neighbouring nucleotides from each of the genomes.

The states of the system is shown below:

CTMC statesEssentially the states captures all combinations of a left nucleotide being linked with a right nucleotide (a black line or no line) between the two sequences (top black dots and bottom black dots) and the nucleotides having found a most common recent ancestor (MRCA, the white dots) or not (the black dots).

You can move between the states through recombination events — that remove the links between right and left nucleotides — or coalescence events — that either adds a link between a left and right nucleotide or joins two left of two right nucleotides in their MRCA.

The rate matrix for the system is shown below, where the blank cells are zero entries and the “-” cells contain -1 times the row sums, as usual.

Rate matrix

Here $$C$$ is the coalescent rate and $$R$$ the recombination rate.  Typically, we use a time scale in coalescent theory of $$2 N_e$$ generations per time unit in which case $$C$$ would be 1.  If $$N_e$$ is 10,000 as for humans, then a recombination rate of $$1 cM/Mb$$ would be a rate of $$R=0.004$$ on that time scale.

When considering two genomes back in time, we start in one of the states in $$\Omega_B$$ — the “beginning states”.  If we have two haploid genomes and we start the system at time zero, we would probably start in state 4 where both the genomes are linked, but if we consider an inter-species system we would get some probability distribution over the states in $$\Omega_B$$ from running the single sequence system from Part 1.

The states in $$\Omega_L$$ (or $$\Omega_R$$) corresponds to states where only the left (or only the right) nucleotides have found their MRCA while the right (or left) have not.  The “end states”, $$\Omega_E$$ contains the two states where both left and right nucleotides have found their MRCA, and is essentially just the single sequence system from Part 1.

Running the system back in time, the probability of being in one of these four classes of states evolves as this:

Dynamics of the CTMCBeing in either $$\Omega_L$$ or $$\Omega_R$$ is always somewhat unlikely, since it requires that a recombination has decoupled a left and right nucleotide before finding a MRCA, and it is more likely that the left and right nucleotides find their MRCA in the same coalescent event, moving directly from $$\Omega_B$$ to $$\Omega_E$$.

Deep and shallow coalescent segments

With a finite state CTMC as the one above, we can compute pretty much everything of interest, and there is an extensive theory for how to manipulate these systems.  If we want to say anything about the ancestry of genomes, though, we have to remember that the system only deals with two nucleotides, which in anyone’s book is quite a bit shorter than a genome.

To get from a pair of nucleotides to a full genome, we use the Markov assumption.  When modeling $$n$$ nucleotides, we assume that we can model the ancestry of number $$n$$ conditional on the previous $$n-1$$ by only conditioning on number $$n-1$$, and then we can compute the probabilities of those ancestries using our two-nucleotide CTMC.

For the rest of this post, I’ll give an example of this, and then give other examples in following posts.  When the paper I have under review now is out, I’ll also explain how this CTMC can be used to construct a CoalHMM for population genomics.

Anyway, the example: Let us consider the human-chimp-gorilla trio.  We know that these three species are very closely related, to the point where there is extensive incomplete lineage sorting.  This happens when we have deep coalescences, compared to the speciation times.  If the human and chimp lineages have not found their MRCA before the gorilla speciation (when considering the process back in time), then it is possible that the human coalesces with the gorilla before the chimp, giving us a ((human,gorilla),chimp) genealogy, and similarly for chimp and gorilla coalescing before reaching the human.

Some things about this system are easy to work out from the coalescence process. Working out the probabilities for deep coalescences and incomplete lineage sorting is pretty simple.  The coalescence process has a simple exponential distribution, so figuring out the probability that the human and chimp have not coalesced before the gorilla speciation (as a function of effective population sizes and speciation times) is straightforward.

If the lineages have not coalesced before the gorilla speciation, then there are three possible topologies for the genealogy, ((human,chimp),gorilla), ((human,gorilla),chimp) and ((chimp,gorilla),human), each equally likely, and only one of which, ((human,chimp),gorilla), is congruent with the species tree.

Congruence graphSo, e.g. to work out the probability of seeing a gene tree equal to the species tree, you just need to work out the probability that human and chimp coalesce before or after the speciation time.  The gene tree is congruent if they do, or 1/3 of the times when they do not.

Congruence equationAll in all pretty simple, but it is only telling us how many nucleotides we expect to see congruent with the species tree, not anything about how they are distributed along the genome.

Here, I am going to consider the slightly simpler question of segments above or below the speciation time.  It is straightforward to work out the probability of seeing a nucleotide above $$\Pr(T>\tau)$$ or below $$\Pr(T\leq\tau)$$ the speciation time, but what is the pattern of segments above or below along the genome?

Let’s call a contiguous segment of nucleotides that all coalesce further back than the speciation gorilla time a “deep coalescence segment” and a contiguous segment of nucleotides that all coalesce before the gorilla speciation a “shallow coalescence segment”.

Oh, by the way, before above means closer to the present … it is confusing, I know, but time runs backwards in coalescence theory.

We can use the Markov approximation to work out the mean length of deep and shallow coalescence segments as a function of the speciation time.

Under the Markov approximation, we can argue as follows: For a deep coalescence, consider a left nucleotide in a deep state.  The nucleotide to the right of it is either also deep, or it is shallow.  If it is deep, then the next one can be deep or shallow, and so on.  Under the Markov approximation, the probability of moving from deep to shallow is the same in each such step, and the length of a deep segment is given by the waiting time for moving from deep to shallow, a waiting time that is geometric distributed.

For the CTMC states, we have the left nucleotide in a deep state if we are in one of the states in the classes $$\Omega_L$$ or $$\Omega_B$$ at the speciation time, while we have the left nucleotide in a shallow state if we are in one of the states int he classes $$\Omega_R$$ or $$\Omega_E$$.

If the left nucleotide is in a deep state, then the right is also in a deep state with probability $$\frac{\Pr(\Omega_E)}{\Pr(\Omega_R)+\Pr(\Omega_E)}$$ while the right is in a shallow state with probability $$\frac{\Pr(\Omega_R)}{\Pr(\Omega_R)+\Pr(\Omega_E)}$$.  Similarly for when the left nucleotide is in a shallow state.

The mean length of a deep coalescence segment is then $$1 / \frac{\Pr(\Omega_R)}{\Pr(\Omega_R)+\Pr(\Omega_E)}$$ while the mean length of a shallow coalescence segment is $$ 1 / \frac{\Pr(\Omega_L)}{\Pr(\Omega_L)+\Pr(\Omega_B)}$$.

Very easy to compute with the CTMC, but pretty hard from the coalescence process.

Of course, it is only an approximation, so we should worry a bit about how accurate it is.  I’ve simulated some data sets using the coalescence process, using a human-chimp divergence of 4.5 mya, a present day effective population size of humans and chimps of 10,000 or 50,000 and an effective population size of the human-chimp ancestor of 50,000 or 100,000.  The plots below shows the distribution of deep and shallow segments as the boxplots, the mean length of these as the green bullet, and the expected mean lengths as computed above as the blue line.

10K-50K10K-100K50K-50K50K-100KAll in all, it is a pretty good fit, so the approximation looks okay.

If we assume that the present day Ne of humans and chimps is 20,000 (a bit high for humans and a bit low for chimps, but on average okay) and that the human-chimp ancestral Ne is 50,000 (as often estimated) then with a gorilla branching off 1 milion years before the human-chimp speciation, we expect that segments that coalesce deeper than the speciation are ~116bp while segments that coalesce before are ~76bp.

Of course, this is only on average, and the true picture is more complex, but it does tell us that the genomic relationship between these three species is somewhat complex with a lot of short genomic fragments with MRCA in the human-chimp ancestral species or with MRCA in the shared African ape ancestor.

Ok, I’m not gonna make it…

I set out to post an average of one post per day this year, and I was on track until a month ago.  After that, I pretty much just stopped blogging.  The number of projects I was working on pretty much got out of hand and I was too stressed and it has taken me the Christmas holiday to recover.  Still, I’m so far behind on the blogging, more than 20 behind with this post, that I doubt that I’ll make it for the new year resolution, but at least I am planning on getting back to blogging again now…

Some feedback would be nice…

I just got two research grant applications rejected this week.

Nothing wrong with that, really, the success rate where I applied is less than 10% so I couldn’t really expect to get the grants.  Still, it annoys me a little bit that the letter I got back is just boiler plate.  “We receive more applications than we can fund, and unfortunately we could not fund yours.”

That doesn’t tell me if I was close to getting funded or lightyears from it.  It certainly doesn’t help me improve on the application if I want to try again.

Some feedback on the applications would really be nice.  If not a full reviewer report, then at least a score or something…

Oh well, I still have one grant application under review, this one for the EU Research Council, and here at least I’ve already gotten a review report back, with scores and everything, and it looks really nice, so I have my fingers crossed…