On segment lengths, going back in time, in the coalescence process. Part 3: Segment lengths of MRCAs of two species

A while ago I wrote the first two posts in this “series” on the coalescent process and how genomic segments behave as a dynamic system back in time.  Read Part 1 (about a single sequence) here and Part 2 (about two sequences) here.  I wanted to write quite a bit more about it, but what happened was that I managed to translate it into a method for population genomics and I was busy working on that and writing a paper about it so I left the blogging for a while.  The paper is now under review, and I have time to write about some of the ideas here again.

As you may recall, we can think of the ancestry of a genomic sequence as a dynamic system that, when run back in time, splits up into different lineages through recombination and then merge again through coalescence events.  This is just the classical coalescence process with recombination, and the result of running it is the so-called Ancestral Recombination Graph, or ARG.

We can use this model both at the population level, but also at the inter-species level (where the ancestries just go back a lot longer in time).

The mathematics is fairly easy to deal with when simulating — it is a relatively simple continuous time Markov chain (CMTC) — but because the state space is rather large (infinite if you consider a continuous segment as the genomic segment), inference is rather difficult as is working out some of the properties of the system.  For example the different properties I wrote about in Part 2.

Markov approximations

In the work we are doing here at BiRC on “ancestral population genomics” we work around this problem by making a simplifying assumption.  We assume that the process is not only a Markov chain in time, but also along the genomic segment.  What this means is that instead of dealing with the enormous — or infinite — state space of the coalescent with recombination, we can consider only neighbouring pairs of nucleotides, work out their dynamics, and extrapolate from there using the Markov assumption.

In Part 1, I described a simple example of this, when considering the dynamics of a single genome going back in time, and described how the Markov assumption is incorrect, but still reasonable in approximating the coalescent process.

In this post I am going to approximate the ancestry of two (haploid) genomes through a CTMC modeling two neighbouring nucleotides from each of the genomes.

The states of the system is shown below:

CTMC statesEssentially the states captures all combinations of a left nucleotide being linked with a right nucleotide (a black line or no line) between the two sequences (top black dots and bottom black dots) and the nucleotides having found a most common recent ancestor (MRCA, the white dots) or not (the black dots).

You can move between the states through recombination events — that remove the links between right and left nucleotides — or coalescence events — that either adds a link between a left and right nucleotide or joins two left of two right nucleotides in their MRCA.

The rate matrix for the system is shown below, where the blank cells are zero entries and the “-” cells contain -1 times the row sums, as usual.

Rate matrix

Here $$C$$ is the coalescent rate and $$R$$ the recombination rate.  Typically, we use a time scale in coalescent theory of $$2 N_e$$ generations per time unit in which case $$C$$ would be 1.  If $$N_e$$ is 10,000 as for humans, then a recombination rate of $$1 cM/Mb$$ would be a rate of $$R=0.004$$ on that time scale.

When considering two genomes back in time, we start in one of the states in $$\Omega_B$$ — the “beginning states”.  If we have two haploid genomes and we start the system at time zero, we would probably start in state 4 where both the genomes are linked, but if we consider an inter-species system we would get some probability distribution over the states in $$\Omega_B$$ from running the single sequence system from Part 1.

The states in $$\Omega_L$$ (or $$\Omega_R$$) corresponds to states where only the left (or only the right) nucleotides have found their MRCA while the right (or left) have not.  The “end states”, $$\Omega_E$$ contains the two states where both left and right nucleotides have found their MRCA, and is essentially just the single sequence system from Part 1.

Running the system back in time, the probability of being in one of these four classes of states evolves as this:

Dynamics of the CTMCBeing in either $$\Omega_L$$ or $$\Omega_R$$ is always somewhat unlikely, since it requires that a recombination has decoupled a left and right nucleotide before finding a MRCA, and it is more likely that the left and right nucleotides find their MRCA in the same coalescent event, moving directly from $$\Omega_B$$ to $$\Omega_E$$.

Deep and shallow coalescent segments

With a finite state CTMC as the one above, we can compute pretty much everything of interest, and there is an extensive theory for how to manipulate these systems.  If we want to say anything about the ancestry of genomes, though, we have to remember that the system only deals with two nucleotides, which in anyone’s book is quite a bit shorter than a genome.

To get from a pair of nucleotides to a full genome, we use the Markov assumption.  When modeling $$n$$ nucleotides, we assume that we can model the ancestry of number $$n$$ conditional on the previous $$n-1$$ by only conditioning on number $$n-1$$, and then we can compute the probabilities of those ancestries using our two-nucleotide CTMC.

For the rest of this post, I’ll give an example of this, and then give other examples in following posts.  When the paper I have under review now is out, I’ll also explain how this CTMC can be used to construct a CoalHMM for population genomics.

Anyway, the example: Let us consider the human-chimp-gorilla trio.  We know that these three species are very closely related, to the point where there is extensive incomplete lineage sorting.  This happens when we have deep coalescences, compared to the speciation times.  If the human and chimp lineages have not found their MRCA before the gorilla speciation (when considering the process back in time), then it is possible that the human coalesces with the gorilla before the chimp, giving us a ((human,gorilla),chimp) genealogy, and similarly for chimp and gorilla coalescing before reaching the human.

Some things about this system are easy to work out from the coalescence process. Working out the probabilities for deep coalescences and incomplete lineage sorting is pretty simple.  The coalescence process has a simple exponential distribution, so figuring out the probability that the human and chimp have not coalesced before the gorilla speciation (as a function of effective population sizes and speciation times) is straightforward.

If the lineages have not coalesced before the gorilla speciation, then there are three possible topologies for the genealogy, ((human,chimp),gorilla), ((human,gorilla),chimp) and ((chimp,gorilla),human), each equally likely, and only one of which, ((human,chimp),gorilla), is congruent with the species tree.

Congruence graphSo, e.g. to work out the probability of seeing a gene tree equal to the species tree, you just need to work out the probability that human and chimp coalesce before or after the speciation time.  The gene tree is congruent if they do, or 1/3 of the times when they do not.

Congruence equationAll in all pretty simple, but it is only telling us how many nucleotides we expect to see congruent with the species tree, not anything about how they are distributed along the genome.

Here, I am going to consider the slightly simpler question of segments above or below the speciation time.  It is straightforward to work out the probability of seeing a nucleotide above $$\Pr(T>\tau)$$ or below $$\Pr(T\leq\tau)$$ the speciation time, but what is the pattern of segments above or below along the genome?

Let’s call a contiguous segment of nucleotides that all coalesce further back than the speciation gorilla time a “deep coalescence segment” and a contiguous segment of nucleotides that all coalesce before the gorilla speciation a “shallow coalescence segment”.

Oh, by the way, before above means closer to the present … it is confusing, I know, but time runs backwards in coalescence theory.

We can use the Markov approximation to work out the mean length of deep and shallow coalescence segments as a function of the speciation time.

Under the Markov approximation, we can argue as follows: For a deep coalescence, consider a left nucleotide in a deep state.  The nucleotide to the right of it is either also deep, or it is shallow.  If it is deep, then the next one can be deep or shallow, and so on.  Under the Markov approximation, the probability of moving from deep to shallow is the same in each such step, and the length of a deep segment is given by the waiting time for moving from deep to shallow, a waiting time that is geometric distributed.

For the CTMC states, we have the left nucleotide in a deep state if we are in one of the states in the classes $$\Omega_L$$ or $$\Omega_B$$ at the speciation time, while we have the left nucleotide in a shallow state if we are in one of the states int he classes $$\Omega_R$$ or $$\Omega_E$$.

If the left nucleotide is in a deep state, then the right is also in a deep state with probability $$\frac{\Pr(\Omega_E)}{\Pr(\Omega_R)+\Pr(\Omega_E)}$$ while the right is in a shallow state with probability $$\frac{\Pr(\Omega_R)}{\Pr(\Omega_R)+\Pr(\Omega_E)}$$.  Similarly for when the left nucleotide is in a shallow state.

The mean length of a deep coalescence segment is then $$1 / \frac{\Pr(\Omega_R)}{\Pr(\Omega_R)+\Pr(\Omega_E)}$$ while the mean length of a shallow coalescence segment is $$ 1 / \frac{\Pr(\Omega_L)}{\Pr(\Omega_L)+\Pr(\Omega_B)}$$.

Very easy to compute with the CTMC, but pretty hard from the coalescence process.

Of course, it is only an approximation, so we should worry a bit about how accurate it is.  I’ve simulated some data sets using the coalescence process, using a human-chimp divergence of 4.5 mya, a present day effective population size of humans and chimps of 10,000 or 50,000 and an effective population size of the human-chimp ancestor of 50,000 or 100,000.  The plots below shows the distribution of deep and shallow segments as the boxplots, the mean length of these as the green bullet, and the expected mean lengths as computed above as the blue line.

10K-50K10K-100K50K-50K50K-100KAll in all, it is a pretty good fit, so the approximation looks okay.

If we assume that the present day Ne of humans and chimps is 20,000 (a bit high for humans and a bit low for chimps, but on average okay) and that the human-chimp ancestral Ne is 50,000 (as often estimated) then with a gorilla branching off 1 milion years before the human-chimp speciation, we expect that segments that coalesce deeper than the speciation are ~116bp while segments that coalesce before are ~76bp.

Of course, this is only on average, and the true picture is more complex, but it does tell us that the genomic relationship between these three species is somewhat complex with a lot of short genomic fragments with MRCA in the human-chimp ancestral species or with MRCA in the shared African ape ancestor.

Running CoalHMM on Xgrid

Our Linux grid at work is busy with the final CoalHMM analyses of the orangutan genome we are running there plus some analyses of some fungi genomes, and starting next week we are going to run the gorilla genome through our pipeline there, so it will be busy for a while.

Since we have a bunch of Macs around the offices, I figured it would be worthwhile to set up those to run CoalHMM there so I can get started on the bonobo genome that we also want to analyse, so I spent most of today setting up Xgrid for that.

That turned out to be slightly harder than I thought, but I finally worked it out, and this is what I did:

Getting the damn program to run

I decided to use GridStuffer for the job.  It is a nice GUI for Xgrid that takes care of most of the job management that I don’t really want to handle manually.

For that, you need a “meta job file” that just contains the command lines you want to run.  Naively I thought I could just call the CoalHMM tool and see what problems I would run into and then try to fix them, but not so.

Doing that, the jobs just keep failing, but you won’t get any output you can use for debugging.  I don’t think this is GridStuffer specific but just how Xgrid handles executables it can’t deal with.

First I thought that it was because it tried to run the tool with the absolute path – like Xgrid would do with other executables if you just give it that – and since CoalHMM isn’t installed on all the machines that will of course not work.  So I tried various combinations of options to get it to copy the executable with it (like -files and -dirs and combinations of that) but nothing worked.  And no output I could use to debug it.

Frustrated with this, I wrote a wrapper script that Xgrid at least should be able to run and produce error messages, and that worked fine, showing me that indeed it was problems with the executable not being found.

Ok, that is what I expected, so I played some more with the -files option but with little success until I copied the executable to the working directory and used a relative path for the -files option.  Finally the program was copied to the agents that should run the job.

Not that it ran, though.  It was missing a number of dynamic libraries.  So I tried copying those as well, but with little success until I put all the executable and the dynamic libraries in a local directory and added an -in option to the command line to run it there.  That finally worked.  I’m not sure if it can be achieved in other ways, but I can live with that.  Having all the executable code in a directory to be copied to the agents isn’t that much of a problem.

Copying the data files

The next problem was copying the data to the agent.

I have a single big file with a whole genome alignment that I want to analyse, but for the analysis we typically break this down into “chunks” of around 1Mbp that we analyse independently.

Obviously I don’t want to copy the entire alignment just to analyse one megabase, and since I don’t have a shared file system for these Macs I had to split the data into the chunks I want analysed.

Since putting all I need into a single directory and using the -in option worked for running the program, I took the same approach here.  I split the genome alignment into the chunks I need, put each job into a separate directory together with the executable and fire it all off with the -in option.

So far it seems to be working, but I have only tired a few chunks.  The script I use for splitting up the data (and generating the meta job file in the same go) takes quite a while to run (and I’m running it only on my desktop computer) so it was still running when I left the office.

I have an exam tomorrow so I won’t get back to it until Wednesday, but by then I should have the data packaged nicely up into chunks and have meta job file for GridStuffer.  Then I’ll fire it off and see how it goes.


CoalHMM: Get the source before your friends

Julien has put the source code for our CoalHMM tool up at GNA.  Consider it a beta release, since there are still lots of issues we need to fix before we can make any kind of promises about it, but at least it is there for you to see should you be interested.

You need the most recent Bio++ code to compile it, so get that from CVS.


Multi-core HMMs

I’m working on a project where it looks like I could end up with hidden Markov models with hundreds if not thousands of states, and with transition matrices where all entries are non-zero.  Since, for such models, the running time for just calculating the likelihood (not even optimising it) is $$O(m\cdot n^2)$$ where $$m$$ is the length of the input and $$n$$ is the number of states, this could be a problem.  Especially since I’m working on whole genome data, so $$m$$ can get pretty large as well.

It is part of our CoalHMM work, but I’ll tell you more about that later.  For this post, the actual application isn’t all that important, only the size of the model.

So, anyway, needless to say I need a fast implementation if this is ever going to work.

HMM algorithms

The basic hidden Markov model algorithms are essentially just filling in dynamic programming tables.

The “Forward” algorithm, for example, looks something like

for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        F[i,j] = sum;

There’s a few more details in a real implementation, ’cause this version will have underflows in a few steps, but this is the essential idea.

SIMD speedups

We have developed a small library with these algorithms here at BiRC (mainly implemented by one of my student programmers, Andreas) that speeds up the algorithms using single instruction, multiple data (SIMD) instructions.

The idea here is that we can calculate the sums as above by adding several entries in parallel.  For doubles we can handle two numbers in parallel, for floats we can handle four, so ideally we can speed up the algorithms by a factor of two or four.

We don’t exactly achieve that – there is some overhead involved – but it is not too bad.  The figure below compares the running time with and without SIMD for varying number of states, $$n$$, with a fixed input length $$m=10,000$$.

The number type – float or double – doesn’t matter for the plain algorithm since the same number of operations is used, so I’ve only plotted the time for doubles.

This is all good and well, but with really large HMMs I think we will need even more of a boost.

Multi-core speedups

It is not hard to get your hands on a multi-core machine these days – if anything it is hard to get a single core machine.  Not massively multi-core machines, although I expect to see those in a few years, but at least dual or quad core machines. So I was hoping to get another factor of two or four from running two or four threads.

In general, parallelising the instructions in the program is harder than parallelising the data operations as we do with the SIMD.  For one thing, there is a lot of synchronisation issues when running multiple threads, and this leads to a lot of overhead.

When filling in the dynamic programming table as above, each column in the table depends on the previous column, but each cell in the column is independent from the other cells in the column.  Thus, we should be able to process each cell in a column independently, but synchronise the threads after each column so we ensure that the previous column is always fully processed.

Since the synchronisation isn’t free, we expect that we need a large number of rows in the table to make it worthwhile, but in my project I expect hundreds to thousands of states, so perhaps there is some gain to it.

I decided to try it out, anyway.

Setting up threads and synchronising them and all that used to require some programming effort, but now that we have OpenMP it is very easy to add parallelism to a program.  To parallelise the algorithm above, you simply add a pragma:

for (int i = 0; i < m; ++i) {
#pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        F[i,j] = sum;

and by magic the middle loop is parallelised so the body of it is run in different threads.  Setting up the threads, assigning ranges of the sum to the different threads, synchronising them after the loop, etc. is all taken care of by the compiler.


The results are plotted below (on my dual core iMac):

As is evident, there is a large overhead in the threaded version, so for anything below a few hundred states it is slowing the algorithm down instead of speeding it up, but eventually the parallelism does kick in.

Well, for the version without SIMD at least, which is the one I’ve plotted above.

For double SIMD, it is just worth it for the large number of states

while for floats it isn’t really worthwhile for this range of number of states

Of course, this is just the very first attempt at parallelising the algorithms, so there might still be some tricks to try.  For one thing, I have a feeling I can get rid of some of the thread overhead by adding a pragma outside the outer loop. I recall seeing something like that before but I don’t remember quite how it worked.

Anyway, I’ll play with it some more later.


Widespread genomic signatures of natural selection in hominid evolution

Friday last week, PLoS Genetics published a paper I’ve been waiting to read for a few weeks, since I saw a reference to it in a draft of a review paper I got by email (that paper I’ll tell you all about when it comes out).

The PLoS Genetics paper is this:

Widespread Genomic Signatures of Natural Selection in Hominid Evolution

Graham McVicker, David Gordon, Colleen Davis, and Phil Green

Selection acting on genomic functional elements can be detected by its indirect effects on population diversity at linked neutral sites. To illuminate the selective forces that shaped hominid evolution, we analyzed the genomic distributions of human polymorphisms and sequence differences among five primate species relative to the locations of conserved sequence features. Neutral sequence diversity in human and ancestral hominid populations is substantially reduced near such features, resulting in a surprisingly large genome average diversity reduction due to selection of 19–26% on the autosomes and 12–40% on the X chromosome. The overall trends are broadly consistent with “background selection” or hitchhiking in ancestral populations acting to remove deleterious variants. Average selection is much stronger on exonic (both protein-coding and untranslated) conserved features than non-exonic features. Long term selection, rather than complex speciation scenarios, explains the large intragenomic variation in human/chimpanzee divergence. Our analyses reveal a dominant role for selection in shaping genomic diversity and divergence patterns, clarify hominid evolution, and provide a baseline for investigating specific selective events.

The reason I’ve been waiting for the paper is that it concerns something I am very interested in myself, and something we are working on in our CoalHMM group here at BiRC: detecting selection by detecting variation in effective population size along the genome.

Effective population size

Okay, the concept “effective population size” is a strange beast.  It doesn’t really have anything to do with population size, except in an idealised mathematical model, but is a single parameter that incorporates various different measures such as demographics and selection.

There’s a nice introduction to it in this John Hawks post: Did humans face extinction 70,000 years ago?

As described there, one way of looking at the effective population size is to define it from the average coalescence time of two random individuals in a population.  If we look at it that way, it is clear that selection will affect the effective population size.

A site under selection, if it gets fixed, will do so much faster than a site that is neutral.  A neutral site that gets fixed does so (on average) in time linear in the effective population size, while a site under selection does so in logarithmic time (regardless of whether it is positive or negative selection, surprisingly, but of course if it is negative selection the probability of it getting fixed is smaller).

If we consider a site where mutations occur that are selected against, but these are not fixed, we still see a reduction in the time between two random individuals but for a different reason: those ancestors that were selected against do not have descendants in the present population, so the number of possible ancestors of two random individuals is smaller and when we trace their ancestry back in time, they will find a common ancestor faster.

So in any case, if a site is under selection, we expect the mean time back to a common ancestor — the effective population size — to be reduced.

To muddy the waters a little bit: effective population size also affects selection since selection is stronger if the population size is large but that is a complication best left for another day…


Recombination has an effect on this as well.

A site under selection will have a smaller effective population size, but so will nearby sites.  The reason for this is that neighbour nucleotides are likely to have the same most recent common ancestor — and thus the same divergence — with this probability depending on the recombination distance between them.

Consequently, we expect the effective population size to decrease as we move towards a site under selection, and increase again as we move away from it.

It is this kind of patter that McVicker et al. analyses in this paper.


First they identify conserved genomic regions.  These are the regions that are probably under selection, since selection is one of the forces that will conserve sequences.

They do this by running a phyoHMM on an alignment of mammals (excluding those they will analyse later on to avoid biasing the results).

They then split the genome into two classes: those nucleotides within the 10% of the genome closest to a conserved region, and the 50% furthest away.  In these two classes they look at the level of polymorphism in humans, the divergence between human and chimp, and the number of informative sites supporting a grouping of human with gorilla — with chimp as an outgroup — and those grouping chimp with gorilla — with human as an outgroup.  The latter are signs of deep coalescence resulting in incomplete lineage sorting, and signs of a large effective population size in the human/chimp ancestor.

For all measures, they find that the effective population size seems to be reduced for the 10% closer to conserved regions compared to those 50% farthest away.

Since the measures are essentially all just measures of conservation, really, that isn’t in itself much of an argument.  All it says is that there is a correlation of conservation-ness along the genome.  To compensate for this, they then normalise with the divergence to macaque and to dog.  If it is just a reduction in substitution rate that is correlated, then normalising this way — assuming that the substitution rate doesn’t change dramatically along the genome and along the phylogeny — will alleviate the effect from just the substitution rate.

After normalising, the signal is still there: the polymorphism and divergence is still reduced close to conserved regions.

Again, this doesn’t prove that selection is the cause of this pattern, but the pattern certainly matches what we would expect to see if it was selection that caused it.  The normalisation should eliminate, or at least reduce, effects that are just caused by the substitution rate, so unless we invoke some more exotic explanation for conservation and the patterns along the genome, selection is a valid conclusion.

(A) Ratios calculated using the 10% of neutral sites which are nearest to and the 50% of neutral sites farthest away from conserved segments or exons. (B) The same ratios as (A) but normalized by human/macaque (H/M) divergence to account for mutation rate variation or undetected sites under purifying selection. The distance to the nearest conserved segment or exon was determined using four different measures: physical distance, pedigree-based recombination distance [26], polymorphism-based finescale recombination distance [25] and the background selection parameter, B. B (described in the main text) is not technically a distance measure but incorporates information about the recombination rate and local density of conserved segments. Autosomal human nucleotide diversity was calculated from gene-centric SeattleSNPs PGA/EGP [20], whole-genome Perlegen [19] data, and HapMap phase II data [67]. Divergence was estimated using autosomal human/chimp (H/C), human/macaque (H/M), or human/dog (H/D) genome sequence data. HG and CG sites (where human and gorilla or chimp and gorilla share a nucleotide that differs from the other three species) were calculated using a smaller set of 5-species autosomal data. Repetitive regions were omitted from the Perlegen and HapMap analyses; additional filtering steps are described in the methods. Whiskers are 95% confidence intervals.

Now that selection is concluded to be a plausible explanation for the pattern, they fit the data to a model that explains the variation by background selection. This model shows that selection is stronger near conserved regions than farther away, consistent with the assumption that the pattern is caused by selection.


So what does all this tell us?

For one thing, it tells us that selection is a force we really should keep in mind when analysing genomes.  Yes, yes, we probably already knew that, but the neutrality assumption is so strong in genome analysis that we rarely consider non-neutrality except for the obligatory dN/dS tests on genes.  For anything that is not a gene, we usually analyse the sequences assuming neutrality.  It is a good null model, but completely ignoring selection when analysing genomic sequences should be reconsidered.

I know, I am putting it a bit on an edge here, ’cause people are not just blindly assuming neutrality, but it is a strong null assumption and we really do not like to invoke selection unless there is strong evidence against neutrality.

Another consequence is for sequence divergence.

We estimate species divergence (time of speciation events) from sequence divergence.  More often than not we equate sequence divergence with specises diverergence, but really we shouldn’t.  Even under neutrality this isn’t true, since the coalescence process of sequences is such that the sequences are further apart than the species, but for neutrality at least this patter is random along the genome.

There is still some correlation along the sequence of divergence time, under a neutral coalescence model, but at least this correlation drops off rapidly with (recombination) distance and it is not correlated with other genomic features (except in the sense that the substitution rate depends on these features).

With selection working its magic on a genome scale, the patterns of sequence divergence gets a lot more interesting.

All of this is not really a new insight.  People working with e.g. Drosophila have known this for decades, but it has been ignored in more papers than I care to mention, and perhaps it is time we stop doing this.

McVicker, G., Gordon, D., Davis, C., & Green, P. (2009). Widespread Genomic Signatures of Natural Selection in Hominid Evolution PLoS Genetics, 5 (5) DOI: 10.1371/journal.pgen.1000471