We have just published a new paper in PLoS Genetics:
A New Isolation with Migration Model along Complete Genomes Infers Very Different Divergence Processes among Closely Related Great Ape Species
We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousands years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.
It is mainly a methods paper; we could have done a lot more interesting stuff with the data analysis but since the main message of the paper, in our view, is the new inference model we really only did proof-of-concept analysis. So don't expect exciting new biological insight (well, there is one thing that's a bit exciting, but I'll get back to that).
It's the modelling that is the main result.
A bit of background for CoalHMMs
We've been working with inferring parameters of speciation - split times, effective population sizes, amount of incomplete lineage sorting - using the sequential Markov coalescence model for several years now. We just haven't called our models SMC models, mainly because we didn't think in terms of actually modelling the coalescence process in our first paper in 2007. There we just had a hidden Markov model and figured out how to get coalescence parameters out of its transition matrix.
We did parameterise the model directly from the coalescence process in 2009 - in the process figuring out that some of the symmetries we thought we needed the first model shouldn't be there, while others should. I still prefer to distinguish between the population genetics model, SMC, and the inference model model, CoalHMM (or SMCHMM maybe, but that is harder to pronounce).
SMC (and SMC' for that matter) assumes that there is at most one recombination between any two neighbouring genealogies. This is mathematically correct in those models since they work with continuous space. No two recombinations can ever hit the same point there.
It is also perfectly sensible to assume that only one recombination occurs between any pair of nucleotides when having discrete space, in most cases, since the recombination rate is very small compared to the coalescence rate.
We were comparing genomes from different species, though, where the divergence between two genomes can be several millions of years, and there it turns out that you introduce a large bias in estimating the recombination rate if you don't take this into account. Although the largest problem with the model from 2009 was not actually multiple recombinations but not allowing "back-coalescence" from a lineage into itself (which is the difference between the SMC and SMC' models).
Not realising that SMC' would probably solved most of the problems we had, and probably also because I didn't want to have to derive the equations needed to use it, we wanted a model that didn't make any assumptions about the number of recombinations and where lineages would coalesce. We of course need the Markov assumption to make any inference on a genome scale feasible, but aside from that we didn't want any other assumptions.
I was playing around with this for several months until, in a discussion with Carsten Wiuf, I had a Eureka moment. Once you accept that you have the Markov assumption, you need only worry about pairs of neighbouring nucleotides, and although the state space of all possible ancestral configurations is large it is not impossible to compute and manipulate.
You can build the state space for ancestral configurations, it is finite if you only have two nucleotides (it is as long as you have discrete space, really), and this is just one big continuous time Markov chain. You can easily compute the probability of any run of this CTMC, and if you discretise time you get a finite number of runs (or equivalence classes of runs, really, but who cares?).
As a side note I should say that if I had know the literature better I should have known that this is the way to do it. Simonsen and Churchill did it already back in 1997 and Slatkin and Pollack something very similar to what we implemented in 2006. I just didn't know about this until long after we published our method. Anyway...
Getting a transition matrix for the HMM from this CTMC is relatively straightforward when you only look at two genomes, and we published a model doing this last year. It is very similar to PSMC in that it uses discrete time intervals as hidden states and makes inference in the coalescence process by building the transition matrix from a model of the coalescence process.
Explicitly modelling the ARG for two nucleotides, however, makes it very easy to model any kind of demographic scenarios. If you can model it with the coalescence process, you can also model it with the coalescence process limited to two nucleotides, and then use this framework to translate that into an HMM for inference.
In theory, at least. There are some practical challenges in automatically building the CTMC from a demographic scenarios - and you definitely do not want to do that by hand! - and in picking out which runs of the CTMC are equivalent and how they are mapped into the transition matrix of the HMM.
None of it is really that complicated in hindsight. There were just a bunch if ideas that were only simple once we worked them out. Like using a coloured Petri net to specify the models or using strongly connected component graphs to get the transition matrix ... none of it is stuff you really brute force your way to the solution, but it is simple enough once you have the idea. We published how we did it in a paper earlier this year.
We've had the framework coded up for almost a year now, but it has taken surprisingly long to bugfix and validate. I'm always surprised at how long it takes from having something that works in theory until you've validated how it works in various scenarios where your assumptions are likely to be violated and we did a lot of that for this paper. You can read all about it in the supplemental material.
An isolation-with-migration CoalHMM
The model we present in this new paper is probably the simplest extension you can make in the framework to the model we published last year.
The model from last year (well, it is a bit older than that really, but it was a companion for the orangutan genome paper and so wasn't published until that paper came out last year) is a simple isolation model. Given one genome from each of two species, it uses the distribution of "time to the most recent common ancestor" (TMRCA) to infer the ancestral effective population size and the time since the speciation.
In the new model, we add to that model a period of gene flow after the initial split in two populations until it eventually stops. This means that we work with three CTMCs in the model. One from the present day back to when gene flow is possible (looking backwards in time, as we do for coalescence models, confusing as that can be), then another where gene flow is possible, and then the final one where we have one single panmictic population.
Connecting these three CTMCs the right way, we get a single (non-homogeneous) CTMC, and from the SCC graph of this we get the transition matrix for the HMM.
Where the model really gets its information for inference is the distribution of TMRCA both in time and in space. The distribution for the TMRCA is very different in an isolation model and an isolation-with-migrath model. For the time distribution Rosenberg and Feldman wrote a very nice review in 2002, and we looked a bit at it in a paper last year. Dutheil and Hobolth (co-authors on most of the CoalHMM papers, including the new one) also discusses it a bit in a chapter from this year. But it is not just the time distribution that is different between the two models, also how the TRMCA is distributed along a sequence alignment, and our model uses both the temporal and spatial distribution in its inference.
Using the model, we can infer when the ancestral population initially split up, how long gene flow still occurred, and how much gene flow there was. There are some symmetries in the model that makes the direction of gene flow non-identifiable, so we only estimate symmetric gene flow, though.
Comparing models with and without gene flow, we can also determine which fits the data best, and in that way determine if there was a clean allopatric split, or if there was a period of gene flow.
We cannot, though, determine if the gene flow was continuous over that time interval, or if it was a series of isolations and admixture events ... whether we can get to that with the framework we have is something we are currently working on. It looks a bit like we can see different levels of gene flow in different time intervals with it, but results are very preliminary still.
Speciation in the Great Apes
As you probably know, there are two species in three of the four great ape species; humans are the only lonely ones. For Pan there are common chimpanzees and bonobos, for the gorillas the eastern and western gorillas are different species, and for orang-utans the Bornean and the Summatran orang-utans are different species.
From earlier studies we already have a good idea about whether those speciations were allopatric or had gene flow. There are a very few results seeing very limited gene flow between bonobos and chimpanzees, but most results show that there was a clean split, which is also the result we got in the bonobo genome paper. For the orang-utans we showed in the orang-utan genome paper we saw gene flow between the two orang-utans, and for the gorillas we saw the same in the gorilla genome paper.
So we sort of knew what to expect and used these three speciations as a proof-of-concept analysis, and got the results we expected there: for bonobos and chimpanzees, the clean split model is preferred, and for the other two the migration model is preferred.
Our estimates of the initial population split and end of gene flow are also in good concordance with estimates from the literature, where we have them.
One thing that might be a bit odd is the timing of the bonobo-chimpanzee split. Even using a lower mutation rate than we normally would (for very good reasons), the split is recent compared to the formation of the Congo River (from what I can find out about that, it seems very hard to figure out). The formation of the Congo River has been the argument for why that particular split was instant rather than having a long period of gene flow, but that doesn't quite fit what we see - or what earlier estimates have seen either, for that matter.
The split between humans and chimpanzees
Not surprisingly we were asked in reviews to also look at the human and chimpanzee split. As a race, we are quite self absorbed so that split is of course more interesting than what was going on in the other great apes...
Looking at this split, our gene flow model is preferred over the clean split, and we actually see quite a long period with limited gene flow.
Now whether that surprises you or not, I don't know, but it doesn't seem that far fetched to me, considering that we have seen gene flow in most of the other great ape splits and we now know that there were a at least two or three admixture events between modern humans and archaic humans.
Statistically, the simpler model should be the null model, but my prior is now preferring gene flow over a clean split.
Admixture between protohumans and protochimpanzees was proposed by Patterson et al. in their "complex speciation" scenario in 2006. Most papers that have looked at this scenario since have preferred a clean split, and the evidence for the complex speciation scenario can be explained in other ways. The large variation in TMRCA can be explained just by a large effective population size and the difference between X and autosomes by sperm competition or just selection being different on X as we think it is in central chimpanzees.
We cannot really see if the gene flow we estimate between chimpanzees and humans is any kind of complex scenario with long isolation followed by admixture - we need to model and test that explicitly - but we do see gene flow, and the time between the initial split and the end of gene flow aren't far from the initial split and then the admixture proposed by Patterson et al.