Archive for the ‘Research’ Category

Efficiently handling largish genetic data sets

Sunday, April 7th, 2013

Okay, so I'm working on this dataset containing called variants from about 20 sequenced animals from two species and maybe a pair of hybrids. It's not an enormous data set, but there is about 12 million markers and when I load it into R without being clever about it, it doesn't fit into RAM on my desktop machine.

That tells me two things: 1) I need to get a new desktop computer (hell, the laptop I'm writing this blogpost on has more memory than my desktop computer) and 2) I need to be more clever in how I manage data (because the next data set I have in my pipeline has hundreds of individuals from several species and a lot more variable sites; upgrading the computer will not save me there).

There is nothing of what I'm doing right now that I can't simply do by reading line by line through a VCF file, and that is how I've solved my problems so far. It is just pretty time consuming and I was hoping I could come up with something that would be easier and faster.

Speed is important up to a point. It would be nice to be able to work with the data interactively, but as long as I can handle my scans in a few hours on our computer cluster it is fast enough. Ease of use is more important, because while I can deal with waiting a day for a computation to finish I find it much harder to set aside a day for programming the scripts for the computation.

With R data.frames I find it pretty easy to pull out most information for an easy computation. With the foreach and plyr packages it is even easier. It isn't that hard doing it in Python either, although there is usually a bit more IO code for parsing up files and writing the output to files so I can load it into R afterward.

I know there are packages for handling large data sets in R, but since I would need to come up with a new solution for what I'm doing anyway I thought I might try out a few things and maybe come up with a solution that is easy to work with from both Python and R (since I often use both to manipulate the data I'm analysing).

Using an SQL database for genetic data

The data I have might be large when I have to keep it in RAM but it isn't large compared to most databases, so I figured I could try an SQL solution. SQLite would be my preferred solution since it doesn't require a server (and I don't have access to my server from my computing grid) but I have played around with both SQLite and MySQL this weekend.

I'm pretty sure I'm doing something wrong, though, 'cause my code is running very slow. Slower than if I just scanned through flat text files at least.

I've set up the database like this (here in MySQL syntax):

-- Drop the tables, we assume we are starting from scratch
DROP TABLE IF EXISTS DogDerivedCounts;
DROP TABLE IF EXISTS Genotypes;
DROP TABLE IF EXISTS Dog;
DROP TABLE IF EXISTS Markers;

-- Create the three main tables, reading the data in from files
CREATE TABLE Markers (
  marker_id INTEGER NOT NULL PRIMARY KEY,
  scaffold TEXT NOT NULL,
  POSITION INTEGER NOT NULL,
  ref_allele TEXT NOT NULL,
  alt_allele TEXT NOT NULL,
  qual_score FLOAT NOT NULL)
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Markers.txt' INTO TABLE Markers;

CREATE TABLE Dog (
  marker_id INTEGER NOT NULL PRIMARY KEY,
  chromosome TEXT NOT NULL,
  POSITION INTEGER NOT NULL,
  ref_allele TEXT NOT NULL,
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Dog.txt' INTO TABLE Dog;

CREATE TABLE Genotypes (
  marker_id INTEGER NOT NULL,
  individual TEXT NOT NULL,
  ref_read_count INT,
  alt_read_count INT,
  called_ref_count INT,
  call_quality NUMERIC,
  PRIMARY KEY(marker_id, individual(5)),
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Genotypes.txt' INTO TABLE Genotypes;

-- Now create the orientation using dog as the ancestral allele...

-- Table of derived allele counts when oriented with respect to dog
CREATE TABLE DogDerivedCounts(
  marker_id INTEGER NOT NULL,
  individual TEXT NOT NULL,
  derived_count INTEGER,
  PRIMARY KEY(marker_id, individual(5)),
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;

-- Insert those where the reference allele is the dog allele
INSERT INTO DogDerivedCounts
SELECT Dog.marker_id, Genotypes.individual, Genotypes.called_ref_count
FROM Dog INNER JOIN Markers ON(Dog.marker_id = Markers.marker_id)
INNER JOIN Genotypes ON(Dog.marker_id = Genotypes.marker_id)
WHERE (Markers.ref_allele = Dog.ref_allele);

-- Insert those where the alternative allele is the dog allele
INSERT INTO DogDerivedCounts
SELECT Dog.marker_id, Genotypes.individual, (2 - Genotypes.called_ref_count)
FROM Dog INNER JOIN Markers ON(Dog.marker_id = Markers.marker_id)
INNER JOIN Genotypes ON(Dog.marker_id = Genotypes.marker_id)
WHERE (Markers.alt_allele = Dog.ref_allele);

The idea is that I have a table with marker information, another table where I store out group information (here a Dog, but I have other out groups I treat the same way), and then a table that provides the genotype for each individual. The latter might be the problem for my code, because of course it is harder to pick out an individual from a long list of all genotypes rather than scan through all the genotypes for that one individual, so maybe I'm better off having a genotypes table per individual, I don't know...

Anyway, now the hope was that I could write SQL expressions that would be easy to write and could be reused in both Python and R.

SELECT expressions are a lot like list comprehension in Python and R's foreach and they are pretty easy to write. Joins give me an easy way of zipping lists together to combine information. It costs me a lot in running time, though, because I have no experience in optimising SQL.

To compute ABBA/BABA patterns for D statistics and similar I wrote the code below that first filters genotypes based on read depth and call quality and then count the site patterns.

DROP TABLE IF EXISTS tmp1;
DROP TABLE IF EXISTS tmp2;
DROP TABLE IF EXISTS tmp3;

-- Filter the derived counts based on read depth and quality threshold
CREATE TEMPORARY TABLE tmp1 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "PB7"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

CREATE TEMPORARY TABLE tmp2 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "ABC1"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

CREATE TEMPORARY TABLE tmp3 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "GRZ"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

-- Now collect the ABBA/BABA sites by counting the site patterns
SELECT DISTINCT Dog.chromosome, tmp1.derived_count, tmp2.derived_count, tmp3.derived_count,
(2 - tmp1.derived_count) * tmp2.derived_count * tmp3.derived_count AS ABBA,
tmp1.derived_count * (2 - tmp2.derived_count) * tmp3.derived_count AS BABA,
COUNT(*)
FROM tmp1
INNER JOIN tmp2 ON (tmp1.marker_id = tmp2.marker_id)
INNER JOIN tmp3 ON (tmp1.marker_id = tmp3.marker_id)
INNER JOIN Dog  ON (tmp1.marker_id = Dog.marker_id)
GROUP BY Dog.chromosome, tmp1.derived_count, tmp2.derived_count, tmp3.derived_count;

Having to build three temporary tables with the filtered information might actually be a stupid idea and the reason it is so slow, but I don't know...

Alternatives

I feel that I should be able to make an efficient solution with SQL, if only I knew more about database design, but perhaps I am misguided. This data has the added structure that I know that I always have information about every marker (ok, it might be missing, but then I can store it as such) and if I just keep it in lists in the right order I should be able to pull out summaries a lot faster than having to join the data together.

I have some ideas for how to code that up as a little library based on lists of the same length and queries based on list comprehension, but there must be solutions like that out there that I just don't know about.

A new isolation with migration model along complete genomes infers very different divergence processes among closely related great ape species

Friday, December 21st, 2012

We have just published a new paper in PLoS Genetics:


A New Isolation with Migration Model along Complete Genomes Infers Very Different Divergence Processes among Closely Related Great Ape Species

We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousands years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.

It is mainly a methods paper; we could have done a lot more interesting stuff with the data analysis but since the main message of the paper, in our view, is the new inference model we really only did proof-of-concept analysis. So don't expect exciting new biological insight (well, there is one thing that's a bit exciting, but I'll get back to that).

It's the modelling that is the main result.

A bit of background for CoalHMMs

We've been working with inferring parameters of speciation - split times, effective population sizes, amount of incomplete lineage sorting - using the sequential Markov coalescence model for several years now. We just haven't called our models SMC models, mainly because we didn't think in terms of actually modelling the coalescence process in our first paper in 2007. There we just had a hidden Markov model and figured out how to get coalescence parameters out of its transition matrix.

We did parameterise the model directly from the coalescence process in 2009 - in the process figuring out that some of the symmetries we thought we needed the first model shouldn't be there, while others should. I still prefer to distinguish between the population genetics model, SMC, and the inference model model, CoalHMM (or SMCHMM maybe, but that is harder to pronounce).

SMC (and SMC' for that matter) assumes that there is at most one recombination between any two neighbouring genealogies.  This is mathematically correct in those models since they work with continuous space. No two recombinations can ever hit the same point there.

It is also perfectly sensible to assume that only one recombination occurs between any pair of nucleotides when having discrete space, in most cases, since the recombination rate is very small compared to the coalescence rate.

We were comparing genomes from different species, though, where the divergence between two genomes can be several millions of years, and there it turns out that you introduce a large bias in estimating the recombination rate if you don't take this into account. Although the largest problem with the model from 2009 was not actually multiple recombinations but not allowing "back-coalescence" from a lineage into itself (which is the difference between the SMC and SMC' models).

Not realising that SMC' would probably solved most of the problems we had, and probably also because I didn't want to have to derive the equations needed to use it, we wanted a model that didn't make any assumptions about the number of recombinations and where lineages would coalesce. We of course need the Markov assumption to make any inference on a genome scale feasible, but aside from that we didn't want any other assumptions.

I was playing around with this for several months until, in a discussion with Carsten Wiuf, I had a Eureka moment. Once you accept that you have the Markov assumption, you need only worry about pairs of neighbouring nucleotides, and although the state space of all possible ancestral configurations is large it is not impossible to compute and manipulate.

You can build the state space for ancestral configurations, it is finite if you only have two nucleotides (it is as long as you have discrete space, really), and this is just one big continuous time Markov chain.  You can easily compute the probability of any run of this CTMC, and if you discretise time you get a finite number of runs (or equivalence classes of runs, really, but who cares?).

As a side note I should say that if I had know the literature better I should have known that this is the way to do it. Simonsen and Churchill did it already back in 1997 and Slatkin and Pollack something very similar to what we implemented in 2006. I just didn't know about this until long after we published our method. Anyway...

Getting a transition matrix for the HMM from this CTMC is relatively straightforward when you only look at two genomes, and we published a model doing this last year. It is very similar to PSMC in that it uses discrete time intervals as hidden states and makes inference in the coalescence process by building the transition matrix from a model of the coalescence process.

Explicitly modelling the ARG for two nucleotides, however, makes it very easy to model any kind of demographic scenarios. If you can model it with the coalescence process, you can also model it with the coalescence process limited to two nucleotides, and then use this framework to translate that into an HMM for inference.

In theory, at least. There are some practical challenges in automatically building the CTMC from a demographic scenarios - and you definitely do not want to do that by hand! - and in picking out which runs of the CTMC are equivalent and how they are mapped into the transition matrix of the HMM.

None of it is really that complicated in hindsight. There were just a bunch if ideas that were only simple once we worked them out. Like using a coloured Petri net to specify the models or using strongly connected component graphs to get the transition matrix ... none of it is stuff you really brute force your way to the solution, but it is simple enough once you have the idea.  We published how we did it in a paper earlier this year.

We've had the framework coded up for almost a year now, but it has taken surprisingly long to bugfix and validate. I'm always surprised at how long it takes from having something that works in theory until you've validated how it works in various scenarios where your assumptions are likely to be violated and we did a lot of that for this paper. You can read all about it in the supplemental material.

An isolation-with-migration CoalHMM

The model we present in this new paper is probably the simplest extension you can make in the framework to the model we published last year.

The model from last year (well, it is a bit older than that really, but it was a companion for the orangutan genome paper and so wasn't published until that paper came out last year) is a simple isolation model.  Given one genome from each of two species, it uses the distribution of "time to the most recent common ancestor" (TMRCA) to infer the ancestral effective population size and the time since the speciation.

In the new model, we add to that model a period of gene flow after the initial split in two populations until it eventually stops. This means that we work with three CTMCs in the model. One from the present day back to when gene flow is possible (looking backwards in time, as we do for coalescence models, confusing as that can be), then another where gene flow is possible, and then the final one where we have one single panmictic population.

Connecting these three CTMCs the right way, we get a single (non-homogeneous) CTMC, and from the SCC graph of this we get the transition matrix for the HMM.

Where the model really gets its information for inference is the distribution of TMRCA both in time and in space.  The distribution for the TMRCA is very different in an isolation model and an isolation-with-migrath model. For the time distribution Rosenberg and Feldman wrote a very nice review in 2002, and we looked a bit at it in a paper last year. Dutheil and Hobolth (co-authors on most of the CoalHMM papers, including the new one) also discusses it a bit in a chapter from this year. But it is not just the time distribution that is different between the two models, also how the TRMCA is distributed along a sequence alignment, and our model uses both the temporal and spatial distribution in its inference.

Using the model, we can infer when the ancestral population initially split up, how long gene flow still occurred, and how much gene flow there was. There are some symmetries in the model that makes the direction of gene flow non-identifiable, so we only estimate symmetric gene flow, though.

Comparing models with and without gene flow, we can also determine which fits the data best, and in that way determine if there was a clean allopatric split, or if there was a period of gene flow.

We cannot, though, determine if the gene flow was continuous over that time interval, or if it was a series of isolations and admixture events ... whether we can get to that with the framework we have is something we are currently working on. It looks a bit like we can see different levels of gene flow in different time intervals with it, but results are very preliminary still.

Speciation in the Great Apes

As you probably know, there are two species in three of the four great ape species; humans are the only lonely ones.  For Pan there are common chimpanzees and bonobos, for the gorillas the eastern and western gorillas are different species, and for orang-utans the Bornean and the Summatran orang-utans are different species.

From earlier studies we already have a good idea about whether those speciations were allopatric or had gene flow. There are a very few results seeing very limited gene flow between bonobos and chimpanzees, but most results show that there was a clean split, which is also the result we got in the bonobo genome paper. For the orang-utans we showed in the orang-utan genome paper we saw gene flow between the two orang-utans, and for the gorillas we saw the same in the gorilla genome paper.

So we sort of knew what to expect and used these three speciations as a proof-of-concept analysis, and got the results we expected there: for bonobos and chimpanzees, the clean split model is preferred, and for the other two the migration model is preferred.

Our estimates of the initial population split and end of gene flow are also in good concordance with estimates from the literature, where we have them.

One thing that might be a bit odd is the timing of the bonobo-chimpanzee split. Even using a lower mutation rate than we normally would (for very good reasons), the split is recent compared to the formation of the Congo River (from what I can find out about that, it seems very hard to figure out). The formation of the Congo River has been the argument for why that particular split was instant rather than having a long period of gene flow, but that doesn't quite fit what we see - or what earlier estimates have seen either, for that matter.

The split between humans and chimpanzees

Not surprisingly we were asked in reviews to also look at the human and chimpanzee split. As a race, we are quite self absorbed so that split is of course more interesting than what was going on in the other great apes...

Looking at this split, our gene flow model is preferred over the clean split, and we actually see quite a long period with limited gene flow.

Now whether that surprises you or not, I don't know, but it doesn't seem that far fetched to me, considering that we have seen gene flow in most of the other great ape splits and we now know that there were a at least two or three admixture events between modern humans and archaic humans.

Statistically, the simpler model should be the null model, but my prior is now preferring gene flow over a clean split.

Admixture between protohumans and protochimpanzees was proposed by Patterson et al. in their "complex speciation" scenario in 2006. Most papers that have looked at this scenario since have preferred a clean split, and the evidence for the complex speciation scenario can be explained in other ways.  The large variation in TMRCA can be explained just by a large effective population size and the difference between X and autosomes by sperm competition or just selection being different on X as we think it is in central chimpanzees.

We cannot really see if the gene flow we estimate between chimpanzees and humans is any kind of complex scenario with long isolation followed by admixture - we need to model and test that explicitly - but we do see gene flow, and the time between the initial split and the end of gene flow aren't far from the initial split and then the admixture proposed by Patterson et al.


On the radio about orangutans

Thursday, March 10th, 2011

I'm working on a text book chapter and am a few weeks past the deadline, so I haven't had time to blog lately. I have a few things I'd like to write about as soon as I am done with the chapter, but in the mean time here's a podcast radio interview with me and Mikkel Schierup (in Danish, though, sorry about that):

Chimp research

Friday, February 4th, 2011

Sorry for linking to a lot of Danish sites, but this story is also kind of cool: Dansk forskning skal redde chimpansen.

Its about genetics of chimps, and a project that I am involved in. We have sequenced the exomes of 30 chimps (eastern, central and western chimps) and we hope to submit out first paper in a few weeks.

It's a really cool project, but unfortunately I cannot say more about it until it is out...

CoalHMM analysis of the human/chimpanzee ancestor, based on the orangutan genome

Thursday, February 3rd, 2011

I've been wanting to write about our paper on the orangutan genome for a while, but I've just been too busy so far, so a little late I finally get to it.

Besides the Nature paper, where we contributed to the analysis of the two sub-species of orangutans, we have two companion papers. One is already out in "early access" at Genome Research and the other will be out later in PLoS Genetics. Since the latter paper is not out yet, this post will be about the Genome Research paper.

Coalescent in an isolation model

Since all our work is based on coalescent theory and in particular CoalHMMs, I'll start there.

Imagine we have two species, and we sample a gene in each. We can then ask, what is the divergence between the two genes? This divergence will be determined by 1) the divergence of the two species, let's call that T, and 2) the coalescence time between the two genes within the ancestral species, let's call that C.

The species divergence we assume is fixed for all genes, so while it is unknown it is not a stochastic variable. The coalescence time, however, is stochastic, and from coalescence theory we expect it to be exponentially distributed with a rate determined by the effective population size in the ancestral species.

We call this setup an isolation model, and we will use the distribution of divergence times to make inference about the speciation time and the effective population size in the ancestral species.

The figure below illustrates the setup.

If C is exponentially distributed, and the divergence is given by D=C+T, then we can make inference about both parameters as follows: We sample a number of independent genes and get their divergence time. For the exponential distribution, the mean is equal to the standard deviation, so looking at the standard deviation of the divergences we can get the parameter for the exponential distribution. That gives us the mean value of C, and if we then look at D-E[C] we get an estimate for T.

Below is an example of this, where I've estimated the coalescence rate and divergence time from 50 divergence samples.

Complications

This is all very simple, but there are a few problems.

First, you don't really get independent samples of the divergence time between two species. If you sample n individuals from the first species and m from the second, the n in the first species will all have found a common ancestor before that lineage reach the ancestral species, and the same goes for the m samples in the other species. So no matter how many individuals you look at, you end up with a sample of two in the ancestral species. I've written about this before here.

It is not a show-stopper, though, since genes in different parts of the genome are close enough to independent. So if you sample different loci instead of different individuals, you get your independent samples. So while adding more individuals won't help, having an entire genome to look at gives you plenty of samples.

The second problem is that we cannot actually get samples of the divergence time. You cannot look at two pieces of DNA and from that get their divergence. You need to estimate it. It isn't really that hard, since you can get a good estimate from the number of differences between the two sequences. That is, if the entire alignment of sequences have the same divergence time.

If there is a recombination somewhere in the sequences, they do not have the same divergence time, and you cannot estimate the divergence.

You can get around this by looking at short DNA segments, where you expect few if any recombinations. You won't get a good estimate of the divergence then, but you can maybe alleviate this by having a lot of genes (but estimating the coalescence rate based on a standard deviation that have contributions from both the coalescence process and the estimation problems is, well, problematic).

You'd also have to throw most of your data away if you are looking at short segments scattered along the genome (and you cannot have them too close to each other, because then they will no longer be independent).

The CoalHMM approach

The models we develop to deal with this are based on hidden Markov models.

Using these models, we can estimate the divergence time for single nucleotides. Normally you cannot, since they are either equal or difference, and that doesn't tell you much about their divergence (is it zero for equal and infinity for different?). We can do this, because the flanking DNA contains information about this, whether recombinations have occurred or not, and we can capture this information through the Markov model.

It is a rough approximation to the coalescence process, but as far as we can tell, it works reasonably well.

We are getting pretty close to being able to estimate the distribution of divergence times using hidden Markov models, but the model we use is the one that will be published in PLoS Genetics soon and not the model we used in the Genome Research paper, so I'll wait a bit with describing how that works.

The model we used in the Genome Research paper is the one described in this paper.

In this model, we do not attempt to estimate the actual divergence times, but instead use something called incomplete lineage sorting. The idea here is, that if we have a third species closely related to the other two, then sometimes the two sister species have such deep divergence times, that one of them can end up being closer related to the third species than its sister species.

This leaves a stronger signal in the DNA and is thus easier to model and make inference about.

The model based on this needs only four states: one state where the two sister species coalesce early, and three states with deep divergence. If the divergence is deep, the topology of relationships between the species should be uniform -- each topology is seen with one third probability -- and how often we see deep divergences is given by the two speciation times together with the effective population size of the ancestor of the sister species.

As we scan along a genome alignment, we can infer how often we see recent divergences and how often we see deep divergences, and how the deep divergences are distributed along the three topologies.

Below is a figure that Julien made for illustrating this.

With this model, you don't extract as much information from the genomes as you would if you could estimate the divergence times, but with full genomes to work with, you have plenty of information to get good estimates.

You need three closely related species to work with, though.

Incomplete lineage sorting patterns among human, chimpanzee and orangutan suggest recent orangutan speciation and widespread selection

And now, finally, we get to the paper.

Incomplete lineage sorting patterns among human, chimpanzee and orangutan suggest recent orangutan speciation and widespread selection
Asger Hobolth, Julien Y. Dutheil, John Hawks, Mikkel H. Schierup and Thomas Mailund

Abstract

We search the complete orangutan genome for regions where humans are more closely related to orangutans than to chimpanzees due to incomplete lineage sorting (ILS) in the ancestor of human and chimpanzees. The search uses our recently developed coalescent HMM framework. We find ILS present in ~1% of the genome, and that the ancestral species of human and chimpanzees never experienced a severe population bottleneck. The existence of ILS is validated with simulations, site pattern analysis, and analysis of rare genomic events. The existence of ILS allows us to disentangle the time of isolation of humans and orangutans (the speciation time) from the genetic divergence time, and we find speciation to be as recent as 9-13 mya (contingent on the calibration point). The analyses provide further support for a recent speciation of human and chimpanzee at ~4 mya and a diverse ancestor of human and chimpanzee with an effective population size of ~50,000 individuals. Posterior decoding infers ILS for each nucleotide in the genome and we use this to deduce patterns of selection in the ancestral species. We demonstrate the effect of background selection in the common ancestor of humans and chimpanzees. In agreement with predictions from population genetics, ILS found to be reduced in exons and gene dense regions when we control for confounding factors such as GC content and recombination rate. Finally, we find the broad scale recombination rate to be conserved through the complete ape phylogeny.

In this paper we used humans, chimpanzees and orangutans.

The first question to ask is then, are these three species close enough that we see incomplete lineage sorting?

Without it, we don't have the signal in the data that we need for the model.

Based on previous estimates of the species divergence times and ancestral effective population size of humans and chimpanzees we could work out that some was expected. So that is a good start. To make sure, though, we used some simpler approaches. We looked at indels to check if there would be signals in these supporting clustering of human and orangutan or chimp and orangutan and found that. We also looked at the distribution of alignment columns and again found some signals for alternative topologies of the three species. So with that checked, we applied the model.

From the model we estimate three things: 1) The speciation times for humans and chimps, and from the African apes and orangutan, 2) the effective population size of the ancestral species, and 3) in which regions of the genome humans and chimps, humans and orangutan, and chimp and orangutan are closest related.

I won't say much about number two. The effective population size is a weird parameter that can be affected by so many things, that it is really hard to interpret, and right now we just don't know what really is important, so I'd rather not make any claims (but I'll say a few things about local effective population sizes towards the end of the post).

Number one is interesting because it tells us something about when humans diverged from the other two apes. Our estimates are measured in the number of substitutions since the divergence, but assuming a molecular clock and assuming we have a good estimate of the rate we can get an estimate in years.

Assuming a rate of around 1 substitution per nucleotide per billion years -- an estimate based on several earlier papers that get this number from calibrations with the fossil record -- we get a human/chimp speciation around 4-4.5 million years ago, and a human/orangutan speciation around 11-13 million years ago.

I really don't know how reasonable this is, in relation to the fossil record, so this is when we got John Hawks involved. I have my fingers crossed that he will blog about this at some point.

There are good reasons to be a bit skeptical, though. From recent studies, we know that the substitution rate is lower in humans today, and if that is also true in the past, the estimates should be moved further back in time. We cannot get too far back, though, without running into inconsistencies in the deeper past, but how this will all play out once we do more analysis I cannot say yet. It is something we look into for the gorilla genome (and I'll just leave that as a cliff hanger for now, I'll get back to it when we have published that genome).

For number three, I don't really know. You might be surprised that we are sometimes closer related to the orangutan than the chimpanzee, or you might not. It depends on your prior assumptions, I guess.

We didn't really find anything cool correlated to the patterns of relatedness, so we don't have much of a story to tell about this.

Ancestral selection

The final thing we looked at in the paper was correlations between incomplete lineage sorting and gene density.

Why this is interesting gets a bit technical but has to do with the effective population size.  As I mentioned above, it is a bit of a weird parameter, but one that is affected by selection. If you have a selective sweep the genetic diversity is reduced, and you see this as a reduction in the effective population size. The same effect is seen with purifying selection, where again the genetic diversity is reduced and so is the effective population size.

Incomplete lineage sorting is positively correlated with the effective population size, so if you observe a correlation between incomplete lineage sorting and gene density, it is a signal for selection.

We observe this, and take it as a signal that selection rather than just drift has been a major player in the evolution of our genome.

How much of a surprise this is depends on your prior assumptions again, I guess, but it does indicate that neutrality may not always be the obvious null model for genome analysis.

It is a pretty weak signal for this, though, in this analysis. We see so little incomplete lineage sorting for these three species that it is really hard to analyse it in detail.

When we get human, chimp and gorilla, there is a lot more incomplete lineage sorting, and we can do a lot more. We are seeing some cool signals there, but I'll let that be the second cliff hanger for the gorilla genome paper.

--
Hobolth, A., Dutheil, J., Hawks, J., Schierup, M., & Mailund, T. (2011). Incomplete lineage sorting patterns among human, chimpanzee and orangutan suggest recent orangutan speciation and widespread selection Genome Research DOI: 10.1101/gr.114751.110