Posts Tagged ‘speciation’

Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation

Wednesday, August 26th, 2009

A few days ago I wrote about the hypothesis of complex speciation between humans and chimps, and today I'll briefly discuss another paper on the human / chimp speciation:

Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation

Matthew T. Webster, Gene 443 70-75, 2009

Abstract

There is a large variation in divergence times across genomic regions between human and chimpanzee. It has been suggested that this could partly result from selection against ancestral gene flow between incipient species in regions of the genome containing genetic incompatibilities. It is possible that such barriers to gene flow could arise in specific genes or in chromosomal inversions. I analysed patterns of lineage sorting that occur between human, chimpanzee and gorilla genomic sequences by examining divergent site patterns in > 18 Mb genomic alignments. I develop a method to normalise site patterns by the mutational spectrum to minimise errors caused by misinference caused by recurrent mutation. Here I show that divergence times appear to be uniform between coding and noncoding sequences and between inverted and non-rearranged portions of chromosomes. I therefore find no evidence to support the large-scale accumulation of genetic incompatibilities at speciation genes or chromosomal inversions in the ancestral population of humans and chimpanzees. In addition, site patterns that are discordant with the species tree occur more frequently in regions with high human recombination rates. This could indicate the action of selective sweeps in the ancestral population, but could also be indicative of increased rates of homoplasy in these regions. I argue that these observations are compatible with a neutral allopatric model of speciation.

Models of speciation

Speciation happens when gene flow stops between one group of a species and another (and doesn't start again later or we get something like the hybridization scenario I wrote about in my earlier post).

There are different ways this can happen.  For instance, one group might become geographically isolated from the other - e.g. by ending up on the other side of a large river - which cuts it off from the rest of the species.  This is known as allopatric speciation (or, depending on exactly how it plays out, peripatric speciation).

In this scenario, the speciation happens at the time the groups become isolated.  From that point onwards the groups are essentially different species, since gene flow has stopped.  It will take some time before the groups are incapable of inter-breeding, but unless they actually merge again at some time before then, the time of the speciation event is the time the groups got separated.

That doesn't mean that the genomic divergence time between the two species matches the time back to the speciation event.  For a few generations, some individuals in one of the groups might be more closely related to individuals in the second group than to the other individuals in their own group.  So the genetic distance between the two species is a bit larger than the "species distance".  Add in recombination and the picture gets a bit more complex.

Still, we can talk about a specific point in time where the speciation occurred, and we have a mathematical model - the coalescent model - of the genomic distance between the two species that depends on this time and on the population genetics of the ancestral species before then.
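To make the gap between speciation time and sequence divergence concrete, here is a minimal sketch under the simplest neutral coalescent assumptions: one lineage from each species enters the ancestral population at the speciation time, and the two then wait an exponentially distributed time with mean 2 * Ne generations (diploid population) before they find a common ancestor.  The function name and the parameter values are made up for illustration; they are not estimates from any of the papers discussed here.

```python
import random

def sample_divergence_times(speciation_time, ancestral_ne, n_loci, seed=0):
    """Sample sequence divergence times (in generations) for independent loci.

    Assumption: after the speciation time, the two lineages coalesce in the
    ancestral population after an exponential waiting time with mean
    2 * ancestral_ne generations (standard neutral coalescent, diploid).
    """
    rng = random.Random(seed)
    mean_wait = 2.0 * ancestral_ne
    return [speciation_time + rng.expovariate(1.0 / mean_wait)
            for _ in range(n_loci)]

# Hypothetical numbers: speciation 200,000 generations ago, ancestral Ne of 50,000.
times = sample_divergence_times(speciation_time=200_000,
                                ancestral_ne=50_000,
                                n_loci=10_000)
mean_time = sum(times) / len(times)
print(f"speciation: 200000 generations, mean divergence: {mean_time:.0f}")
```

Every sampled divergence time is at least as old as the speciation event, and the average sits roughly 2 * Ne generations further back, which is the "a bit larger" in the text above.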

The speciation can also be caused by "genetic isolation".

If a new mutation enters the group, where homozygotes for either the wildtype or the mutant are fitter than the heterozygotes, then the group will tend to split into two: the mutants and the wildtypes.

Without recombination, there wouldn't be much difference in the genomic distance between the two resulting species.  The heterozygotes would be selected against and the two homozygotes would diverge.

With recombination, again the situation gets a bit more complicated.  The heterozygotes would still be selected against, but assuming heterozygotes still manage to mate from time to time, you would get homozygous offspring of heterozygotes that are just as fit as other homozygotes.

Because there is selection against heterozygotes you will tend to split the species into two - the two homozygotes - but the divergence will be deeper at the locus of the mutation than in the rest of the genome.

We call such a locus a "speciation gene" and candidates for such genes are functional genes (where we expect some selection) or structural variations such as inversions.

Back to the paper...

What Webster looks at in this paper is the patterns of divergence - especially deep coalescence events with incomplete lineage sorting where we observe sites grouping human and gorilla or chimp and gorilla - in the genome.

He then looks at these patterns in genes, introns, inversions ... the candidates for speciation genes, to see if these look like they are more divergent than the rest of the genome.  If so, then the speciation between humans and chimps could be caused by speciation genes.  If not, then the speciation could be allopatric (the same "species divergence" throughout the genome, but of course not the exact same sequence divergence since the coalescence times will still vary along the genome).

Long story short, he doesn't find any evidence for deeper divergence in these places, so we cannot rule out allopatric speciation here.

He does find a correlation between recombination rate and deep divergence, which can be explained by either increased mutability in regions of high recombination or selective sweeps in the ancestral species.  The latter is much more interesting, really, but we cannot rule out the first explanation so I won't comment much on this here...

Criticism

I do have a slight problem with the analysis in the paper, though.

It seems to me that just looking at differences in divergence time between genes and the rest of the genome - or between inversions and the rest of the genome or whatnot - is not a particularly powerful way of detecting speciation genes.

When comparing general groups like this, it seems to me that a few speciation genes would simply be drowned out by the larger number of "plain old genes".  So all the analysis is really saying is that there isn't a large number of speciation genes between humans and chimps, not that there are none.

The paper doesn't claim any more than this either, but it would be interesting to work out just how large a fraction of the genes would have to be speciation genes - and how large a difference between the divergence of speciation genes and the rest of the genome there has to be - to be able to distinguish between the two scenarios with this analysis.

I haven't done the math yet, but I plan to when I get the time...
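As a very rough sketch of what such a calculation could look like (this is not Webster's method, and every number below is hypothetical): treat a class of genes as a mixture where a fraction f are speciation genes with divergence deeper by some amount, and ask how large f must be before the class mean is expected to differ from a known genome-wide mean by about two standard errors.  It assumes independent genes, equal spread within both components and a simple z-test, all of which are simplifications.

```python
import math

def mean_shift_z(fraction, extra_divergence, sd, n_genes):
    """Expected z-score for the shift in mean divergence of a gene class.

    fraction: share of genes in the class that are speciation genes
    extra_divergence: how much deeper those genes diverge (same units as sd)
    sd: standard deviation of divergence among ordinary genes
    n_genes: number of genes in the class
    """
    shift = fraction * extra_divergence
    # Variance of a two-component mixture with equal within-component sd.
    mix_var = sd ** 2 + fraction * (1.0 - fraction) * extra_divergence ** 2
    return shift / math.sqrt(mix_var / n_genes)

def min_detectable_fraction(extra_divergence, sd, n_genes,
                            z_crit=1.96, steps=10_000):
    """Smallest fraction on a grid whose expected z-score reaches z_crit.

    Returns None if even a class made up entirely of speciation genes
    would not reach the threshold.
    """
    for i in range(1, steps + 1):
        f = i / steps
        if mean_shift_z(f, extra_divergence, sd, n_genes) >= z_crit:
            return f
    return None

# Hypothetical example: 1000 genes, speciation genes one sd deeper.
print(min_detectable_fraction(extra_divergence=1.0, sd=1.0, n_genes=1000))
```

With these made-up numbers a few percent of the genes would need to be speciation genes before the group mean moves noticeably, which is the "drowned out" effect described above.  A proper power analysis would also have to deal with correlation along the genome.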

--
Webster, M. (2009). Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation Gene, 443 (1-2), 70-75 DOI: 10.1016/j.gene.2009.05.006

I just don't know enough paleontology

Wednesday, August 26th, 2009

I just read this post by John Hawks this morning over my morning coffee.  I totally agree with this sentence:

Many years ago, I got used to the fact that paleontologists and geneticists live in separate realities.

and I find this quite disturbing.

I'm one of the geneticists trying to figure out the ancestry of apes and trying to date the speciation events, and I just cannot read the paleontology papers.  Well, I can read them, but I really don't understand them, so I often end up just scanning for estimates of speciation times without being able to judge how they come about.

Just last week I tried to figure out the divergence time between humans and orangutans to relate it to the estimates we get in the orangutan genome project.

For example, following a reference from another paper I read this one that, according to the first paper, was supposed to give a lower bound on the speciation of 18 million years ago.  First I just scanned the PDF for "18", but wherever "18" appears the units are mm, so not exactly what I was looking for.  So I tried actually understanding the paper... I probably failed, 'cause as far as I understand it, it gives an upper bound of 20 million years ago.

Scanning the supplemental information of the first paper I then found that they use the 18 mya both as an upper and a lower bound, depending on which table you look at, and that just makes it that much more confusing.

As a side remark, here I agree with John Hawks again:

After quoting from their online supplement (once again, grumbling that the essential details are hidden online where nobody reads them!)

I hope that it is an upper bound, since a lower bound would be very inconsistent with our genetic estimate, but I just wish I could be sure I understood the paper...

--


Widespread genomic signatures of natural selection in hominid evolution

Tuesday, May 12th, 2009

Friday last week, PLoS Genetics published a paper I've been waiting to read for a few weeks, since I saw a reference to it in a draft of a review paper I got by email (that paper I'll tell you all about when it comes out).

The PLoS Genetics paper is this:

Widespread Genomic Signatures of Natural Selection in Hominid Evolution

Graham McVicker, David Gordon, Colleen Davis, and Phil Green

Selection acting on genomic functional elements can be detected by its indirect effects on population diversity at linked neutral sites. To illuminate the selective forces that shaped hominid evolution, we analyzed the genomic distributions of human polymorphisms and sequence differences among five primate species relative to the locations of conserved sequence features. Neutral sequence diversity in human and ancestral hominid populations is substantially reduced near such features, resulting in a surprisingly large genome average diversity reduction due to selection of 19–26% on the autosomes and 12–40% on the X chromosome. The overall trends are broadly consistent with “background selection” or hitchhiking in ancestral populations acting to remove deleterious variants. Average selection is much stronger on exonic (both protein-coding and untranslated) conserved features than non-exonic features. Long term selection, rather than complex speciation scenarios, explains the large intragenomic variation in human/chimpanzee divergence. Our analyses reveal a dominant role for selection in shaping genomic diversity and divergence patterns, clarify hominid evolution, and provide a baseline for investigating specific selective events.

The reason I've been waiting for the paper is that it concerns something I am very interested in myself, and something we are working on in our CoalHMM group here at BiRC: detecting selection by detecting variation in effective population size along the genome.

Effective population size

Okay, the concept "effective population size" is a strange beast.  It doesn't really have anything to do with population size, except in an idealised mathematical model, but is a single parameter that incorporates various different measures such as demographics and selection.

There's a nice introduction to it in this John Hawks post: Did humans face extinction 70,000 years ago?

As described there, one way of looking at the effective population size is to define it from the average coalescence time of two random individuals in a population.  If we look at it that way, it is clear that selection will affect the effective population size.

A site under selection, if it gets fixed, will do so much faster than a site that is neutral.  A neutral site that gets fixed does so (on average) in time linear in the effective population size, while a site under selection does so in logarithmic time (regardless of whether it is positive or negative selection, surprisingly, but of course if it is negative selection the probability of it getting fixed is smaller).

If we consider a site where mutations occur that are selected against, but these are not fixed, we still see a reduction in the time between two random individuals but for a different reason: those ancestors that were selected against do not have descendants in the present population, so the number of possible ancestors of two random individuals is smaller and when we trace their ancestry back in time, they will find a common ancestor faster.

So in any case, if a site is under selection, we expect the mean time back to a common ancestor -- the effective population size -- to be reduced.
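To put rough numbers on "linear" versus "logarithmic", here is a small sketch using two textbook-style approximations.  Both are assumptions added here, not taken from the McVicker paper: about 4 * Ne generations for a neutral mutation that does fix in a diploid population, and about 2 * ln(2 * Ne) / s generations for a strongly selected one.  The constant inside the logarithm differs between treatments, so take the second formula as an order-of-magnitude guide only.

```python
import math

def neutral_fixation_time(ne):
    """Approximate mean time to fixation (generations) of a neutral mutation,
    conditional on it fixing, in a diploid population: about 4 * Ne."""
    return 4.0 * ne

def sweep_fixation_time(ne, s):
    """Rough time to fixation (generations) of a strongly selected mutation.

    Assumption: the commonly quoted approximation 2 * ln(2 * Ne) / s, with s
    the selective advantage of the heterozygote; only sensible when
    Ne * s is much larger than 1.
    """
    return 2.0 * math.log(2.0 * ne) / s

for ne in (10_000, 20_000, 40_000):
    print(ne, neutral_fixation_time(ne), round(sweep_fixation_time(ne, s=0.01)))
```

Doubling Ne doubles the neutral fixation time but only adds a constant 2 * ln(2) / s generations to the sweep time.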

To muddy the waters a little bit: effective population size also affects selection since selection is stronger if the population size is large but that is a complication best left for another day...

Recombination

Recombination has an effect on this as well.

A site under selection will have a smaller effective population size, but so will nearby sites.  The reason for this is that neighbouring nucleotides are likely to have the same most recent common ancestor -- and thus the same divergence -- with this probability depending on the recombination distance between them.

Consequently, we expect the effective population size to decrease as we move towards a site under selection, and increase again as we move away from it.

It is this kind of pattern that McVicker et al. analyse in this paper.

Results

First they identify conserved genomic regions.  These are the regions that are probably under selection, since selection is one of the forces that will conserve sequences.

They do this by running a phylo-HMM on an alignment of mammals (excluding those they will analyse later on to avoid biasing the results).

They then split the genome into two classes: those nucleotides within the 10% of the genome closest to a conserved region, and the 50% furthest away.  In these two classes they look at the level of polymorphism in humans, the divergence between human and chimp, and the number of informative sites supporting a grouping of human with gorilla -- with chimp as an outgroup -- and those grouping chimp with gorilla -- with human as an outgroup.  The latter are signs of deep coalescence resulting in incomplete lineage sorting, and signs of a large effective population size in the human/chimp ancestor.
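Counting these informative sites is simple enough to sketch.  The version below is a simplification of what the paper describes: it uses four aligned sequences with a single outgroup (the paper uses a five-species alignment), and the function name and toy sequences are made up for illustration.

```python
from collections import Counter

def count_site_patterns(human, chimp, gorilla, outgroup):
    """Count informative site patterns in four aligned, equal-length sequences.

    'HC': human and chimp share a base, gorilla and outgroup share another
          (concordant with the species tree).
    'HG': human and gorilla share a base, chimp and outgroup share another.
    'CG': chimp and gorilla share a base, human and outgroup share another.
    Columns containing anything other than A, C, G or T are skipped.
    """
    if not (len(human) == len(chimp) == len(gorilla) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter({"HC": 0, "HG": 0, "CG": 0})
    valid = set("ACGT")
    columns = zip(human.upper(), chimp.upper(), gorilla.upper(), outgroup.upper())
    for h, c, g, o in columns:
        if not {h, c, g, o} <= valid:
            continue
        if h == c and g == o and h != g:
            counts["HC"] += 1
        elif h == g and c == o and h != c:
            counts["HG"] += 1
        elif c == g and h == o and c != h:
            counts["CG"] += 1
    return counts

# Toy alignment: one HC site, one HG site, one CG site, one invariant, one gap-like.
print(count_site_patterns("ACTAA", "ATGAA", "GCGAA", "GTTAN"))
```

The HG and CG counts relative to HC are what carry the signal about incomplete lineage sorting; recurrent mutation can produce the same patterns, which is why both papers discussed here worry about normalisation.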

For all measures, they find that the effective population size seems to be reduced for the 10% closest to conserved regions compared to the 50% farthest away.

Since the measures are essentially all just measures of conservation, really, that isn't in itself much of an argument.  All it says is that there is a correlation of conservation-ness along the genome.  To compensate for this, they then normalise with the divergence to macaque and to dog.  If it is just a reduction in substitution rate that is correlated, then normalising this way -- assuming that the substitution rate doesn't change dramatically along the genome and along the phylogeny -- will alleviate the effect from just the substitution rate.

After normalising, the signal is still there: the polymorphism and divergence are still reduced close to conserved regions.

Again, this doesn't prove that selection is the cause of this pattern, but the pattern certainly matches what we would expect to see if it was selection that caused it.  The normalisation should eliminate, or at least reduce, effects that are just caused by the substitution rate, so unless we invoke some more exotic explanation for conservation and the patterns along the genome, selection is a valid conclusion.

(A) Ratios calculated using the 10% of neutral sites which are nearest to and the 50% of neutral sites farthest away from conserved segments or exons. (B) The same ratios as (A) but normalized by human/macaque (H/M) divergence to account for mutation rate variation or undetected sites under purifying selection. The distance to the nearest conserved segment or exon was determined using four different measures: physical distance, pedigree-based recombination distance [26], polymorphism-based finescale recombination distance [25] and the background selection parameter, B. B (described in the main text) is not technically a distance measure but incorporates information about the recombination rate and local density of conserved segments. Autosomal human nucleotide diversity was calculated from gene-centric SeattleSNPs PGA/EGP [20], whole-genome Perlegen [19] data, and HapMap phase II data [67]. Divergence was estimated using autosomal human/chimp (H/C), human/macaque (H/M), or human/dog (H/D) genome sequence data. HG and CG sites (where human and gorilla or chimp and gorilla share a nucleotide that differs from the other three species) were calculated using a smaller set of 5-species autosomal data. Repetitive regions were omitted from the Perlegen and HapMap analyses; additional filtering steps are described in the methods. Whiskers are 95% confidence intervals.

Now that selection is concluded to be a plausible explanation for the pattern, they fit the data to a model that explains the variation by background selection. This model shows that selection is stronger near conserved regions than farther away, consistent with the assumption that the pattern is caused by selection.

Consequences

So what does all this tell us?

For one thing, it tells us that selection is a force we really should keep in mind when analysing genomes.  Yes, yes, we probably already knew that, but the neutrality assumption is so strong in genome analysis that we rarely consider non-neutrality except for the obligatory dN/dS tests on genes.  For anything that is not a gene, we usually analyse the sequences assuming neutrality.  It is a good null model, but completely ignoring selection when analysing genomic sequences should be reconsidered.

I know, I am overstating things a bit here, 'cause people are not just blindly assuming neutrality, but it is a strong null assumption and we really do not like to invoke selection unless there is strong evidence against neutrality.

Another consequence is for sequence divergence.

We estimate species divergence (time of speciation events) from sequence divergence.  More often than not we equate sequence divergence with species divergence, but really we shouldn't.  Even under neutrality this isn't true, since the coalescence process of sequences is such that the sequences are further apart than the species, but for neutrality at least this pattern is random along the genome.

There is still some correlation along the sequence of divergence time, under a neutral coalescence model, but at least this correlation drops off rapidly with (recombination) distance and it is not correlated with other genomic features (except in the sense that the substitution rate depends on these features).

With selection working its magic on a genome scale, the patterns of sequence divergence get a lot more interesting.

All of this is not really a new insight.  People working with e.g. Drosophila have known this for decades, but it has been ignored in more papers than I care to mention, and perhaps it is time we stop doing this.

--
McVicker, G., Gordon, D., Davis, C., & Green, P. (2009). Widespread Genomic Signatures of Natural Selection in Hominid Evolution PLoS Genetics, 5 (5) DOI: 10.1371/journal.pgen.1000471


On gene trees and species trees

Thursday, February 12th, 2009

Last week I reviewed a paper on inferring species trees based on gene trees, and I so wanted to write about it here, but of course I have to patiently wait until the paper is published.

However, today there appeared an application note in Bioinformatics (advance access) on the topic -- and there was another application note a few months back -- so this gives me an excuse to write a few words about species trees and gene trees.

The relationship between gene trees and species trees is one of my own research interests, although not the inference of the trees.  In our CoalHMM work (Hobolth et al 2007), we use the relationship between gene trees to infer information about the speciation events.  Much more on that on a later day, though.

Species trees and gene trees

When you think about phylogenetic inference, you typically think about the relationship between species in a tree.  So, for instance, the relationship between human, chimp, and gorilla would group human and chimp together and have gorilla as an outgroup.

This is the relationship between the species, but it is not the whole story.  There is population genetics going on within the branches of this tree, which we can model as a coalescence process.  This is a generalisation of the Wright-Fisher process that is mathematically easier to work with, but for the points I will make here it might be easier to think of the Wright-Fisher process.

The Wright-Fisher process is a very simple mathematical model of the evolution of a population.  It says that we have a set of discrete non-overlapping generations, where each new generation is sampled from the previous by sampling at random with replacement.  So you start out with a set of N individuals in the first generation and then you create the next generation by N times selecting a parent from the first generation at random and copying it to the next generation.

For the next generation you do the same, but this time you sample from the second generation (the one you just created)...

...and you continue this process for as many generations as you need.

This is how the process runs within a population.
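The sampling scheme just described fits in a few lines of code.  This is a minimal sketch of a haploid Wright-Fisher population of constant size; the function names are made up for illustration, and there is no selection, recombination or population structure.

```python
import random

def wright_fisher_parents(n, generations, seed=0):
    """Simulate parent choices in a Wright-Fisher population of size n.

    Returns a list with one entry per generation step; entry t lists, for
    each individual in generation t + 1, the index of its parent in
    generation t, chosen uniformly at random with replacement.
    """
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(generations)]

def generations_to_common_ancestor(parents, a, b):
    """Trace individuals a and b of the last generation back in time.

    Returns the number of generations until they share a parent, or None
    if they have not found a common ancestor within the simulated history.
    """
    for back, generation in enumerate(reversed(parents), start=1):
        a, b = generation[a], generation[b]
        if a == b:
            return back
    return None

history = wright_fisher_parents(n=50, generations=500, seed=1)
print(generations_to_common_ancestor(history, 0, 1))
```

Tracing pairs back like this is exactly the coalescence process viewed backwards in time: in each generation two distinct lineages pick the same parent with probability 1/N, so the waiting time is on the order of N generations.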

When you have a speciation event, part of the population branches off from the other part -- for some reason or other -- and from then on you can sample parents for individuals in each of the two species only from individuals in the same species.

An example with two speciation events is shown below:

This process, running inside the species tree, has two consequences: DNA divergence times do not correspond to speciation times, and the topologies for the "individuals" do not necessarily correspond to the species topology.

The first is obvious when you think about it.  The speciation event is the most recent time after which no individuals in two separate species can sample from the same individuals in the previous generation, but that does not mean that when you consider the most recent common ancestor of two individuals in separate species, that ancestor is found exactly at the speciation event.  It can be much more ancient than that.

If you know the speciation time, say from the fossil record, you do not necessarily know the divergence time of the DNA.  Conversely, if you use the molecular clock to date the split between two species, you are not dating the actual speciation time but the DNA divergence time; the speciation time is likely to be more recent.

That the topology can be different from the species tree can be seen if you consider two speciation events close in time.  Consider two "individuals", one from each of the two most closely related species.  These can have a most recent common ancestor in their shared ancestral species in the time between the first and the second speciation event

or they can have a most recent common ancestor further back in time than the first speciation event, in which case an "individual" from the third species might share a more recent common ancestor with one of them.

Just to avoid confusion, when I say "individual" I don't actually mean individual (which is why I put the word in quotes).  There are no present day humans more related to chimps than others -- although you sometimes get that impression.

The time since the speciation event is such that all humans (or chimps or gorillas) will share common ancestors much more recent than the speciation events.

The process involves recombinations, however, so if we trace a single individual's genealogy back in time, the nucleotides will split apart and join up again in a stochastic process,

and at the time of the speciation event they will be distributed on a number of different chromosomes ("individuals")

and it is these DNA chunks that can end up having different topologies than the species topology.

Different segments of the genome will have different divergence times and possibly different topologies.

When we talk about gene trees (in contrast to species trees), we are talking about the trees for the individual segments of our genome, and when they differ significantly from the species tree (in either branch lengths or topology) inferring the species tree can be problematic.
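How often the topology differs depends on the time between the two speciation events relative to the ancestral population size.  For three species and one sampled lineage per species, the standard neutral coalescent result is that the gene tree disagrees with the species tree with probability (2/3) * exp(-T / (2 * Ne)), where T is the number of generations between the two speciation events and Ne is the diploid effective size of the ancestor of the two closest species.  A minimal sketch, with hypothetical parameter values:

```python
import math

def discordance_probability(generations_between_splits, ancestral_ne):
    """Probability that a three-species gene tree differs from the species tree.

    Assumption: neutral coalescent, one lineage per species, constant diploid
    size ancestral_ne between the two speciation events.  The two lineages
    fail to coalesce in that interval with probability exp(-T / (2 * Ne));
    if they fail, all three topologies are equally likely further back, and
    two of the three disagree with the species tree.
    """
    p_no_coalescence = math.exp(-generations_between_splits / (2.0 * ancestral_ne))
    return (2.0 / 3.0) * p_no_coalescence

# Hypothetical numbers: 100,000 generations between the splits.
for ne in (10_000, 50_000, 100_000):
    print(ne, round(discordance_probability(100_000, ne), 3))
```

A larger ancestral population or a shorter interval between the speciation events both push the probability up towards its maximum of 2/3.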

Inferring species trees and gene trees

The two applications that I used as an excuse for writing this post concern inferring species trees from gene trees, or jointly with gene trees.  Both take statistical approaches; one Bayesian, the other Maximum Likelihood.

The first method, BEST (Liu 2008), jointly estimates gene trees and the species tree from alignments.  The idea is that the species tree puts constraints on the coalescence times of the gene trees (they must be compatible with the species tree, so two species in a gene tree do not join up more recently than the speciation event, and the distribution of the tree is given by the underlying coalescence process) and conversely the gene trees put constraints on the species tree (the same constraint about coalescence times), so you can sample one tree while keeping the other fixed, and then use an MCMC framework to sample over trees.

This way you can sample over the posterior probability of both species trees and gene trees.  The process is somewhat time consuming, so probably not practical for genome wide analysis, but nice in its (relative) simplicity nonetheless.

The other tool, STEM (Kubatko et al. 2009), takes a set of gene trees as input and estimates the species tree in a Maximum Likelihood approach.  Again this is done by considering the constraints that the gene trees put on the species tree (together with the underlying coalescence process, of course).

One weakness in both methods is the assumption that the gene trees correspond to true underlying coalescence trees.  This is unlikely to be true for real gene trees for two main reasons:  First, the gene trees are inferred and therefore can be incorrect, and second, in a coalescence process with recombination (the process where incomplete lineage sorting occurs) it is unlikely that recombination events only occur between and not within the regions used to infer the gene trees.

The first problem, that the gene trees can be incorrectly inferred, is less of a problem for BEST, since it jointly infers the trees, so sampling an incorrect tree from time to time can be corrected through the MCMC run.  I could imagine it being more of a problem for STEM.

The second problem, I think, is a major problem for both.  There are two "sub-issues" here.  One, they assume that there is no recombination within a gene, and second, that different genes are independent (essentially have enough recombination between them that they are in linkage equilibrium).

If you only consider genes far apart, the second assumption is probably not much of a problem, but it does mean that the method cannot scale to whole genome analysis, even if it was computationally feasible, since you cannot have genes close to each other without them being at least slightly correlated.

The first issue is more serious, I think.  If you consider a DNA segment long enough that you can reliably infer its genealogy, it is unlikely that there are no recombinations within that segment, and those are as likely to give you different coalescence times and different topologies as the recombinations between the genes.

The problem with that is that if you infer a single topology for a region that really has more, you are unlikely to recover any meaningful genealogy.

I did some simulations of this a while back, and the inferred genealogy can be really far from any of the true genealogies in the segment.  Those were simulations with lots of recombinations, though, so how serious it is for the cases they consider, I wouldn't know.

I plan to look into it, though, when I get the time... which won't be any time soon, unfortunately, since I am pretty swamped in other projects right now.

Citations

  1. L. Liu (2008). BEST: Bayesian estimation of species trees under the coalescent model Bioinformatics, 24 (21), 2542-2543 DOI: 10.1093/bioinformatics/btn484
  2. L. S. Kubatko, B. C. Carstens, L. L. Knowles (2009). STEM: Species Tree Estimation using Maximum likelihood for gene trees under coalescence Bioinformatics DOI: 10.1093/bioinformatics/btp079

--


How do you calibrate the molecular clock?

Thursday, May 29th, 2008

How do you calibrate the molecular clock -- where you need a few known sequence divergence times -- when you only know a few speciation times?

Yesterday at a meeting (I'm not sure I can tell you which meeting; I'm not sure how open it is supposed to be :-/) we discussed the divergence time of human-orangutan and human-macaque. We need the sequence divergence time to calibrate a CoalHMM model for figuring out some speciation and population genetics parameters of ancestral species.

No definitive answer came up at the meeting, but there was a short discussion by email after the meeting. This paper was sent around, where the human-macaque and human-orangutan divergence times were estimated at 25MYA and 13MYA, respectively, although the last of those numbers is actually the calibration point used in the analysis, so it is an assumption more than an estimate.

The problem is, the 13MYA used for the calibration is based on fossil evidence, and as far as I can see, that would make it an estimate for the speciation time between human and orangutan. We need the sequence divergence time. Speciation time and divergence time can differ by millions of years (if the effective population size is large enough).

If 13MYA is the divergence time between human and orangutan, we get a speciation time that is unrealistically recent.  If the divergence time is 18MYA instead, as we assumed in this paper, we would get a speciation time around 12MYA which would match the MBE paper.
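The arithmetic behind that gap is short.  Under a simple neutral coalescent, the expected difference between sequence divergence time and speciation time is about 2 * Ne generations, so an assumed divergence of 18MYA and a speciation of 12MYA imply a particular ancestral effective population size once you fix a generation time.  The 20-year generation time below is purely an assumption for illustration, not a value from the papers mentioned here.

```python
def implied_ancestral_ne(divergence_mya, speciation_mya, generation_years):
    """Ancestral Ne for which the expected coalescence lag matches the gap.

    Assumption: expected lag between speciation and sequence divergence is
    2 * Ne generations (neutral coalescent, diploid population).
    """
    gap_years = (divergence_mya - speciation_mya) * 1e6
    return gap_years / (2.0 * generation_years)

# Generation time of 20 years is a hypothetical choice.
ne = implied_ancestral_ne(divergence_mya=18, speciation_mya=12, generation_years=20)
print(ne)  # 150000.0
```

So a six-million-year gap corresponds to an ancestral effective population size in the low hundreds of thousands under these assumptions, which shows how sensitive the calibration is to which of the two times the fossil date is taken to represent.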

But how do you figure out the divergence time needed to calibrate the clock?  Is there any way to get it, rather than the speciation time, from fossil evidence?

For our purposes, I suppose we can just as well work with speciation times for our calibration, but not everyone is using CoalHMMs for their analysis, so how do you deal with this problem?