Widespread genomic signatures of natural selection in hominid evolution
Friday last week, PLoS Genetics published a paper I've been waiting to read for a few weeks, since I saw a reference to it in a draft of a review paper I got by email (that paper I'll tell you all about when it comes out).
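The PLoS Genetics paper is McVicker et al. (2009), "Widespread Genomic Signatures of Natural Selection in Hominid Evolution"; the full reference is at the end of this post.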
The reason I've been waiting for the paper is that it concerns something I am very interested in myself, and something we are working on in our CoalHMM group here at BiRC: detecting selection by detecting variation in effective population size along the genome.
Effective population size
Okay, the concept "effective population size" is a strange beast. It doesn't really have anything to do with population size, except in an idealised mathematical model. Instead it is a single parameter that absorbs the effects of several different forces, such as demography and selection.
There's a nice introduction to it in this John Hawks post: Did humans face extinction 70,000 years ago?
As described there, one way of looking at the effective population size is to define it from the average coalescence time of two random individuals in a population. If we look at it that way, it is clear that selection will affect the effective population size.
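As a rough sketch of that definition: in an idealised diploid Wright-Fisher population of size N, two random sequences coalesce on average 2N generations back, so an observed mean coalescence time (or a pairwise diversity, given a mutation rate) can be turned back into an "effective" N. The function names and the numbers below are my own illustration, not taken from the paper or from the Hawks post.

```python
def ne_from_coalescence_time(mean_generations):
    """Effective size implied by a mean pairwise coalescence time.

    Assumes the idealised diploid Wright-Fisher model, in which two
    sequences coalesce on average 2 * N generations back in time.
    """
    return mean_generations / 2.0


def ne_from_diversity(pi, mu):
    """Effective size implied by pairwise nucleotide diversity pi.

    Assumes neutrality and the infinite-sites approximation, where
    pi = 4 * Ne * mu (mu is the mutation rate per site per generation).
    """
    return pi / (4.0 * mu)


# Hypothetical illustrative numbers, not estimates from the paper.
print(ne_from_coalescence_time(20_000))
print(ne_from_diversity(1e-3, 2.5e-8))
```

Anything that shortens the average coalescence time, selection included, shows up as a smaller value from either function, whatever the census population size happens to be.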
A site under selection, if it gets fixed, will do so much faster than a site that is neutral. A neutral site that gets fixed does so (on average) in time linear in the effective population size, while a site under selection does so in logarithmic time (regardless of whether it is positive or negative selection, surprisingly, but of course if it is negative selection the probability of it getting fixed is smaller).
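To get a feel for "linear versus logarithmic", here is a small sketch using two standard diffusion approximations for the mean time to fixation, conditional on fixation, in a diploid population: roughly 4N generations for a neutral mutation, and roughly (2/s) ln(2N) generations for a mutation with selection coefficient s. These are textbook approximations that I am plugging in for illustration (the second assumes additive selection with 2N|s| much larger than 1, and the exact constants depend on how s is parameterised). The function names and parameter values are hypothetical.

```python
import math


def neutral_fixation_time(n):
    """Approximate mean fixation time (generations) of a neutral
    mutation, conditional on fixation: linear in n."""
    return 4.0 * n


def selected_fixation_time(n, s):
    """Approximate mean fixation time (generations) of a selected
    mutation, conditional on fixation: logarithmic in n.

    Only the magnitude of s enters, which mirrors the point that the
    conditional time is the same for positive and negative selection.
    """
    return 2.0 * math.log(2.0 * n) / abs(s)


# Hypothetical population sizes and selection coefficient.
for n in (1_000, 10_000, 100_000):
    print(n, neutral_fixation_time(n), round(selected_fixation_time(n, 0.01)))
```

A tenfold increase in N multiplies the neutral time by ten, but only adds a constant number of generations to the selected time.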
If we consider a site where mutations occur that are selected against, but these are not fixed, we still see a reduction in the coalescence time of two random individuals, but for a different reason: those ancestors that were selected against do not have descendants in the present population, so the number of possible ancestors of two random individuals is smaller, and when we trace their ancestry back in time they will find a common ancestor faster.
So in any case, if a site is under selection, we expect the mean time back to a common ancestor -- the effective population size -- to be reduced.
To muddy the waters a little bit: effective population size also affects selection, since selection is more effective relative to drift when the population is large, but that is a complication best left for another day...
Recombination has an effect on this as well.
A site under selection will have a smaller effective population size, but so will nearby sites. The reason for this is that neighbouring nucleotides are likely to have the same most recent common ancestor -- and thus the same divergence -- with this probability depending on the recombination distance between them.
Consequently, we expect the effective population size to decrease as we move towards a site under selection, and increase again as we move away from it.
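One way to make this concrete is the classic background selection approximation from the population genetics literature, in which the effective population size at a neutral site is scaled by a factor B = exp(-sum_i u_i t / (t + r_i (1 - t))^2). Here u_i is the deleterious mutation rate at selected site i, t is the selection coefficient against heterozygotes, and r_i is the recombination fraction between site i and the neutral site. This is only a sketch of the general idea as I recall the formula, not the fitting procedure used in the paper, and every parameter value below is made up (and exaggerated) so that the effect is visible.

```python
import math


def background_selection_b(neutral_pos, selected_sites, u, t, r_per_bp):
    """Relative effective population size B (0 < B <= 1) at neutral_pos.

    selected_sites: positions (bp) of sites under purifying selection
    u:        deleterious mutation rate per selected site (hypothetical)
    t:        selection coefficient against heterozygotes (hypothetical)
    r_per_bp: recombination fraction per base pair (hypothetical)
    """
    exponent = 0.0
    for pos in selected_sites:
        # Recombination fraction grows with distance, capped at 0.5.
        r = min(0.5, abs(pos - neutral_pos) * r_per_bp)
        exponent += u * t / (t + r * (1.0 - t)) ** 2
    return math.exp(-exponent)


# A hypothetical conserved element covering positions 0..999.
conserved = range(1000)
for distance in (0, 1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    b = background_selection_b(999 + distance, conserved,
                               u=1e-6, t=0.01, r_per_bp=1e-8)
    print(distance, round(b, 4))
```

B is lowest right next to the conserved element and climbs back towards 1 as the recombination distance grows, which is the dip-and-recover pattern described above.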
It is this kind of pattern that McVicker et al. analyse in this paper.
First they identify conserved genomic regions. These are the regions that are probably under selection, since selection is one of the forces that will conserve sequences.
They do this by running a phylo-HMM on an alignment of mammals (excluding those they will analyse later on, to avoid biasing the results).
They then split the genome into two classes: those nucleotides within the 10% of the genome closest to a conserved region, and the 50% furthest away. In these two classes they look at the level of polymorphism in humans, the divergence between human and chimp, and the number of informative sites supporting a grouping of human with gorilla -- with chimp as an outgroup -- and those grouping chimp with gorilla -- with human as an outgroup. The latter are signs of deep coalescence resulting in incomplete lineage sorting, and signs of a large effective population size in the human/chimp ancestor.
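The informative-site counts can be sketched like this. Following the figure caption, I take an HG site to be an alignment column where human and gorilla share one nucleotide and the other three species share a different one, and likewise for CG (and HC, for the sites that agree with the species tree). The five-species ordering and the toy columns are my assumptions for illustration; the paper's actual filtering is more involved.

```python
from collections import Counter

# Assumed column order for the five-species alignment (hypothetical).
SPECIES = ("human", "chimp", "gorilla", "orangutan", "macaque")

PAIRS = (
    ("HC", ("human", "chimp")),
    ("HG", ("human", "gorilla")),
    ("CG", ("chimp", "gorilla")),
)


def classify_column(column):
    """Return 'HC', 'HG', 'CG', or None for one alignment column."""
    if len(column) != len(SPECIES) or set(column.upper()) - set("ACGT"):
        return None  # wrong width, gaps, or ambiguous bases
    bases = dict(zip(SPECIES, column.upper()))
    for label, pair in PAIRS:
        inside = {bases[s] for s in pair}
        outside = {bases[s] for s in SPECIES if s not in pair}
        # The pair shares one base; the other three share a different one.
        if len(inside) == 1 and len(outside) == 1 and inside != outside:
            return label
    return None


def count_informative_sites(columns):
    """Count HC, HG and CG sites over an iterable of alignment columns."""
    labels = (classify_column(col) for col in columns)
    return Counter(label for label in labels if label)


# Toy columns, not real data.
toy = ["AAGGG", "AGAGG", "GAAGG", "AAAAA", "AAGGG", "ACGTN"]
print(count_informative_sites(toy))
```

With counts like these in hand for the "near" and "far" classes, a lower share of HG and CG sites near conserved regions would point to less incomplete lineage sorting there, and hence a smaller ancestral effective population size.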
For all measures, they find that the effective population size seems to be reduced for the 10% closest to conserved regions compared to the 50% farthest away.
Since the measures are essentially all just measures of conservation, really, that isn't in itself much of an argument. All it says is that there is a correlation of conservation-ness along the genome. To compensate for this, they then normalise with the divergence to macaque and to dog. If it is just a reduction in substitution rate that is correlated, then normalising this way -- assuming that the substitution rate doesn't change dramatically along the genome and along the phylogeny -- will alleviate the effect from just the substitution rate.
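The normalisation amounts to a ratio of ratios: divide the near/far ratio of the statistic of interest by the near/far ratio of divergence to the outgroup. A minimal sketch, with made-up numbers purely to show the arithmetic (these are not values from the paper, and the function names are mine):

```python
def near_far_ratio(near, far):
    """Ratio of a statistic in the 'near' class to the 'far' class."""
    return near / far


def normalised_ratio(stat_near, stat_far, outgroup_near, outgroup_far):
    """Near/far ratio of a statistic, divided by the near/far ratio of
    divergence to an outgroup (e.g. macaque or dog)."""
    return (near_far_ratio(stat_near, stat_far)
            / near_far_ratio(outgroup_near, outgroup_far))


# Hypothetical diversity and outgroup divergence values.
raw = near_far_ratio(0.0008, 0.0010)
norm = normalised_ratio(0.0008, 0.0010, 0.057, 0.060)
print(round(raw, 3), round(norm, 3))
```

If a lower substitution rate near conserved regions were the whole story, the normalised ratio would come out close to 1; whatever stays below 1 is the part the substitution rate alone does not explain.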
After normalising, the signal is still there: the polymorphism and divergence are still reduced close to conserved regions.
Again, this doesn't prove that selection is the cause of this pattern, but the pattern certainly matches what we would expect to see if it was selection that caused it. The normalisation should eliminate, or at least reduce, effects that are just caused by the substitution rate, so unless we invoke some more exotic explanation for conservation and the patterns along the genome, selection is a valid conclusion.
(A) Ratios calculated using the 10% of neutral sites which are nearest to and the 50% of neutral sites farthest away from conserved segments or exons. (B) The same ratios as (A) but normalized by human/macaque (H/M) divergence to account for mutation rate variation or undetected sites under purifying selection. The distance to the nearest conserved segment or exon was determined using four different measures: physical distance, pedigree-based recombination distance, polymorphism-based finescale recombination distance and the background selection parameter, B. B (described in the main text) is not technically a distance measure but incorporates information about the recombination rate and local density of conserved segments. Autosomal human nucleotide diversity was calculated from gene-centric SeattleSNPs PGA/EGP, whole-genome Perlegen data, and HapMap phase II data. Divergence was estimated using autosomal human/chimp (H/C), human/macaque (H/M), or human/dog (H/D) genome sequence data. HG and CG sites (where human and gorilla or chimp and gorilla share a nucleotide that differs from the other three species) were calculated using a smaller set of 5-species autosomal data. Repetitive regions were omitted from the Perlegen and HapMap analyses; additional filtering steps are described in the methods. Whiskers are 95% confidence intervals.
Now that selection is concluded to be a plausible explanation for the pattern, they fit the data to a model that explains the variation by background selection. This model shows that selection is stronger near conserved regions than farther away, consistent with the assumption that the pattern is caused by selection.
So what does all this tell us?
For one thing, it tells us that selection is a force we really should keep in mind when analysing genomes. Yes, yes, we probably already knew that, but the neutrality assumption is so strong in genome analysis that we rarely consider non-neutrality except for the obligatory dN/dS tests on genes. For anything that is not a gene, we usually analyse the sequences assuming neutrality. It is a good null model, but completely ignoring selection when analysing genomic sequences should be reconsidered.
I know, I am overstating things a bit here, 'cause people are not just blindly assuming neutrality, but it is a strong null assumption and we really do not like to invoke selection unless there is strong evidence against neutrality.
Another consequence is for sequence divergence.
We estimate species divergence (time of speciation events) from sequence divergence. More often than not we equate sequence divergence with species divergence, but really we shouldn't. Even under neutrality the two are not the same, since the coalescence process of sequences is such that the sequences are further apart than the species, but under neutrality at least this pattern is random along the genome.
There is still some correlation along the sequence of divergence time, under a neutral coalescence model, but at least this correlation drops off rapidly with (recombination) distance and it is not correlated with other genomic features (except in the sense that the substitution rate depends on these features).
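To put a number on "further apart": under a simple neutral model with a constant-size ancestral population, two sequences sampled from sister species coalesce on average 2N generations before the speciation event, where N is the ancestral effective population size. A minimal sketch follows; the function name, the split time, the generation time and the population sizes are hypothetical placeholders, not estimates from the paper.

```python
def expected_sequence_divergence_time(species_split_years,
                                      ancestral_ne,
                                      generation_years):
    """Expected time (years) back to the common ancestor of two
    sequences, one from each of two sister species.

    Assumes a neutral, constant-size diploid ancestral population, so
    the lineages need on average 2 * ancestral_ne generations to
    coalesce once they are both in the ancestral species.
    """
    return species_split_years + 2.0 * ancestral_ne * generation_years


# Hypothetical placeholder values.
for ne in (10_000, 50_000, 100_000):
    print(ne, expected_sequence_divergence_time(5_000_000, ne, 20))
```

A region with a smaller local effective population size, for example one close to a site under selection, gets a smaller excess over the species split, so the gap between sequence divergence and species divergence stops being uniform along the genome.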
With selection working its magic on a genome scale, the patterns of sequence divergence get a lot more interesting.
All of this is not really a new insight. People working with e.g. Drosophila have known this for decades, but it has been ignored in more papers than I care to mention, and perhaps it is time we stop doing this.
McVicker, G., Gordon, D., Davis, C., & Green, P. (2009). Widespread Genomic Signatures of Natural Selection in Hominid Evolution. PLoS Genetics, 5(5). DOI: 10.1371/journal.pgen.1000471