Posts Tagged ‘Selection’

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Wednesday, September 30th, 2009

Two of my main interests are hidden Markov models and selection.  A paper from this spring, in Genetics, combines the two:

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Boitard, Schlötterer and Futschik

Detecting and localizing selective sweeps on the basis of SNP data has recently received considerable attention. Here we introduce the use of hidden Markov models (HMMs) for the detection of selective sweeps in DNA sequences. Like previously published methods, our HMMs use the site frequency spectrum, and the spatial pattern of diversity along the sequence, to identify selection. In contrast to earlier approaches, our HMMs explicitly model the correlation structure between linked sites. The detection power of our methods, and their accuracy for estimating the selected site location, is similar to that of competing methods for constant size populations. In the case of population bottlenecks, however, our methods frequently showed fewer false positives.

Selective sweeps

Under a simple Wright-Fisher model, a neutral mutation that has just been introduced into a population will drift up and down in frequency until it is eventually either fixed in the population, which happens with probability \(\frac{1}{2N_e}\), or lost from the population again, which of course happens with probability \(1-\frac{1}{2N_e}\).
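The fixation probability is easy to check by simulation. Below is a minimal sketch, assuming a diploid Wright-Fisher population with \(2N_e\) gene copies and a mutation that starts as a single copy; the function name and parameters are my own, chosen for illustration.

```python
import random


def neutral_fixation_probability(n_e, replicates, seed=1):
    """Estimate the fixation probability of a single new neutral mutation.

    Assumes a diploid Wright-Fisher population with 2 * n_e gene copies.
    """
    rng = random.Random(seed)
    copies = 2 * n_e
    fixed = 0
    for _ in range(replicates):
        count = 1  # the mutation starts as a single copy
        while 0 < count < copies:
            freq = count / copies
            # Binomial sampling of the next generation, one gene copy at a time.
            count = sum(1 for _ in range(copies) if rng.random() < freq)
        if count == copies:
            fixed += 1
    return fixed / replicates
```

Calling `neutral_fixation_probability(10, 20000)` should give an estimate close to 1/20. The population is kept tiny on purpose so the pure-Python sampling stays fast.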

The expected time from when such a mutation is introduced into the population until it is fixed, if it is lucky enough to be fixed, is about \(4N_e\) generations.  During this time, the descendant chromosomes of the original mutant chromosome will be subjected to new mutations and to recombinations.

Once this mutation is fixed, everyone in the population will of course share that particular mutation (ignoring back-mutations and such here), but because of recombination, nearby sites will not necessarily all be derived from the original mutant chromosome.  Close to the mutation site -- where few recombinations will have broken up the sequence -- most chromosomes will be derived from the mutant chromosome, and as we move away from the mutation site fewer chromosomes will be derived from that original chromosome.

Now, if the mutation introduced has a selective advantage, essentially the same process will play out.  In each generation there is a slightly higher chance that this mutation will leave offspring, but that is essentially the only difference.

What this means is that initially there is still a very good chance that the mutation will be lost -- even with slightly better odds accidents do happen -- but once the mutation has reached a reasonable frequency it is almost guaranteed to reach fixation -- unless a lot of accidents happen.

Once the frequency of the site under selection is high enough it will very quickly reach fixation.  The expected time it takes depends on the selection strength but unless the selective advantage is very small it will reach fixation a lot faster than if it was neutral.  Think logarithmic time in the size of the population compared to linear time.
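To put rough numbers on "logarithmic versus linear", here is a small sketch that ignores drift entirely and just follows a favoured allele deterministically from frequency \(1/(2N_e)\) to \(1-1/(2N_e)\). The recursion \(x' = x(1+s)/(1+sx)\) and its continuous-time counterpart \((2/s)\ln(2N_e-1)\) are textbook simplifications that I am assuming here, not something taken from the paper.

```python
import math


def deterministic_sweep_generations(n_e, s):
    """Count generations for a favoured allele to go from 1/(2N_e) to 1 - 1/(2N_e).

    Deterministic recursion x' = x(1 + s) / (1 + s x); drift is ignored.
    """
    x = 1 / (2 * n_e)
    target = 1 - 1 / (2 * n_e)
    generations = 0
    while x < target:
        x = x * (1 + s) / (1 + s * x)
        generations += 1
    return generations


def logistic_sweep_time(n_e, s):
    """Closed-form continuous-time counterpart: (2 / s) * ln(2N_e - 1)."""
    return 2 / s * math.log(2 * n_e - 1)
```

With \(N_e = 10000\) and \(s = 0.01\) both functions give roughly two thousand generations, compared to the standard diffusion approximation of about \(4N_e = 40000\) generations for a neutral mutation that fixes.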

Since it reaches fixation much faster than a neutral mutation, fewer mutations and fewer recombinations will have time to occur, so a much wider region around the mutation site will be shared by all descendant chromosomes.  Combined, this means that for a selected site you expect a wide region with a more recent shared ancestor than you would expect at a neutral site, a phenomenon called a selective sweep.

Site frequency spectra

Now, from the population genetics model you can work out -- by putting your thinking hat on or just by simulating -- the expected frequency distribution of derived and ancestral alleles: the site frequency spectrum.  This will differ between neutral and selected regions because of the shorter time back to the common ancestor around selected sites.  The shorter time means that there is a general reduction in polymorphism near a selected site, and derived alleles that appeared on chromosomes with the beneficial mutation will be at a higher frequency than they would be if they weren't "hitchhiking" on the selection of the beneficial mutation.
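For the neutral case the expectation is simple enough to write down directly. Under the standard neutral model with constant population size, the expected number of sites where the derived allele appears in \(i\) out of \(n\) sampled chromosomes is \(\theta/i\). A minimal sketch (the function name is mine):

```python
def neutral_sfs(sample_size):
    """Expected proportion of segregating sites with i derived copies, i = 1..n-1.

    Standard neutral model: the expected count in class i is theta / i,
    so the proportions do not depend on theta.
    """
    weights = [1 / i for i in range(1, sample_size)]
    total = sum(weights)
    return [w / total for w in weights]
```

For a sample of five chromosomes, `neutral_sfs(5)` puts 48% of the segregating sites in the singleton class, with the proportions falling off as \(1/i\). A sweep distorts this shape, and that distortion is what the methods below look for.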

The pattern is a bit complicated by recombination, since you need to take into account that the further away from the selected site you look, the weaker the hitchhiking effect will be; a new mutation can only hitchhike as long as it is linked to the selected site, and recombinations break that link.

Anyway, the different spectra of derived and ancestral alleles can be used to detect selective sweeps.  Two methods that exploit this, and that are relevant for this post, are Kim and Stephan (2002) and Nielsen et al. (2005).

Of course, selection is not the only thing that can mess up the site frequency spectrum and make it different from the expected neutral distribution.  Demographic effects like expanding populations and bottlenecks can look very similar to selection effects, so we cannot absolutely rule out neutrality if we see a deviation from the expected spectrum.  Still, the site frequency spectra of neutrality versus selection can be used for scanning for selection.

Detecting sweeps in a hidden Markov model

The new result in the Genetics paper is a hidden Markov model that uses site frequency spectra to scan for selective sweeps.

Using an HMM means that the model can capture spatial patterns along a genome and capture transitions between "neutral" regions -- where no sweep has occurred or is occurring -- and "selected" regions -- where a sweep occurred or is occurring.  So you don't have to assume that a locus you are looking at is either a neutral region or a selected region, and you don't have to fiddle around with sliding windows to scan a genome; you explicitly capture the changing patterns.

This is one of the nice properties of HMMs for genomic scans, and the reason I love them so much.

The model Boitard et al. develop is quite simple.  They have three states: a neutral state, a selected state, and an intermediate state used to capture sites that are slightly caught up in the hitchhiking but not close enough to a selected site to get the full effect.

The transition matrix has a single parameter, \(p\), that is the probability that a neutral or selected site switches to the intermediate state (and the intermediate state switches to those two with equal probability set to \(p/2\)).

\[T=\begin{pmatrix}1-p&p&0\\ p/2&1-p&p/2\\ 0&p&1-p\end{pmatrix}\]

This of course has the unfortunate effect that the prior distribution (stationary distribution) of the chain will give you a 25% chance of a site being neutral, a 25% chance of it being selected and a 50% chance of it being intermediate, which doesn't really match my expectation of the amount of selection in, say, a human genome. Also, the (prior) expected length of a swept region is the same as that of a neutral region, which also does not match my intuition.  With enough data, though, the likelihood should overrule the prior so perhaps it is not too much of a worry...
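The 25/50/25 split and the equal run lengths follow directly from the matrix \(T\). A quick numerical check, using plain power iteration with the states ordered neutral, intermediate, selected (the helper names are mine):

```python
def transition_matrix(p):
    """Rows and columns ordered neutral, intermediate, selected."""
    return [
        [1 - p, p, 0.0],
        [p / 2, 1 - p, p / 2],
        [0.0, p, 1 - p],
    ]


def stationary_distribution(matrix, iterations=10000):
    """Power iteration: repeatedly apply pi <- pi * T from a uniform start."""
    n = len(matrix)
    pi = [1 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


def expected_run_length(p):
    """Mean number of consecutive sites spent in a state that is left with probability p."""
    return 1 / p
```

Every state is left with total probability \(p\), so the expected run length is \(1/p\) sites for neutral, intermediate and selected stretches alike, and the stationary distribution is the same for any \(0 < p < 1\).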

The emissions of the model are frequencies of derived alleles, so for each site it will emit a frequency that depends on the state.  This is where they capture the different expected frequencies depending on whether a site is neutral or selected.

They use the Kim and Stephan and the Nielsen et al. methods for this, and develop three variations of HMMs: HMMA, using Kim and Stephan; HMMB, using Nielsen et al.; and HMMB-SEQ, which also uses Nielsen et al. but only considers segregating sites.  The latter is only for comparison purposes and of course ignores a lot of the information in the data, since the number of non-segregating sites reflects the general level of polymorphism in a region, which in turn depends on the depth of the local genealogy and will be affected by selection.
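To make the decoding step concrete, here is a toy Viterbi sketch over the three-state chain. The emission table is entirely made up: I bin derived allele frequencies into low, mid and high, and assign invented probabilities per state. The paper's actual emissions come from the Kim and Stephan and the Nielsen et al. likelihoods, which this sketch does not attempt to reproduce.

```python
import math

STATES = ("neutral", "intermediate", "selected")

# Hypothetical emission probabilities over three derived-allele-frequency bins
# (low, mid, high). These numbers are invented for illustration only.
TOY_EMISSIONS = [
    [0.7, 0.2, 0.1],  # neutral
    [0.5, 0.2, 0.3],  # intermediate
    [0.3, 0.1, 0.6],  # selected
]


def safe_log(x):
    return math.log(x) if x > 0 else float("-inf")


def viterbi(observations, p, emissions=TOY_EMISSIONS):
    """Most likely state path (as indices into STATES) for binned observations."""
    trans = [
        [1 - p, p, 0.0],
        [p / 2, 1 - p, p / 2],
        [0.0, p, 1 - p],
    ]
    start = [0.25, 0.5, 0.25]  # stationary distribution of the chain
    n_states = len(STATES)
    scores = [safe_log(start[k]) + safe_log(emissions[k][observations[0]])
              for k in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = [], []
        for k in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda j: scores[j] + safe_log(trans[j][k]))
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + safe_log(trans[best_prev][k])
                              + safe_log(emissions[k][obs]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: scores[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Because the neutral-to-selected entries of the transition matrix are zero, any decoded path has to pass through the intermediate state when it moves between neutral and selected stretches.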

They use simulations under neutrality to fix the parameter \(p\) so they get a 5% false positive rate, and then use the models to scan for sweeps.

They get okay power for detecting sweeps, but compared to the previous methods they don't gain that much, since those did pretty well too:

Table 1

Where they refer to this table in the paper they say they have a higher power, but compared to the CLsw column, Kim and Stephan's method, they do not.  After all, it is difficult to beat a power of 1.

They do, however, appear to be more robust to bottlenecks where the two other methods have very high false positive rates:

Table 5

--
Boitard, S., Schlötterer, C., & Futschik, A. (2009). Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models. Genetics, 181 (4), 1567-1578. DOI: 10.1534/genetics.108.100032

Widespread genomic signatures of natural selection in hominid evolution

Tuesday, May 12th, 2009

Friday last week, PLoS Genetics published a paper I've been waiting to read for a few weeks, since I saw a reference to it in a draft of a review paper I got by email (that paper I'll tell you all about when it comes out).

The PLoS Genetics paper is this:

Widespread Genomic Signatures of Natural Selection in Hominid Evolution

Graham McVicker, David Gordon, Colleen Davis, and Phil Green

Selection acting on genomic functional elements can be detected by its indirect effects on population diversity at linked neutral sites. To illuminate the selective forces that shaped hominid evolution, we analyzed the genomic distributions of human polymorphisms and sequence differences among five primate species relative to the locations of conserved sequence features. Neutral sequence diversity in human and ancestral hominid populations is substantially reduced near such features, resulting in a surprisingly large genome average diversity reduction due to selection of 19–26% on the autosomes and 12–40% on the X chromosome. The overall trends are broadly consistent with “background selection” or hitchhiking in ancestral populations acting to remove deleterious variants. Average selection is much stronger on exonic (both protein-coding and untranslated) conserved features than non-exonic features. Long term selection, rather than complex speciation scenarios, explains the large intragenomic variation in human/chimpanzee divergence. Our analyses reveal a dominant role for selection in shaping genomic diversity and divergence patterns, clarify hominid evolution, and provide a baseline for investigating specific selective events.

The reason I've been waiting for the paper is that it concerns something I am very interested in myself, and something we are working on in our CoalHMM group here at BiRC: detecting selection by detecting variation in effective population size along the genome.

Effective population size

Okay, the concept "effective population size" is a strange beast.  It doesn't really have anything to do with population size, except in an idealised mathematical model, but is a single parameter that incorporates various different effects such as demography and selection.

There's a nice introduction to it in this John Hawks post: Did humans face extinction 70,000 years ago?

As described there, one way of looking at the effective population size is to define it from the average coalescence time of two random individuals in a population.  If we look at it that way, it is clear that selection will affect the effective population size.

A site under selection, if it gets fixed, will do so much faster than a site that is neutral.  A neutral site that gets fixed does so (on average) in time linear in the effective population size, while a site under selection does so in logarithmic time (regardless of whether it is positive or negative selection, surprisingly, but of course if it is negative selection the probability of it getting fixed is smaller).

If we consider a site where mutations occur that are selected against, but these are not fixed, we still see a reduction in the coalescence time of two random individuals, but for a different reason: those ancestors that were selected against do not have descendants in the present population, so the number of possible ancestors of two random individuals is smaller, and when we trace their ancestry back in time, they will find a common ancestor faster.

So in any case, if a site is under selection, we expect the mean time back to a common ancestor -- the effective population size -- to be reduced.
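In symbols, and assuming the standard neutral coalescent for a diploid population (my shorthand, not notation from the paper): two lineages coalesce with probability \(\frac{1}{2N_e}\) per generation, so the expected coalescence time \(T_2\) of a random pair satisfies

\[E[T_2] = 2N_e \quad\Longleftrightarrow\quad N_e = \frac{E[T_2]}{2}\]

and with a mutation rate of \(\mu\) per site per generation, the expected pairwise diversity is

\[\pi = 2\mu E[T_2] = 4N_e\mu\]

So anything that shortens the mean coalescence time at a locus shows up as a lower \(N_e\) and, for a fixed mutation rate, as lower diversity.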

To muddy the waters a little bit: effective population size also affects selection since selection is stronger if the population size is large but that is a complication best left for another day...

Recombination

Recombination has an effect on this as well.

A site under selection will have a smaller effective population size, but so will nearby sites.  The reason for this is that neighbouring nucleotides are likely to have the same most recent common ancestor -- and thus the same divergence -- with this probability depending on the recombination distance between them.

Consequently, we expect the effective population size to decrease as we move towards a site under selection, and increase again as we move away from it.

It is this kind of pattern that McVicker et al. analyse in this paper.

Results

First they identify conserved genomic regions.  These are the regions that are probably under selection, since selection is one of the forces that will conserve sequences.

They do this by running a phylo-HMM on an alignment of mammals (excluding those they will analyse later on to avoid biasing the results).

They then split the genome into two classes: those nucleotides within the 10% of the genome closest to a conserved region, and the 50% furthest away.  In these two classes they look at the level of polymorphism in humans, the divergence between human and chimp, and the number of informative sites supporting a grouping of human with gorilla -- with chimp as an outgroup -- and those grouping chimp with gorilla -- with human as an outgroup.  The latter are signs of deep coalescence resulting in incomplete lineage sorting, and signs of a large effective population size in the human/chimp ancestor.

For all measures, they find that the effective population size seems to be reduced for the 10% closer to conserved regions compared to those 50% farthest away.

Since the measures are essentially all just measures of conservation, really, that isn't in itself much of an argument.  All it says is that there is a correlation of conservation-ness along the genome.  To compensate for this, they then normalise with the divergence to macaque and to dog.  If it is just a reduction in substitution rate that is correlated, then normalising this way -- assuming that the substitution rate doesn't change dramatically along the genome and along the phylogeny -- will alleviate the effect from just the substitution rate.

After normalising, the signal is still there: the polymorphism and divergence are still reduced close to conserved regions.
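As I read it, the comparison boils down to a ratio of averages between the nearest 10% and the farthest 50% of sites, before and after dividing by an outgroup divergence. The sketch below is my own reconstruction of that idea on a hypothetical per-site table of (distance, diversity, divergence); it is not the paper's actual pipeline.

```python
def mean(values):
    return sum(values) / len(values)


def near_far_ratios(sites):
    """sites: list of (distance, diversity, divergence) tuples, one per neutral site.

    Returns (raw_ratio, normalised_ratio) comparing the 10% of sites nearest a
    conserved region with the 50% farthest away.
    """
    ordered = sorted(sites, key=lambda site: site[0])
    near = ordered[: max(1, len(ordered) // 10)]
    far = ordered[len(ordered) - max(1, len(ordered) // 2):]
    near_div, far_div = mean([s[1] for s in near]), mean([s[1] for s in far])
    near_out, far_out = mean([s[2] for s in near]), mean([s[2] for s in far])
    raw = near_div / far_div
    normalised = (near_div / near_out) / (far_div / far_out)
    return raw, normalised
```

If the reduction near conserved regions were purely a lower substitution rate, the outgroup divergence would drop by the same factor and the normalised ratio would go back to 1. A normalised ratio that stays below 1 is the signal that is left over.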

Again, this doesn't prove that selection is the cause of this pattern, but the pattern certainly matches what we would expect to see if it was selection that caused it.  The normalisation should eliminate, or at least reduce, effects that are just caused by the substitution rate, so unless we invoke some more exotic explanation for conservation and the patterns along the genome, selection is a valid conclusion.

(A) Ratios calculated using the 10% of neutral sites which are nearest to and the 50% of neutral sites farthest away from conserved segments or exons. (B) The same ratios as (A) but normalized by human/macaque (H/M) divergence to account for mutation rate variation or undetected sites under purifying selection. The distance to the nearest conserved segment or exon was determined using four different measures: physical distance, pedigree-based recombination distance [26], polymorphism-based finescale recombination distance [25] and the background selection parameter, B. B (described in the main text) is not technically a distance measure but incorporates information about the recombination rate and local density of conserved segments. Autosomal human nucleotide diversity was calculated from gene-centric SeattleSNPs PGA/EGP [20], whole-genome Perlegen [19] data, and HapMap phase II data [67]. Divergence was estimated using autosomal human/chimp (H/C), human/macaque (H/M), or human/dog (H/D) genome sequence data. HG and CG sites (where human and gorilla or chimp and gorilla share a nucleotide that differs from the other three species) were calculated using a smaller set of 5-species autosomal data. Repetitive regions were omitted from the Perlegen and HapMap analyses; additional filtering steps are described in the methods. Whiskers are 95% confidence intervals.
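The HG and CG counts in the caption are straightforward to compute from alignment columns. A minimal sketch following the caption's definition; the argument order and the two generic outgroup names are my assumptions about how the five-species data would be laid out.

```python
def classify_site(human, chimp, gorilla, outgroup_a, outgroup_b):
    """Classify one alignment column from five species.

    Returns "HG" if human and gorilla share a nucleotide that differs from the
    other three species, "CG" if chimp and gorilla do, and None otherwise.
    """
    if human == gorilla and all(x != human for x in (chimp, outgroup_a, outgroup_b)):
        return "HG"
    if chimp == gorilla and all(x != chimp for x in (human, outgroup_a, outgroup_b)):
        return "CG"
    return None


def count_sites(columns):
    """Count HG and CG sites in an iterable of five-nucleotide columns."""
    counts = {"HG": 0, "CG": 0}
    for column in columns:
        label = classify_site(*column)
        if label is not None:
            counts[label] += 1
    return counts
```

More HG and CG sites in a region means more incomplete lineage sorting there, which is what you expect where the ancestral effective population size was larger.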

Now that selection is concluded to be a plausible explanation for the pattern, they fit the data to a model that explains the variation by background selection. This model shows that selection is stronger near conserved regions than farther away, consistent with the assumption that the pattern is caused by selection.

Consequences

So what does all this tell us?

For one thing, it tells us that selection is a force we really should keep in mind when analysing genomes.  Yes, yes, we probably already knew that, but the neutrality assumption is so strong in genome analysis that we rarely consider non-neutrality except for the obligatory dN/dS tests on genes.  For anything that is not a gene, we usually analyse the sequences assuming neutrality.  It is a good null model, but completely ignoring selection when analysing genomic sequences should be reconsidered.

I know, I am overstating it a bit here, 'cause people are not just blindly assuming neutrality, but it is a strong null assumption and we really do not like to invoke selection unless there is strong evidence against neutrality.

Another consequence is for sequence divergence.

We estimate species divergence (time of speciation events) from sequence divergence.  More often than not we equate sequence divergence with species divergence, but really we shouldn't.  Even under neutrality this isn't true, since the coalescence process of sequences is such that the sequences are further apart than the species, but under neutrality at least this pattern is random along the genome.

There is still some correlation along the sequence of divergence time, under a neutral coalescence model, but at least this correlation drops off rapidly with (recombination) distance and it is not correlated with other genomic features (except in the sense that the substitution rate depends on these features).

With selection working its magic on a genome scale, the patterns of sequence divergence get a lot more interesting.

All of this is not really a new insight.  People working with e.g. Drosophila have known this for decades, but it has been ignored in more papers than I care to mention, and perhaps it is time we stop doing this.

--
McVicker, G., Gordon, D., Davis, C., & Green, P. (2009). Widespread Genomic Signatures of Natural Selection in Hominid Evolution. PLoS Genetics, 5 (5). DOI: 10.1371/journal.pgen.1000471


How much selection is going on in humans?

Saturday, January 17th, 2009

A priori we expect that most mutations, by far, have no consequence on fitness, while some have a negative effect and very few have a positive effect.  Consequently, we can generally ignore selection when analysing genomic sequences.

However, over the last few years a number of papers have suggested that adaptive (positive) selection has played a major role in shaping the human genome. That is, genome-wide there are signals that show patterns of selection. So perhaps we shouldn't be so quick to ignore it.

Yesterday another paper arguing this came out in PLoS Genetics:

Pervasive Hitchhiking at coding and regulatory sites in humans

Cai et al. PLoS Genetics 5(1)

Abstract

Much effort and interest have focused on assessing the importance of natural selection, particularly positive natural selection, in shaping the human genome. Although scans for positive selection have identified candidate loci that may be associated with positive selection in humans, such scans do not indicate whether adaptation is frequent in general in humans. Studies based on the reasoning of the MacDonald–Kreitman test, which, in principle, can be used to evaluate the extent of positive selection, suggested that adaptation is detectable in the human genome but that it is less common than in Drosophila or Escherichia coli. Both positive and purifying natural selection at functional sites should affect levels and patterns of polymorphism at linked nonfunctional sites. Here, we search for these effects by analyzing patterns of neutral polymorphism in humans in relation to the rates of recombination, functional density, and functional divergence with chimpanzees. We find that the levels of neutral polymorphism are lower in the regions of lower recombination and in the regions of higher functional density or divergence. These correlations persist after controlling for the variation in GC content, density of simple repeats, selective constraint, mutation rate, and depth of sequencing coverage. We argue that these results are most plausibly explained by the effects of natural selection at functional sites—either recurrent selective sweeps or background selection—on the levels of linked neutral polymorphism. Natural selection at both coding and regulatory sites appears to affect linked neutral polymorphism, reducing neutral polymorphism by 6% genome-wide and by 11% in the gene-rich half of the human genome. These findings suggest that the effects of natural selection at linked sites cannot be ignored in the study of neutral human polymorphism.

Selection and variation

Neutral mutations are expected to behave differently from non-neutral mutations mainly in their chance of getting fixed and the time it takes them to get fixed in a population.  Neutral mutations that get fixed are expected to have taken a number of generations linear in the effective population size, while mutants under selection that get fixed are expected to have taken a logarithmic number of generations. That goes for both positive and negative selection, but for different reasons.

Consider a region where there is no recombination going on, and assume that a new mutation appears there that is destined to get fixed in the population.  When it gets fixed, all individuals in the population will be descended from the individual that first carried the mutation.  They will not be identical in the region, though, 'cause new mutations will have accumulated in the time it took the mutation to get fixed.

The amount of variation in the population will depend on how quickly the mutation got fixed.  If it happened very slowly, we expect much variation, and if it happened very rapidly, we expect little variation.

That, combined with the expected time to fixation for neutral and selected mutants gives us a pattern to look for that distinguishes between neutral evolution and selection.

Variation, selection and recombination

When recombination is going on, we expect a slightly different pattern.

If there is selection on several mutations in the region, it gets pretty complicated.  At least I haven't managed to quite get my head around the details yet, but I'll refer you to this book: Population Genetics of Multiple Loci by Freddy Christiansen.

I will just assume that there is a single mutation under selection.

In that case, the pattern really is very similar.  We don't expect to see reduced variation in the entire region around the mutation, but instead we expect reduced variation close to the mutation site -- where few recombination events have occurred while the mutation got fixed -- and increased variation up to the neutral level as we move away from the mutation site -- where more and more recombinations have uncoupled the mutant from sites further away.

Selection in humans

It is this kind of pattern they look for in the PLoS Genetics paper.

A consequence of all of the above is that, assuming lots of selection is going on, we expect a positive correlation between variation and recombination rate, and a negative correlation between variation and the density of sites we a priori expect to be functional (like genes).

This is exactly what they find.

There are a few more details to it, of course.  Density of functional sites, recombination and mutation rates are not independent, so we could see exactly the same pattern just from the correlation with neutral mutation rates, so they need to correct for this.
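One common way to do this kind of correction is a partial correlation: regress both variables on the confounder and correlate the residuals. The sketch below uses a single control variable and is only meant to illustrate the idea; I am not claiming it is the exact procedure Cai et al. use, and the variable names are hypothetical.

```python
import math


def residuals(y, x):
    """Residuals of a simple least-squares regression of y on x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - mean_y) - slope * (xi - mean_x) for xi, yi in zip(x, y)]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(y, x, control):
    """Correlation between y and x after regressing both on one control variable."""
    return pearson(residuals(y, control), residuals(x, control))
```

If diversity tracks recombination rate only because both track the mutation rate, `pearson(diversity, recombination)` can be high while `partial_correlation(diversity, recombination, mutation_rate)` is close to zero. A correlation that survives the correction is the interesting one.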

Essentially, though, it is these patterns -- expected assuming selection but not assuming neutrality -- that they find.

--
Cai, J. J., Macpherson, J. M., Sella, G., & Petrov, D. A. (2009). Pervasive Hitchhiking at Coding and Regulatory Sites in Humans. PLoS Genetics, 5 (1). DOI: 10.1371/journal.pgen.1000336

Statistical alignment and virus selection paper now online

Monday, July 21st, 2008

The paper I described in a previous post: Investigating selection on viruses: a statistical alignment approach, just got published online today.  Yeah us!

More on adapting to climate

Sunday, March 9th, 2008

In a previous post I mentioned this review of a recent PLoS Genetics paper. I still haven't read the actual paper, I am ashamed to admit, but we will read it for a journal club at BiRC next week. Anyway, this morning I read another interesting review, again at Genetic Future: Climate genes: positive or balancing selection?

I should probably bring it to our journal club.

The point of this review is that we would only expect a linear relationship between climate and gene frequency if we ignored the history of the human diaspora out of Africa. The length of time selection has had to act on the genes varies a lot, from South East Asia, where humans arrived early, to South America, where humans arrived late.

Would this type of selection actually result in a neat linear trend, like that seen for the RAPTOR gene? Well, it might, if the timing was just right, but it's by no means a necessary outcome. There are at least three variables in play here, each of which will have some effect on the current frequency of a positively selected allele: the strength of selection, the starting frequency of the allele in that population, and the amount of time the population has existed in its current environment. For positive selection to result in a clean linear correlation between allele frequency and a climate variable, the latter two factors would have to have had a negligible impact, so that most of the variation is determined by selection intensity.

I think that's pretty unlikely given what we know about human population history: native Americans, for instance, are the descendants of a cold-adapted population living in Siberia that only relatively recently moved down into the warmer climates of central America; selection has not yet had much time to act in these populations. In contrast, humans in Southern Asia have been in their current climate much longer, giving selection more time to do its work. Thus for variants under positive selection, current frequency will be substantially affected by historical contingencies, and the correlation between allele frequency and selective strength will be rough at best.

There's also a reference to Voight et al. 2007, a paper I reviewed a few weeks back, on signals of selection in humans, based on extended haplotypes around genes under positive selection. Apparently, these signals are missing for the climate genes.

There is an alternative to positive selection, that could also explain the association between climate and gene frequency:

You've probably already guessed my hypothesis: at least some of the genes pulled out from this study (and probably the ones with the tightest correlations) have been the targets of balancing selection. Balancing selection could be acting on climate genes in different ways, but in my mind the most likely mechanism is via heterozygote advantage.

...

This model would result in each population reaching a stable allele frequency that is correlated with the local temperature, regardless of its starting frequency and how long the population had been subjected to that particular environment - so long as there has been enough time for the population to reach equilibrium. This scenario is much more likely to result in a linear correlation between allele frequency and climate variables than a simple positive selection model.

Now, what I am wondering is, how strong does the selection have to be for the allele frequency to reach equilibrium in the late arrival populations (say South Americans), and what kind of signals would we look for in the genome to test if balancing, rather than positive, selection is going on?
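A first stab at the "how strong" part can be made with the textbook deterministic model of heterozygote advantage: with genotype fitnesses \(1-s_1\), \(1\) and \(1-s_2\) for AA, Aa and aa, the frequency of A settles at \(s_2/(s_1+s_2)\) whatever the starting frequency. The sketch below counts generations until the frequency is close to that equilibrium. It ignores drift and migration, and the selection coefficients are arbitrary illustration values, not estimates for any real gene.

```python
def next_frequency(p, s1, s2):
    """One generation of selection with fitnesses AA: 1 - s1, Aa: 1, aa: 1 - s2."""
    q = 1 - p
    mean_fitness = p * p * (1 - s1) + 2 * p * q + q * q * (1 - s2)
    return p * (p * (1 - s1) + q) / mean_fitness


def generations_to_equilibrium(p0, s1, s2, tolerance=0.01, max_generations=1_000_000):
    """Generations until the frequency of A is within tolerance of s2 / (s1 + s2)."""
    equilibrium = s2 / (s1 + s2)
    p, generations = p0, 0
    while abs(p - equilibrium) > tolerance and generations < max_generations:
        p = next_frequency(p, s1, s2)
        generations += 1
    return generations, p
```

Comparing, say, `generations_to_equilibrium(0.01, 0.1, 0.3)` with `generations_to_equilibrium(0.01, 0.001, 0.003)` shows the same equilibrium of 0.75 but very different waiting times, which is the crux of the question for late-arrival populations.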

I really don't know -- I am too new to genetics to even have a clue -- but I bet that this is old stuff in the genetics literature. I'll have to ask around...