Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Two of my main interests are hidden Markov models and selection.  A paper from this spring, in Genetics, combines the two:

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Boitard, Schlötterer and Futschik

Detecting and localizing selective sweeps on the basis of SNP data has recently received considerable attention. Here we introduce the use of hidden Markov models (HMMs) for the detection of selective sweeps in DNA sequences. Like previously published methods, our HMMs use the site frequency spectrum, and the spatial pattern of diversity along the sequence, to identify selection. In contrast to earlier approaches, our HMMs explicitly model the correlation structure between linked sites. The detection power of our methods, and their accuracy for estimating the selected site location, is similar to that of competing methods for constant size populations. In the case of population bottlenecks, however, our methods frequently showed fewer false positives.

Selective sweeps

Under a simple Wright-Fisher model, a neutral mutation that has just been introduced into a population drifts up and down in frequency until it is eventually either fixed in the population, which happens with probability $$\frac{1}{2N_e}$$, or lost from the population again, which of course happens with probability $$1-\frac{1}{2N_e}$$.

The expected time from when such a mutation is introduced into the population until it is fixed, if it is lucky enough to be fixed, is about $$4N_e$$ generations.  During this time, the descendant chromosomes of the original mutant chromosome will be subjected to new mutations and to recombinations.
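To make these numbers concrete, here is a minimal simulation sketch of a single new neutral mutation in a diploid Wright-Fisher population. The function names, the tiny population size and the number of replicates are my own choices for illustration, not anything from the paper. It estimates the fixation probability, which should land near $$\frac{1}{2N_e}$$, and the mean time to fixation among the replicates that do fix.

```python
import random


def simulate_neutral_mutation(n_e, rng):
    """Follow one new neutral mutation until it is lost or fixed.

    Returns (fixed, generations): whether the mutation reached frequency 1,
    and how many generations it took to be absorbed either way.
    """
    n_chrom = 2 * n_e   # diploid population: 2 * N_e chromosomes
    copies = 1          # a new mutation starts as a single copy
    generations = 0
    while 0 < copies < n_chrom:
        freq = copies / n_chrom
        # Wright-Fisher sampling: every chromosome in the next generation
        # picks its parent at random, so it carries the mutation with
        # probability equal to the current frequency.
        copies = sum(rng.random() < freq for _ in range(n_chrom))
        generations += 1
    return copies == n_chrom, generations


def estimate_fixation(n_e, replicates, seed=1):
    """Estimate fixation probability and mean conditional fixation time."""
    rng = random.Random(seed)
    fix_times = []
    for _ in range(replicates):
        fixed, generations = simulate_neutral_mutation(n_e, rng)
        if fixed:
            fix_times.append(generations)
    prob = len(fix_times) / replicates
    mean_time = sum(fix_times) / len(fix_times) if fix_times else float("nan")
    return prob, mean_time


if __name__ == "__main__":
    n_e = 10  # deliberately tiny so the simulation runs in a moment
    prob, mean_time = estimate_fixation(n_e, replicates=20000)
    print(f"fixation probability: {prob:.4f} (theory: {1 / (2 * n_e):.4f})")
    print(f"mean time to fixation, given fixation: {mean_time:.1f} generations")
```

With a population this small the time to fixation is only a rough match to the large-population approximation, but the fixation probability is exact for this model, so the estimate should agree with it up to sampling noise.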

Once this mutation is fixed, everyone in the population will of course share that particular mutation (ignoring back-mutations and such here), but because of recombination nearby sites will not necessarily all be derived from the original mutation chromosome.  Close to the mutation site — where few recombinations will have broken up the sequence — most chromosomes will be derived from the mutation chromosome and as we move away from the mutation site fewer chromosomes will be derived from that original chromosome.

Now, if the mutation introduced has a selective advantage, essentially the same process will play out.  In each generation there is a slightly higher chance that this mutation will leave offspring, but that is essentially the only difference.

What this means is that initially there is still a very good chance that the mutation will be lost — even with slightly better odds accidents do happen — but once the mutation has reached a reasonable frequency it is almost guaranteed to reach fixation — unless a lot of accidents happen.

Once the frequency of the site under selection is high enough it will very quickly reach fixation.  The expected time it takes depends on the selection strength, but unless the selective advantage is very small it will reach fixation a lot faster than if it were neutral.  Think logarithmic time in the size of the population compared to linear time.
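The "logarithmic versus linear" claim can be illustrated with a deliberately crude sketch. It ignores drift entirely and just iterates the deterministic recursion $$x' = x(1+s)/(1+sx)$$ for an allele with relative fitness $$1+s$$, starting from a single copy and stopping one copy short of fixation. The selection model, the function name and the parameter values are my assumptions for illustration.

```python
def sweep_generations(n_e, s):
    """Generations for a favoured allele to go from 1 copy to 2*N_e - 1 copies.

    Deterministic caricature of a sweep: no drift, relative fitness 1 + s
    for carriers, frequency updated as x' = x(1 + s) / (1 + s * x).
    """
    n_chrom = 2 * n_e
    x = 1 / n_chrom              # start as a single copy
    generations = 0
    while x < 1 - 1 / n_chrom:   # stop one copy short of fixation
        x = x * (1 + s) / (1 + s * x)
        generations += 1
    return generations


if __name__ == "__main__":
    s = 0.1
    for n_e in (1_000, 1_000_000):
        print(f"N_e = {n_e:>9}: about {sweep_generations(n_e, s)} generations")
```

Going from a thousand to a million individuals only adds a fixed number of generations to the sweep in this toy model, whereas the neutral fixation time grows in proportion to the population size.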

Since it reaches fixation much faster than a neutral mutation, fewer mutations and fewer recombinations will have time to occur, so a much wider region around the mutation site will be shared by all descendant chromosomes.  Combined, this means that for a selected site you expect a wide region with a more recent shared ancestor than you would expect at a neutral site, a phenomenon called a selective sweep.

Site frequency spectra

Now, from the population genetics model you can work out (by putting your thinking hat on, or just by simulating) the expected distribution of derived and ancestral alleles: the site frequency spectrum.  This will differ between neutral and selected alleles because of the shorter time back to the common ancestor for the selected sites.  The shorter time means that there is a general reduction in polymorphism near a selected site, and derived alleles that appeared on chromosomes with the beneficial mutation will be at a higher frequency than they would be if they weren’t “hitchhiking” on the selection of the beneficial mutation.

The pattern is a bit complicated by recombination, since you need to take into account that the further away from the selected site you look, the weaker the hitchhiking effect will be; a new mutation can only hitchhike as long as it is linked to the selected site, and recombinations break that link.
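For reference, the neutral baseline has a simple closed form: under the standard neutral model (constant population size, infinite sites), the expected number of sites where the derived allele appears on $$i$$ of $$n$$ sampled chromosomes is proportional to $$1/i$$. The sketch below computes that spectrum normalised over segregating sites. The function name is mine, and the spectra used in the sweep methods themselves are more involved than this.

```python
def neutral_sfs(n):
    """Expected neutral site frequency spectrum for a sample of n chromosomes.

    Entry i - 1 of the returned list is the probability that a segregating
    site carries the derived allele on exactly i of the n chromosomes,
    for i = 1, ..., n - 1.
    """
    weights = [1 / i for i in range(1, n)]  # expected counts are proportional to 1/i
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for i, prob in enumerate(neutral_sfs(10), start=1):
        print(f"derived allele on {i} of 10 chromosomes: {prob:.3f}")
```

Rare derived alleles dominate under neutrality, which is why an excess of high-frequency derived alleles next to a region of low diversity stands out as a possible hitchhiking signal.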

Anyway, the different spectra of derived and ancestral alleles can be used to detect selective sweeps.  Two methods that exploit this, and that are relevant for this post, are Kim and Stephan (2002) and Nielsen et al. (2005).

Of course, selection is not the only thing that can mess up the site frequency spectrum and make it different from the expected neutral distribution.  Demographic effects like expanding populations and bottlenecks can look very similar to selection effects, so we cannot conclusively reject neutrality just because we see a deviation from the expected spectrum.  Still, the site frequency spectra of neutrality versus selection can be used for scanning for selection.

Detecting sweeps in a hidden Markov model

The new result in the Genetics paper is a hidden Markov model that uses site frequency spectra to scan for selective sweeps.

Using an HMM means that the model can capture spatial patterns along a genome, including transitions between “neutral” regions, where no sweep has occurred or is occurring, and “selected” regions, where a sweep occurred or is occurring.  So you don’t have to assume that a locus you are looking at is either a neutral region or a selected region, and you don’t have to fiddle around with sliding windows to scan a genome; you explicitly capture the changing patterns.

This is one of the nice properties of HMMs for genomic scans, and the reason I love them so much.

The model Boitard et al. develop is quite simple.  They have three states: a neutral state, a selected state, and an intermediate state used to capture sites that are slightly caught up in the hitchhiking but not close enough to a selected site to get the full effect.

The transition matrix has a single parameter, $$p$$, that is the probability that a neutral or selected site switches to the intermediate state (and the intermediate state switches to those two with equal probability set to $$p/2$$).

$$T=\begin{pmatrix}1-p&p&0\\ p/2&1-p&p/2\\ 0&p&1-p\end{pmatrix}$$

This of course has the unfortunate effect that the prior distribution (stationary distribution) of the chain will give you a 25% chance of a site being neutral, a 25% chance of it being selected and a 50% chance of it being intermediate, which doesn’t really match my expectation of the amount of selection in, say, a human genome. Also, the (prior) expected length of a swept region is the same as that of a neutral region, which also does not match my intuition.  With enough data, though, the likelihood should overrule the prior so perhaps it is not too much of a worry…
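The 25/50/25 split is easy to check numerically. Here is a small sketch that builds the transition matrix with the states ordered neutral, intermediate, selected, and pushes a starting distribution through it until it settles. The value of $$p$$ and the function names are arbitrary choices of mine.

```python
def transition_matrix(p):
    """Transition matrix with states ordered neutral, intermediate, selected."""
    return [
        [1 - p, p, 0.0],
        [p / 2, 1 - p, p / 2],
        [0.0, p, 1 - p],
    ]


def stationary_distribution(matrix, iterations=10000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(matrix)
    dist = [1.0] + [0.0] * (n - 1)  # start with all mass on the first state
    for _ in range(iterations):
        # One step of the chain: new_dist[j] = sum_i dist[i] * matrix[i][j]
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


if __name__ == "__main__":
    p = 0.01
    neutral, intermediate, selected = stationary_distribution(transition_matrix(p))
    print(f"neutral {neutral:.3f}, intermediate {intermediate:.3f}, selected {selected:.3f}")
    # Every state is left with probability p per site, so the expected run
    # length is 1 / p sites, whichever state the chain is in.
    print(f"expected run length in any state: {1 / p:.0f} sites")
```

The result does not depend on $$p$$, which only controls how long the chain stays in a state before moving on.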

The emissions of the model are frequencies of derived alleles, so for each site it will emit a frequency that depends on the state.  This is where they capture the different expected frequencies depending on whether a site is neutral or selected.
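To show how the transitions and emissions fit together in a scan, here is a toy Viterbi decoder for a three-state chain of this shape. Everything specific in it is a made-up placeholder: the emission probabilities are invented numbers over three coarse frequency bins rather than the spectra from the paper, the initial distribution is simply the stationary one, and I am not claiming that the paper decodes with Viterbi rather than, say, posterior probabilities.

```python
import math

STATES = ("neutral", "intermediate", "selected")

# Hypothetical emission probabilities over binned derived allele frequencies.
EMISSIONS = [
    {"low": 0.70, "mid": 0.20, "high": 0.10},  # neutral
    {"low": 0.55, "mid": 0.20, "high": 0.25},  # intermediate
    {"low": 0.45, "mid": 0.05, "high": 0.50},  # selected
]
INITIAL = [0.25, 0.50, 0.25]  # assumed: start from the stationary distribution


def safe_log(x):
    """Logarithm that maps probability 0 to -infinity instead of raising."""
    return math.log(x) if x > 0 else float("-inf")


def viterbi(observations, p, emissions=EMISSIONS, initial=INITIAL):
    """Return the most probable state path for a sequence of observations."""
    trans = [
        [1 - p, p, 0.0],
        [p / 2, 1 - p, p / 2],
        [0.0, p, 1 - p],
    ]
    n = len(STATES)
    # score[k]: best log-probability of any path that ends in state k.
    score = [safe_log(initial[k]) + safe_log(emissions[k][observations[0]])
             for k in range(n)]
    back = []  # back[t][k]: best predecessor of state k at step t + 1
    for obs in observations[1:]:
        new_score, pointers = [], []
        for k in range(n):
            prev = max(range(n), key=lambda j: score[j] + safe_log(trans[j][k]))
            pointers.append(prev)
            new_score.append(score[prev] + safe_log(trans[prev][k])
                             + safe_log(emissions[k][obs]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[k] for k in path]


if __name__ == "__main__":
    obs = ["low"] * 8 + ["high", "low", "high", "high", "low", "high"] * 3 + ["low"] * 8
    for site, (o, state) in enumerate(zip(obs, viterbi(obs, p=0.05))):
        print(site, o, state)
```

Because the transition matrix has zeros in its corners, any decoded path has to pass through the intermediate state on its way between neutral and selected, which is exactly the spatial structure the model is meant to impose.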

They use the Kim and Stephan and the Nielsen et al. methods for this, to develop three variations of HMMs: HMMA, using Kim and Stephan; HMMB, using Nielsen et al.; and HMMB-SEQ, which also uses Nielsen et al. but only considers segregating sites.  The latter is only for comparison purposes and of course ignores a lot of the information in the data, since the number of non-segregating sites reflects the general level of polymorphism in a region, which in turn depends on the depth of the local genealogy and will be affected by selection.

They use simulations under neutrality to fix the parameter $$p$$ so they get a 5% false positive rate, and then use the models to scan for sweeps.

They get okay power for detecting sweeps, but they don’t gain much over the previous methods, since those did pretty well too:

Table 1

Where they refer to this table in the paper they say they have a higher power, but compared to the CLsw column, the Kim and Stephan method, they do not.  After all, it is difficult to beat a power of 1.

They do, however, appear to be more robust to bottlenecks where the two other methods have very high false positive rates:

Table 5

Boitard, S., Schlötterer, C., & Futschik, A. (2009). Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models. Genetics, 181 (4), 1567-1578. DOI: 10.1534/genetics.108.100032

Author: Thomas Mailund

My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.
