Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Two of my main interests are hidden Markov models and selection.  A paper from this spring, in Genetics, combines the two:

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Boitard, Schlötterer and Futschik

Detecting and localizing selective sweeps on the basis of SNP data has recently received considerable attention. Here we introduce the use of hidden Markov models (HMMs) for the detection of selective sweeps in DNA sequences. Like previously published methods, our HMMs use the site frequency spectrum, and the spatial pattern of diversity along the sequence, to identify selection. In contrast to earlier approaches, our HMMs explicitly model the correlation structure between linked sites. The detection power of our methods, and their accuracy for estimating the selected site location, is similar to that of competing methods for constant size populations. In the case of population bottlenecks, however, our methods frequently showed fewer false positives.

Selective sweeps

Under a simple Wright-Fisher model, a neutral mutation that is just introduced into a population  can slowly increase and decrease in frequency until it is eventually either fixed in the population, which happens with probability $$\frac{1}{2N_e}$$, or until it is lost from the population againg, which happens with probability $$1-\frac{1}{2N_2}$$ of course.

The expected time from such a mutation is introduced into the population and until it is fixed, if it is lucky to be fixed, is $$2N_2$$ generations.  During this time, the descendant chromosomes of the original mutant chromosome will be subjected to new mutations and to recombinations.

Once this mutation is fixed, everyone in the population will of course share that particular mutation (ignoring back-mutations and such here), but because of recombination nearby sites will not necessarily all be derived from the original mutation chromosome.  Close to the mutation site — where few recombinations will have broken up the sequence — most chromosomes will be derived from the mutation chromosome and as we move away from the mutation site fewer chromosomes will be derived from that original chromosome.

Now, if the mutation introduced has a selective advantage, essentially the same process will play out.  In each generation there is a slightly higher chance that this mutation will have off-springs, but that is essentially the only difference.

What this means is that initially there is still a very good chance that the mutation will be lost — even with slightly better odds accidents do happen — but once the mutation has reached a reasonable frequency it is almost guaranteed to reach fixation — unless a lot of accidents happen.

Once the frequency of the site under selection is high enough it will very quickly reach fixation.  The expected time it takes depends on the selection strength but unless the selective advantage is very small it will reach fixation a lot faster than if it was neutral.  Think logarithmic time in the size of the population compared to linear time.

Since it reaches fixation much faster than a neutral mutation, fewer mutations and fewer recombinations will have time to occur, so a much wider region around the mutation site will be shared by all descendant chromosomes.  Combined, this means that for a selected site you expect a wide region with a more recent shared ancestor than you would expect at a neutral site, a phenomena called a selective sweep.

Site frequency spectra

Now, from the population genetics model you can work out — putting your thinking hat on or just simulate — the expected distribution of derived and ancestral alleles: the site frequency spectrum.  This will be different from neutral alleles and selected alleles because of the shorter time back to the common ancestor for the selected sites.  The shorter site means that there is a general reduction in polymorphism near a selected site, and derived alleles that appeared on chromosomes with the beneficial mutation will be at a higher frequency than they would be if they weren’t “hitchhiking” on the selection of the beneficial mutation.

The pattern is a bit complicated by recombination, since you need to take into account that the further away from the selected site you look, the weaker the hitchhiking effect will be; a new mutation can only hitchhike as long as it is linked to the selected site, and recombinations break that link.

Anyway, the different spectra of derived and ancestral alleles can be used to detect selective sweeps.  Two methods that exploit this, that is relevant for this post, are Kim and Stephan (2002) and Nielsen et al. (2005).

Of course, selection is not the only thing that can mess up the site frequency spectrum and make it different from the expected neutral distribution.  Demographic effects like expending populations and bottlenecks can look very similar to selection effects, so we cannot absolutely rule out neutrality if we see a deviation from the expected spectrum.  Still, the site frequency spectra of neutrality versus selection can be used for scanning for selection.

Detecting sweeps in a hidden Markov model

The new result in the Genetics paper is a hidden Markov model that uses site frequency spectra to scan for selective sweeps.

Using an HMM means that the model can capture spatial patterns along a genome and capture transitions from “neutral” regions — where no sweep has occurred or is occurring — from “selected” regions — where a sweep occurred or is occurring.  So you don’t have to assume that a locus you are looking at is either a neutral region or a selected region and you don’t have to fiddle around with sliding windows to scan a genome, you explicitly capture the changing patters.

One of the nice properties of HMMs for genomic scans and the reason I love them so much.

The model Boitard et al. develop is quite simple.  They have three states: a neutral state, a selected state, and an intermediate used to capture sites that are slightly caught up in the hitchhiking but not close enough to a selected site to get the full effect.

The transition matrix has a single parameter, $$p$$, that is the probability that a neutral or selected site switches to the intermediate state (and the intermediate state switches to those two with equal probability set to $$p/2$$).

$$!T=\begin{pmatrix}1-p&p&0\\ p/2&1-p&p/2\\ 0&p&1-p\end{pmatrix}$$

This of course has the unfortunate effect that the prior distribution (stationary distribution) of the chain will give you 25% chance of a site being neutral, 25% chance of it being selected and 50% chance of being intermediate, which doesn’t really match my expectation of the amount of selection in, say, a human genome. Also, the (prior) expected length of a sweeped region is the same as a neutral region which also does not match my intuition.  With enough data, though, the likelihood should overrule the prior so perhaps it is not too much of a worry…

The emissions of the model are frequencies of derived alleles, so for each site it will emit a frequency that depends on the state.  This is where they capture the different expected frequencies depending on whether a site is neutral or selected.

They use the Kim and Stephan’s and Nielsen et al. methods for this, to develop three variations of HMMs: HMMA, using Kim and Stephan, HMMB using Nielsen et al. and HMMB-SEQ, that also uses Nielsen et al. but only considers segregating sites.  The latter is only for comparison purposes and of course ignores a lot of the information in the data, since the amount of non-segregating sites reflects the general level of polymorphism in a region which again is dependent on the depth of the local genealogy and will be affected by selection.

They use simulations under neutrality to fix the parameter $$p$$ so they get a 5% false positive rate, and then use the models to scan for sweeps.

They get an okay power for detecting sweeps, but compared to the previous methods they don’t get that much since they did pretty good as well:

Table 1Where they refer to this table in the paper they say they have a higher power, but compared to the CLsw column, the Kim and Stephan’s method, they do not.  After all, it is difficult to beat a power of 1.

They do, however, appear to be more robust to bottlenecks where the two other methods have very high false positive rates:

Table 5

Boitard, S., Schlotterer, C., & Futschik, A. (2009). Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models Genetics, 181 (4), 1567-1578 DOI: 10.1534/genetics.108.100032

Last week in the blogs

Biology and evolution



Math and physics


Research life

Scientific publishing


Not exactly an impressive success rate…

From my own experience I know that it can be hard to get access to data that you would really love to analyse, but I didn’t expect it to be quite this bad, even for data that is required to be available by the journals where the papers describing the data are published:

Empirical study of data sharing by authors publishing in PLoS journals

Savage and Vickers, PLoS ONE 2009


Many journals now require authors share their data with other investigators, either by depositing the data in a public repository or making it freely available upon request. These policies are explicit, but remain largely untested. We sought to determine how well authors comply with such policies by requesting data from authors who had published in one of two journals with clear data sharing policies.

Methods and Findings

We requested data from ten investigators who had published in either PLoS Medicine or PLoS Clinical Trials. All responses were carefully documented. In the event that we were refused data, we reminded authors of the journal’s data sharing guidelines. If we did not receive a response to our initial request, a second request was made. Following the ten requests for raw data, three investigators did not respond, four authors responded and refused to share their data, two email addresses were no longer valid, and one author requested further details. A reminder of PLoS’s explicit requirement that authors share data did not change the reply from the four authors who initially refused. Only one author sent an original data set.


We received only one of ten raw data sets requested. This suggests that journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.

Getting a 10% success rate, when it should be 100% is pretty bad…


Detecting ancient admixture and estimating demographic parameters in multiple human populations

I read this paper on our way back from Leipzig and then again today to see if I missed anything in the first read through (I was pretty tired at the time).

Detecting ancient admixture and estimating demographic parameters in multiple human populations

Wall, Lohmueller and Plagnol, Mol Biol Evo 26(8):1823-1827

We analyze patterns of genetic variation in extant human polymorphism data from the National Institute of Environmental Health Sciences single nucleotide polymorphism project to estimate human demographic parameters. We update our previous work by considering a larger data set (more genes and more populations) and by explicitly estimating the amount of putative admixture between modern humans and archaic human groups (e.g., Neandertals, Homo erectus, and Homo floresiensis). We find evidence for this ancient admixture in European, East Asian, and West African samples, suggesting that admixture between diverged hominin groups may be a general feature of recent human evolution.

What they do in this paper is to fit a two population coalescent model, with expansion, migration, bottlenecks and the works, to both an African+European and an African+Asian data set, then use this fitted model as a null model of the genetics of the populations.  They then 1) do a test on an LD statistic against this null model, taking rejections of this null model as evidence for admixture from archaic humans, and 2) fit an admixture extension of the model to estimate the level of admixture.  They find evidence for admixture with archaic humans for both data sets, with a somewhat higher degree in the Europeans.

I’m a bit underwhelmed by the paper, I must admit.  I’m not saying that there is no admixture with archaic humans, but this approach does not convince me.

Even when taking various demographic effects into account in the modeling, the null model is unlikely to exactly fit real data.  Taking deviations from the null model as any kind of evidence for admixture thus seems a bit hasty.

Not that I have any better ideas as to how to approach this, just, in my eyes the jury is still out on the question of admixture with archaic humans…

Wall, J., Lohmueller, K., & Plagnol, V. (2009). Detecting Ancient Admixture and Estimating Demographic Parameters in Multiple Human Populations Molecular Biology and Evolution, 26 (8), 1823-1827 DOI: 10.1093/molbev/msp096