Characterization of missing human genome sequences and copy-number polymorphic insertions

If you are into structural variation and next generation sequencing, this might interest you.

It’s a quick review of the paper

Characterization of missing human genome sequences and copy-number polymorphic insertions

Jeffrey M Kidd et al


The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18–37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.

We have known for a while that when you sequence a new individual, or re-sequence an individual, you’ll find some sequences that are not in your assembly.  Some of it is because of assembly artifacts which we cannot completely avoid, and then there is the interesting stuff that is simply polymorphism of genomes.  Structural variation.

The paper describes a large sequencing-based study in which the authors identify such sequences and characterize them.

Evolution in Health and Medicine

Hi readers.  Sorry I’ve been very slow in posting the last two months.  My RSI kicked in badly early January and I chose to limit my computer usage to the absolute minimum for a while.  That, combined with a lot of work on various projects means that I haven’t been able to blog since around Christmas…

It will probably be another few months before I’m up to speed again.  I still haven’t recovered fully, but at least it is getting better…

Anyway, enough excuses!  I’m posting now just to share this nice list of talks I got by email today: Evolution in Health and Medicine.

I like the talks there, at least, and I hope you will also.

Stay tuned.  While the posting is at a very low rate right now, I do plan to pick up the speed over the coming weeks…

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Two of my main interests are hidden Markov models and selection.  A paper from this spring, in Genetics, combines the two:

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Boitard, Schlötterer and Futschik

Detecting and localizing selective sweeps on the basis of SNP data has recently received considerable attention. Here we introduce the use of hidden Markov models (HMMs) for the detection of selective sweeps in DNA sequences. Like previously published methods, our HMMs use the site frequency spectrum, and the spatial pattern of diversity along the sequence, to identify selection. In contrast to earlier approaches, our HMMs explicitly model the correlation structure between linked sites. The detection power of our methods, and their accuracy for estimating the selected site location, is similar to that of competing methods for constant size populations. In the case of population bottlenecks, however, our methods frequently showed fewer false positives.

Selective sweeps

Under a simple Wright-Fisher model, a neutral mutation that has just been introduced into a population can slowly increase and decrease in frequency until it is eventually either fixed in the population, which happens with probability $$\frac{1}{2N_e}$$, or lost from the population again, which of course happens with probability $$1-\frac{1}{2N_e}$$.
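If you want to convince yourself of the fixation probability, a small simulation does the trick. The sketch below is my own illustration, not anything from the paper: it assumes a constant population of `num_copies` gene copies (playing the role of $$2N_e$$), binomial resampling each generation, and no new mutation. The function names are made up for this example.

```python
import random


def simulate_neutral_fixation(num_copies, rng):
    """Follow one new neutral mutation until it is lost or fixed.

    num_copies is the number of gene copies in the population
    (2 * N_e for a diploid population). Returns True on fixation.
    """
    count = 1  # the mutation starts as a single copy
    while 0 < count < num_copies:
        freq = count / num_copies
        # Wright-Fisher step: each copy in the next generation picks
        # a parent at random, i.e. a binomial draw with success freq.
        count = sum(1 for _ in range(num_copies) if rng.random() < freq)
    return count == num_copies


def estimate_fixation_probability(num_copies, replicates, seed=1):
    """Fraction of independent replicates in which the mutation fixes."""
    rng = random.Random(seed)
    fixed = sum(simulate_neutral_fixation(num_copies, rng)
                for _ in range(replicates))
    return fixed / replicates
```

With, say, 20 gene copies and a few thousand replicates, the estimate should land close to 1/20.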

The expected time from when such a mutation is introduced into the population until it is fixed, if it is lucky enough to be fixed, is about $$4N_e$$ generations.  During this time, the descendant chromosomes of the original mutant chromosome will be subjected to new mutations and to recombinations.

Once this mutation is fixed, everyone in the population will of course share that particular mutation (ignoring back-mutations and such here), but because of recombination nearby sites will not necessarily all be derived from the original mutation chromosome.  Close to the mutation site — where few recombinations will have broken up the sequence — most chromosomes will be derived from the mutation chromosome and as we move away from the mutation site fewer chromosomes will be derived from that original chromosome.

Now, if the mutation introduced has a selective advantage, essentially the same process will play out.  In each generation there is a slightly higher chance that this mutation will have offspring, but that is essentially the only difference.

What this means is that initially there is still a very good chance that the mutation will be lost — even with slightly better odds accidents do happen — but once the mutation has reached a reasonable frequency it is almost guaranteed to reach fixation — unless a lot of accidents happen.
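To put a number on "a very good chance", here is the classical diffusion approximation (Kimura's), which is not from the paper. It assumes a single new copy, genic selection where each copy of the mutation adds $$s$$ to fitness, and a census size equal to $$N_e$$:

$$P_{\mathrm{fix}} = \frac{1-e^{-2s}}{1-e^{-4N_e s}} \approx 2s \quad \text{(for small } s \text{ and large } 4N_e s\text{)}$$

So even with a 1% advantage, the new mutation is lost about 98% of the time under these assumptions.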

Once the frequency of the site under selection is high enough it will very quickly reach fixation.  The expected time it takes depends on the selection strength but unless the selective advantage is very small it will reach fixation a lot faster than if it was neutral.  Think logarithmic time in the size of the population compared to linear time.
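A back-of-the-envelope version of the "logarithmic time" claim, under assumptions that are mine rather than the paper's: ignore drift, let the frequency $$x$$ of the beneficial allele grow deterministically with genic selection coefficient $$s$$, and follow it from $$\frac{1}{2N_e}$$ to $$1-\frac{1}{2N_e}$$:

$$\frac{dx}{dt} = s\,x(1-x) \quad\Rightarrow\quad t_{\mathrm{sweep}} \approx \frac{2\ln(2N_e)}{s}$$

For example, with $$s = 0.01$$ and $$N_e = 10{,}000$$ this gives roughly 2,000 generations, and doubling the population size adds only about 140 more.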

Since it reaches fixation much faster than a neutral mutation, fewer mutations and fewer recombinations will have time to occur, so a much wider region around the mutation site will be shared by all descendant chromosomes.  Combined, this means that for a selected site you expect a wide region with a more recent shared ancestor than you would expect at a neutral site, a phenomenon called a selective sweep.

Site frequency spectra

Now, from the population genetics model you can work out (by putting your thinking hat on or by just simulating) the expected distribution of derived and ancestral allele frequencies: the site frequency spectrum.  This will differ between neutral and selected regions because of the shorter time back to the common ancestor near selected sites.  The shorter time means that there is a general reduction in polymorphism near a selected site, and derived alleles that appeared on chromosomes with the beneficial mutation will be at a higher frequency than they would be if they weren’t “hitchhiking” on the selection of the beneficial mutation.
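For concreteness, here is a minimal sketch of how you would tabulate an (unfolded) site frequency spectrum from data. The input format is an assumption for illustration: a list of chromosomes, each a list of 0/1 values with 1 marking the derived allele. The second function gives the textbook expectation under the standard neutral model with constant population size, where `theta` is the population-scaled mutation rate for the region; neither function is taken from the paper.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS from a 0/1 matrix (rows = chromosomes, columns = sites).

    spectrum[k] is the number of sites at which exactly k of the sampled
    chromosomes carry the derived allele (coded as 1).
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    spectrum = [0] * (n + 1)
    for site in range(num_sites):
        derived_count = sum(chrom[site] for chrom in haplotypes)
        spectrum[derived_count] += 1
    return spectrum


def expected_neutral_sfs(theta, n):
    """Expected number of sites with k derived copies, k = 1..n-1,
    under the standard neutral model: theta / k."""
    return [theta / k for k in range(1, n)]


# Three sampled chromosomes, four sites.
sample = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
print(site_frequency_spectrum(sample))  # [1, 2, 0, 1]
```

Entries 0 and `n` of the spectrum count the non-segregating sites; a sweep shows up as too many of those, plus an excess of high-frequency derived alleles among the segregating ones.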

The pattern is a bit complicated by recombination, since you need to take into account that the further away from the selected site you look, the weaker the hitchhiking effect will be; a new mutation can only hitchhike as long as it is linked to the selected site, and recombinations break that link.

Anyway, the different spectra of derived and ancestral alleles can be used to detect selective sweeps.  Two methods that exploit this, and that are relevant for this post, are Kim and Stephan (2002) and Nielsen et al. (2005).

Of course, selection is not the only thing that can mess up the site frequency spectrum and make it different from the expected neutral distribution.  Demographic effects like expanding populations and bottlenecks can look very similar to selection effects, so we cannot absolutely rule out neutrality if we see a deviation from the expected spectrum.  Still, the site frequency spectra of neutrality versus selection can be used for scanning for selection.

Detecting sweeps in a hidden Markov model

The new result in the Genetics paper is a hidden Markov model that uses site frequency spectra to scan for selective sweeps.

Using an HMM means that the model can capture spatial patterns along a genome and capture transitions from “neutral” regions (where no sweep has occurred or is occurring) to “selected” regions (where a sweep occurred or is occurring).  So you don’t have to assume that a locus you are looking at is either a neutral region or a selected region, and you don’t have to fiddle around with sliding windows to scan a genome; you explicitly capture the changing patterns.

This is one of the nice properties of HMMs for genomic scans, and the reason I love them so much.

The model Boitard et al. develop is quite simple.  They have three states: a neutral state, a selected state, and an intermediate used to capture sites that are slightly caught up in the hitchhiking but not close enough to a selected site to get the full effect.

The transition matrix has a single parameter, $$p$$, that is the probability that a neutral or selected site switches to the intermediate state (and the intermediate state switches to those two with equal probability set to $$p/2$$).

$$T=\begin{pmatrix}1-p&p&0\\ p/2&1-p&p/2\\ 0&p&1-p\end{pmatrix}$$

This of course has the unfortunate effect that the prior distribution (stationary distribution) of the chain will give you 25% chance of a site being neutral, 25% chance of it being selected and 50% chance of being intermediate, which doesn’t really match my expectation of the amount of selection in, say, a human genome. Also, the (prior) expected length of a swept region is the same as that of a neutral region, which also does not match my intuition.  With enough data, though, the likelihood should overrule the prior so perhaps it is not too much of a worry…
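The 25/50/25 split is easy to check numerically. The sketch below is just my own sanity check, not code from the paper: it builds the transition matrix for a given $$p$$, with states ordered neutral, intermediate, selected, and finds the stationary distribution by repeatedly applying the matrix. The expected run length follows from the geometric distribution, since every state is left with probability $$p$$ per site.

```python
def transition_matrix(p):
    """Three-state chain; states ordered neutral, intermediate, selected."""
    return [
        [1 - p, p, 0.0],
        [p / 2, 1 - p, p / 2],
        [0.0, p, 1 - p],
    ]


def stationary_distribution(matrix, iterations=10000):
    """Approximate the stationary distribution by power iteration."""
    n = len(matrix)
    dist = [1.0] + [0.0] * (n - 1)  # start with all mass in state 0
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n))
                for j in range(n)]
    return dist


def expected_run_length(p):
    """Mean number of consecutive sites spent in a state before leaving;
    it is the same for all three states in this chain."""
    return 1.0 / p


print(stationary_distribution(transition_matrix(0.1)))
```

The result does not depend on $$p$$, so no choice of the single parameter can make the prior proportion of selected sites small.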

The emissions of the model are frequencies of derived alleles, so for each site it will emit a frequency that depends on the state.  This is where they capture the different expected frequencies depending on whether a site is neutral or selected.

They use the Kim and Stephan and the Nielsen et al. methods for this, to develop three variations of HMMs: HMMA, using Kim and Stephan; HMMB, using Nielsen et al.; and HMMB-SEQ, which also uses Nielsen et al. but only considers segregating sites.  The latter is only for comparison purposes and of course ignores a lot of the information in the data, since the number of non-segregating sites reflects the general level of polymorphism in a region, which in turn depends on the depth of the local genealogy and will be affected by selection.

They use simulations under neutrality to fix the parameter $$p$$ so they get a 5% false positive rate, and then use the models to scan for sweeps.
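To give a feel for what scanning with such a model looks like, here is a toy Viterbi decoder over the three states. Everything about the emissions is an assumption made up for this sketch: I bin derived allele frequencies into three classes (0 = low, 1 = middle, 2 = high) and use an invented emission table, whereas the paper derives its emission probabilities from the Kim and Stephan and the Nielsen et al. models. I also don't know which decoding the authors use; Viterbi is just the simplest to show.

```python
import math


def transition_matrix(p):
    """States ordered neutral (0), intermediate (1), selected (2)."""
    return [
        [1 - p, p, 0.0],
        [p / 2, 1 - p, p / 2],
        [0.0, p, 1 - p],
    ]


# Hypothetical emission probabilities per state over three
# derived-frequency classes (low, middle, high). Not from the paper.
TOY_EMISSIONS = [
    [0.7, 0.2, 0.1],  # neutral
    [0.3, 0.4, 0.3],  # intermediate
    [0.1, 0.2, 0.7],  # selected
]
INITIAL = [0.25, 0.5, 0.25]  # stationary distribution of the chain


def safe_log(x):
    return math.log(x) if x > 0 else float("-inf")


def viterbi(observations, transition, emission, initial):
    """Most probable state path for a sequence of observation classes."""
    n = len(transition)
    scores = [safe_log(initial[s]) + safe_log(emission[s][observations[0]])
              for s in range(n)]
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = [], []
        for s in range(n):
            # Best predecessor state for landing in state s at this site.
            prev = max(range(n),
                       key=lambda r: scores[r] + safe_log(transition[r][s]))
            new_scores.append(scores[prev] + safe_log(transition[prev][s])
                              + safe_log(emission[s][obs]))
            pointers.append(prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi(classes, transition_matrix(p), TOY_EMISSIONS, INITIAL)` on a list of frequency classes returns one state index per site, so runs of 2s mark candidate sweep regions.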

They get okay power for detecting sweeps, but they don’t gain much over the previous methods, since those did pretty well too:

Table 1

Where they refer to this table in the paper they say they have a higher power, but compared to the CLsw column, Kim and Stephan’s method, they do not.  After all, it is difficult to beat a power of 1.

They do, however, appear to be more robust to bottlenecks where the two other methods have very high false positive rates:

Table 5

Boitard, S., Schlotterer, C., & Futschik, A. (2009). Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models Genetics, 181 (4), 1567-1578 DOI: 10.1534/genetics.108.100032

Detecting ancient admixture and estimating demographic parameters in multiple human populations

I read this paper on our way back from Leipzig and then again today to see if I missed anything in the first read through (I was pretty tired at the time).

Detecting ancient admixture and estimating demographic parameters in multiple human populations

Wall, Lohmueller and Plagnol, Mol Biol Evol 26(8):1823-1827

We analyze patterns of genetic variation in extant human polymorphism data from the National Institute of Environmental Health Sciences single nucleotide polymorphism project to estimate human demographic parameters. We update our previous work by considering a larger data set (more genes and more populations) and by explicitly estimating the amount of putative admixture between modern humans and archaic human groups (e.g., Neandertals, Homo erectus, and Homo floresiensis). We find evidence for this ancient admixture in European, East Asian, and West African samples, suggesting that admixture between diverged hominin groups may be a general feature of recent human evolution.

What they do in this paper is to fit a two population coalescent model, with expansion, migration, bottlenecks and the works, to both an African+European and an African+Asian data set, then use this fitted model as a null model of the genetics of the populations.  They then 1) do a test on an LD statistic against this null model, taking rejections of this null model as evidence for admixture from archaic humans, and 2) fit an admixture extension of the model to estimate the level of admixture.  They find evidence for admixture with archaic humans for both data sets, with a somewhat higher degree in the Europeans.

I’m a bit underwhelmed by the paper, I must admit.  I’m not saying that there is no admixture with archaic humans, but this approach does not convince me.

Even when taking various demographic effects into account in the modeling, the null model is unlikely to exactly fit real data.  Taking deviations from the null model as any kind of evidence for admixture thus seems a bit hasty.

Not that I have any better ideas as to how to approach this, just, in my eyes the jury is still out on the question of admixture with archaic humans…

Wall, J., Lohmueller, K., & Plagnol, V. (2009). Detecting Ancient Admixture and Estimating Demographic Parameters in Multiple Human Populations Molecular Biology and Evolution, 26 (8), 1823-1827 DOI: 10.1093/molbev/msp096

Central European farmers did not descend from local hunter-gatherers

This is a truly great ancient DNA study.  I must confess up front that I haven’t actually read the paper, only the review at Dienekes’ Anthropology blog.

The researchers analyzed DNA from hunter-gatherer and early farmer burials, and compared those to each other and to the DNA of modern Europeans. They conclude that there is little evidence of a direct genetic link between the hunter-gatherers and the early farmers, and 82 percent of the types of mtDNA found in the hunter-gatherers are relatively rare in central Europeans today.

So it seems that when farming arrived in Europe some 7000-8000 years ago, it wasn’t the local hunter-gatherers who learned farming, but rather incoming farmers who brought their new technology with them.  Not that they simply invaded and replaced the local people:

We know that people lived in Europe before and after the last big ice age and managed to survive by hunting and gathering. We also know that farming spread into Europe from the Near East over the last 9,000 years, thereby increasing the amount of food that can be produced by as much as 100-fold. But the extent to which modern Europeans are descended from either of those two groups has eluded scientists despite many attempts to answer this question.

Now, a team from Mainz University in Germany, together with researchers from UCL (University College London) and Cambridge, have found that the first farmers in central and northern Europe could not have been the descendents of the hunter-gatherers that came before them. But what is even more surprising, they also found that modern Europeans couldn’t solely be the descendents of either the hunter-gatherer alone, or the first farmers alone, and are unlikely to be a mixture of just those two groups.

The new study confirms what Joachim Burger’s team showed in 2005: that the first farmers were not the direct ancestors of modern Europeans. Burger says “We are still searching for those remaining components of modern European ancestry. European hunter-gatherers and early farmers alone are not enough. But new ancient DNA data from later periods in European prehistory may shed also light on this in the future.”