Posts Tagged ‘population genetics’

CD/CV and Goldstein

Wednesday, September 17th, 2008

Everyone seems to be talking about this NY Times interview with David B. Goldstein (Gene Sherpas, biomarker-driven mental health, Adaptive Complexity, John Hawks …)

In the proud tradition of blogging, I will add my voice to the noise.

The common disease / common variant hypothesis

The arguments concern association mapping and the so-called Common Disease / Common Variant (CD/CV) hypothesis.  The CD/CV hypothesis goes like this: a lot of common diseases are late-onset, so we do not expect selection to be strong against the genetic factors underlying them. This, combined with the recent expansion in the human population, leads us to expect a lot of common diseases to be caused by relatively common variants.

If the hypothesis is true, then we should be able to locate these common variants since we can tag all common variants in the genome with relatively few markers, and we can type these using SNP chips.

If the hypothesis is false, then we are screwed. We probably need complete re-sequencing and some heavy duty statistics to get anywhere.

Out of convenience more than anything, people chose to believe the CD/CV to be true, and started projects such as HapMap to map the common variation in the genome.  Based on this map, companies developed chips to tag all variation genome wide, and disease studies used these chips to do genome wide scans.

Goldstein argues:

It takes large, expensive trials with hundreds of patients in different countries to find even common variants behind a disease. Rare variants lie beyond present reach. “It’s an astounding thing,” Dr. Goldstein said, “that we have cracked open the human genome and can look at the entire complement of common genetic variants, and what do we find? Almost nothing. That is absolutely beyond belief.”

If rare variants account for most of the genetic burden of disease, then the idea of decoding everyone’s genome to see to what diseases they are vulnerable to will not work, at least not in the form envisaged. “I don’t believe we should do more and more genomewide association studies for common diseases,” Dr. Goldstein said. Instead, he suggested, the “missing heritability” might be tracked by thoroughly studying the genome of specific patients.

I would say the jury is still out on this one, but it is clear that the CD/CV hypothesis does not hold as widely as it was hyped to.  We can only explain a small percentage of the heritability of diseases with the variants found so far.  Still, we have discovered more replicable variants within the last year or two than in all the time before genome wide scans, so writing off genome wide association studies completely is a bit extreme, in my view.

No, CD/CV is not the full story, but some common variants exist, because we have found them!

The real question is, of course, how much heritability is explained by common variants and how much by rare variants.  Right now, we simply do not know.  The power to detect even common variants is limited, so there might be more out there to find.  On the other hand, it is hard to believe that the vast majority of the heritability is caused by common variants since we still can only explain very little of it, so some rare variants must be involved.

In the coming few years we will probably figure this out, and that is exciting indeed.

Common disease and selection

Now as for variants behind common diseases being selectively (near) neutral — part of why they can be common in the first place — that is an interesting question.

I personally think that selection is playing a larger part in the story of common diseases than we think, and I look forward to learning this story.

Are we seeing common variants because bottlenecks have reduced selection strength so rare variants — otherwise selected against –  have managed to increase in frequency by drift? Are we seeing common variants because they are selected for by some balancing selection? Are they hitch-hiking  on beneficial variants?

We are already hearing about interesting findings here (Helgason et al. 2007, Blekhman et al. 2008) and we will learn much more in the future.

We live in interesting times indeed, and now is not the time to abandon genome wide association studies.

How do you calibrate the molecular clock?

Thursday, May 29th, 2008

How do you calibrate the molecular clock — where you need a few known sequence divergence times — when you only know a few speciation times?

Yesterday at a meeting (I’m not sure I can tell you which meeting; I’m not sure how open it is supposed to be :-/) we discussed the divergence time of human-orangutan and human-macaque. We need the sequence divergence time to calibrate a CoalHMM model for figuring out some speciation and population genetics parameters of ancestral species.

No definitive answer came up at the meeting, but there was a short discussion by email after the meeting. This paper was sent around, where the divergence times were estimated to 25MYA for human-macaque and 13MYA for human-orangutan, although the last of those numbers is actually the calibration point used in the analysis, so it is an assumption more than an estimate.

The problem is, the 13MYA used for the calibration is based on fossil evidence, and as far as I can see, that would make it an estimate for the speciation time between human and orangutan. We need the sequence divergence time. Speciation time and divergence time can differ by millions of years (if the effective population size is large enough).

If 13MYA is the divergence time between human and orangutan, we get a speciation time that is unrealistically recent.  If the divergence time is 18MYA instead, as we assumed in this paper, we would get a speciation time around 12MYA which would match the MBE paper.
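To make the gap between the two kinds of time concrete, here is a minimal sketch. It assumes the textbook coalescent expectation that two lineages in a constant-size, randomly mating diploid ancestral population take 2N generations on average to find a common ancestor. The generation time and the function names are my own illustrative assumptions, not values taken from either paper, and a CoalHMM estimates far more than this.

```python
def expected_divergence_mya(speciation_mya, ancestral_ne, generation_years):
    """Expected sequence divergence time in MYA: the speciation time plus the
    mean coalescence time (2 * Ne generations) in the ancestral population."""
    coalescence_years = 2 * ancestral_ne * generation_years
    return speciation_mya + coalescence_years / 1e6


def implied_ancestral_ne(divergence_mya, speciation_mya, generation_years):
    """Ancestral effective population size implied by a divergence/speciation gap."""
    gap_years = (divergence_mya - speciation_mya) * 1e6
    return gap_years / (2 * generation_years)


GENERATION_YEARS = 20  # assumption, for illustration only

# The 18 MYA divergence and 12 MYA speciation pair mentioned in the text.
ne = implied_ancestral_ne(18, 12, GENERATION_YEARS)
print(f"implied ancestral Ne: {ne:,.0f}")

# Going the other way: speciation time plus mean coalescence time.
print(f"expected divergence: {expected_divergence_mya(12, ne, GENERATION_YEARS)} MYA")
```

The implied population size depends entirely on the assumed generation time, so read it as an order-of-magnitude illustration of why a few million years of difference is plausible, nothing more.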

But how do you figure out the divergence time needed to calibrate the clock?  Is there any way to get it, rather than the speciation time, from fossil evidence?

For our purposes, I suppose we can just as well work with speciation times for our calibration, but not everyone is using CoalHMMs for their analysis, so how do you deal with this problem?

More on adapting to climate

Sunday, March 9th, 2008

In a previous post I mentioned this review of a recent PLoS Genetics paper. I still haven’t read the actual paper, I am ashamed to admit, but we will read it for a journal club at BiRC next week. Anyway, this morning I read another interesting review, again at Genetic Future: Climate genes: positive or balancing selection?

I should probably bring it to our journal club.

The point of this review is that, under positive selection, we would only expect a linear relationship between climate and gene frequency if we ignored the history of the human diaspora out of Africa. The time during which selection has affected the genes varies a lot, from South East Asia, where humans arrived early, to South America, where humans arrived late.

Would this type of selection actually result in a neat linear trend, like that seen for the RAPTOR gene? Well, it might, if the timing was just right, but it’s by no means a necessary outcome. There are at least three variables in play here, each of which will have some effect on the current frequency of a positively selected allele: the strength of selection, the starting frequency of the allele in that population, and the amount of time the population has existed in its current environment. For positive selection to result in a clean linear correlation between allele frequency and a climate variable, the latter two factors would have to have had a negligible impact, so that most of the variation is determined by selection intensity.

I think that’s pretty unlikely given what we know about human population history: native Americans, for instance, are the descendants of a cold-adapted population living in Siberia that only relatively recently moved down into the warmer climates of central America; selection has not yet had much time to act in these populations. In contrast, humans in Southern Asia have been in their current climate much longer, giving selection more time to do its work. Thus for variants under positive selection, current frequency will be substantially affected by historical contingencies, and the correlation between allele frequency and selective strength will be rough at best.
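The three variables in the quoted argument can be put into a toy calculation. This is only a sketch under a deliberately simple assumption: deterministic genic selection, where the odds of the favoured allele grow by a factor (1 + s) every generation, with no drift or migration. The selection coefficients, starting frequencies and generation counts are hypothetical numbers I made up for illustration.

```python
def frequency_after(p0, s, generations):
    """Allele frequency after `generations` of deterministic genic selection.

    Each generation the odds p / (1 - p) are multiplied by (1 + s).
    """
    odds = p0 / (1 - p0) * (1 + s) ** generations
    return odds / (1 + odds)


# Hypothetical populations:
# (description, selection coefficient, starting frequency, generations in environment)
populations = [
    ("long-settled, weak selection", 0.002, 0.05, 2000),
    ("recently arrived, strong selection", 0.004, 0.05, 500),
    ("recently arrived, strong selection, high start", 0.004, 0.30, 500),
]

for description, s, p0, generations in populations:
    print(f"{description}: {frequency_after(p0, s, generations):.2f}")
```

In this toy setup the population under weaker selection ends up with the higher frequency simply because it has had more time, and the starting frequency matters just as much, which is the point of the quoted argument.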

There’s also a reference to Voight et al. 2007, a paper I reviewed a few weeks back, on signals of selection in humans, based on extended haplotypes around genes under positive selection. Apparently, these signals are missing for the climate genes.

There is an alternative to positive selection, that could also explain the association between climate and gene frequency:

You’ve probably already guessed my hypothesis: at least some of the genes pulled out from this study (and probably the ones with the tightest correlations) have been the targets of balancing selection. Balancing selection could be acting on climate genes in different ways, but in my mind the most likely mechanism is via heterozygote advantage.

This model would result in each population reaching a stable allele frequency that is correlated with the local temperature, regardless of its starting frequency and how long the population had been subjected to that particular environment – so long as there has been enough time for the population to reach equilibrium. This scenario is much more likely to result in a linear correlation between allele frequency and climate variables than a simple positive selection model.
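The heterozygote advantage model can be sketched in a few lines. I assume the standard one-locus setup with relative fitnesses 1 - s for one homozygote, 1 for the heterozygote and 1 - t for the other homozygote, which gives a stable equilibrium frequency of t / (s + t). The recursion is deterministic (no drift, no migration), and the values of s and t and the function names are illustrative assumptions, not anything from the paper or the review.

```python
def next_frequency(p, s, t):
    """One generation of selection with fitnesses AA: 1 - s, Aa: 1, aa: 1 - t.

    `p` is the frequency of allele A before selection.
    """
    q = 1 - p
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    return (p * p * (1 - s) + p * q) / mean_fitness


def generations_to_equilibrium(p0, s, t, tolerance=0.01, max_generations=100_000):
    """Generations until the frequency is within `tolerance` of t / (s + t).

    Returns None if that does not happen within `max_generations`.
    """
    target = t / (s + t)
    p = p0
    for generation in range(max_generations + 1):
        if abs(p - target) < tolerance:
            return generation
        p = next_frequency(p, s, t)
    return None


# Different starting frequencies end up at the same equilibrium;
# stronger selection gets there in fewer generations.
for s, t in [(0.01, 0.01), (0.1, 0.1)]:
    for p0 in (0.1, 0.9):
        n = generations_to_equilibrium(p0, s, t)
        print(f"s={s}, t={t}, start={p0}: within 0.01 of {t / (s + t):.2f} after {n} generations")
```

If s and t vary with climate, the equilibrium t / (s + t) varies with climate too, whatever the starting frequency. The number of generations needed still depends strongly on how strong the selection is.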

Now, what I am wondering is, how strong does the selection have to be for the allele frequency to reach equilibrium in the late arrival populations (say South Americans), and what kind of signals would we look for in the genome to test if balancing, rather than positive, selection is going on?

I really don’t know — I am too new to genetics to even have a clue — but I bet that this is old stuff in the genetics literature. I’ll have to ask around…

Estimating local ancestry

Tuesday, February 26th, 2008

ResearchBlogging.org

When two populations A and B meet and start to mix, the resulting population will — for the first many generations, at least — be a mix of the two original populations. It is not that each individual will belong to one of the original populations and that the mixed population will consist of such “original population” A or B individuals. At least not after a few generations. Instead, each individual will be a mix of population A and B.

For several generations following the merge of populations A and B, mutations will not change the genes in the mixed population much. It takes a long time for mutations to accumulate. Each gene in the population will essentially be unchanged compared to the gene in one (or both) of the ancestor populations. Here, by gene, I simply mean a “chunk” of DNA, not necessarily a functional bit, so don’t read too much into it. In any case, the genes will not change much, but will look like genes from A or from B, where of course genes from A can differ significantly from genes from B.

Recombination will shuffle the genes from A and B around, however. If a “mainly A” individual mates with a “mainly B” individual, the offspring will inherit both A and B genes in some combination. As you scan along a chromosome, the ancestral population will change back and forth between A genes and B genes.

Just based on samples from the present day chromosomes, can we infer the local ancestry, i.e. which chunks of each chromosome came from A and which came from B? In this month’s issue of American Journal of Human Genetics, there is a paper that addresses this exact problem:

Estimating Local Ancestry in Admixed Populations
Sankararaman et al.
The American Journal of Human Genetics 82(2) 290-303

Abstract

Large-scale genotyping of SNPs has shown a great promise in identifying markers that could be linked to diseases. One of the major obstacles involved in performing these studies is that the underlying population substructure could produce spurious associations. Population substructure can be caused by the presence of two distinct subpopulations or a single pool of admixed individuals. In this work, we focus on the latter, which is significantly harder to detect in practice. New advances in this research direction are expected to play a key role in identifying loci that are different among different populations and are still associated with a disease. We evaluated current methods for inference of population substructure in such cases and show that they might be quite inaccurate even in relatively simple scenarios. We therefore introduce a new method, LAMP (Local Ancestry in adMixed Populations), which infers the ancestry of each individual at every single-nucleotide polymorphism (SNP). LAMP computes the ancestry structure for overlapping windows of contiguous SNPs and combines the results with a majority vote. Our empirical results show that LAMP is significantly more accurate and more efficient than existing methods for inferring locus-specific ancestries, enabling it to handle large-scale datasets. We further show that LAMP can be used to estimate the individual admixture of each individual. Our experimental evaluation indicates that this extension yields a considerably more accurate estimate of individual admixture than state-of-the-art methods such as STRUCTURE or EIGENSTRAT, which are frequently used for the correction of population stratification in association studies.

Inferring local ancestry

The method in this paper makes a few simplifying assumptions that make it computationally efficient to run.

They assume that the samples considered are from a mix of populations that contributed to the sample in known frequencies — e.g. that population A contributed with 80% and B with 20% — a known number of generations ago. The method is not that sensitive to knowing exactly the fractions of populations or the number of generations — and using other methods you can infer these parameters anyway — but assuming that you know these parameters helps in the mathematics of the method.

A more important assumption is that you can split the chromosomes into sliding windows where recombinations do not occur inside the windows. This is obviously not correct, but it helps the method a lot and is not as silly as it sounds.

If you can split the chromosomes into sliding windows of a fixed length and then infer which chromosomes belong to which ancestral population, the inference problem is much easier to solve. If you then slide this window along the chromosomes and assign populations to the chromosomes in each window, each nucleotide will belong to different populations depending on the window considered.

The same nucleotide will belong to different populations depending on the window. Is this a problem? Yes and no.

For the windows that overlap a given nucleotide, the method takes a vote, and the majority decides which population the nucleotide “really” belongs to. That way you get a unique population per nucleotide.
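The voting step can be sketched like this. The sketch only covers the vote itself: it assumes each window has already been assigned a population label by some other procedure, and it says nothing about how LAMP makes that assignment. The data layout, the function name and the example labels are hypothetical.

```python
from collections import Counter


def majority_vote(window_calls, n_sites):
    """Combine per-window ancestry calls into one call per site.

    `window_calls` is a list of (start, end, label) tuples with `end` exclusive.
    Ties go to the label seen first, which is an arbitrary choice.
    Sites not covered by any window get None.
    """
    votes = [Counter() for _ in range(n_sites)]
    for start, end, label in window_calls:
        for site in range(start, min(end, n_sites)):
            votes[site][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Windows of length 3, sliding one site at a time along 7 sites.
calls = [
    (0, 3, "A"),
    (1, 4, "A"),
    (2, 5, "A"),
    (3, 6, "B"),
    (4, 7, "B"),
]
per_site = majority_vote(calls, n_sites=7)
print("".join(per_site))
```

Sites near the switch from A to B are covered by windows that disagree, and the vote settles each of them one way or the other.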

This is a pretty good idea. This way you get the fast computation of the ancestral population inference and on average you assign the right population to the nucleotides.

You will not be able to accurately infer the break-points where the ancestry switches between populations, but for most applications that is not that important in the first place. You want to assign the nucleotide to the right population on average, and this is what you achieve this way.

Relevance for association mapping

The motivation for the paper is association mapping, where you compare the frequency of alleles between cases and controls for a given disease, looking for markers where the frequency is different between cases and controls. Such markers are potential candidates for disease genes: if one allele is more frequent in cases than in controls, maybe it is more frequent because it increases the risk of the disease.

If your samples are from a population that is a mix of different ancestral populations, there is a high risk of biases. There is a bias in the sampling: if the ancestral populations are sampled in different ratios for cases than controls, you will pick up differences in cases and controls just because of that.

There are obvious situations where this can happen. If you sample cancer patients from an expensive clinic (rich white Americans) and the controls from the ER (with a higher ratio of African Americans), for example, you get a different ratio of African descent and European descent individuals in cases and controls.

If you do not correct for this, you are mapping “ancestral population” genes instead of disease genes.
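A small worked example shows how this bias arises without any disease effect at all. The allele frequencies and sampling ratios below are made-up numbers, and the function name is hypothetical. The allele is assumed to have no influence on the disease in either ancestral population.

```python
def pooled_allele_frequency(ancestry_fractions, allele_frequencies):
    """Allele frequency in a sample drawn from several ancestral populations.

    `ancestry_fractions` are the sampling proportions (summing to 1) and
    `allele_frequencies` the allele frequency in each ancestral population.
    """
    return sum(f * p for f, p in zip(ancestry_fractions, allele_frequencies))


# Hypothetical allele frequencies in ancestral populations A and B.
allele_freq = (0.10, 0.40)

# Cases and controls sampled with different ancestry ratios (hypothetical).
cases_mix = (0.9, 0.1)
controls_mix = (0.6, 0.4)

freq_cases = pooled_allele_frequency(cases_mix, allele_freq)
freq_controls = pooled_allele_frequency(controls_mix, allele_freq)
print(f"cases: {freq_cases:.2f}, controls: {freq_controls:.2f}")
```

The frequencies differ between cases and controls only because the two groups were sampled with different ancestry ratios. With equal ratios the difference disappears.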

Comparison to CoalHMMs

I didn’t actually read the paper with association mapping in mind — although the problem is extremely relevant for such studies. I do association mapping in Icelanders, so it is not that important for my own work, though.

I read it for a meeting in our “coalescent hidden Markov model” group.

With our CoalHMMs, we try to learn about speciation events. When there is lineage sorting in the speciation (as, for example, between humans, chimps and gorillas), the nearest neighbour species of a chromosome changes along the chromosome: in some regions humans are more closely related to chimps, in others more closely related to gorillas.

This setting is different than the population mixing problem. For one thing, we are not dealing with different ancestral populations mixing, but rather populations splitting up to become separate species. Still, scanning along the chromosomes and inferring which phylogeny each nucleotide belongs to is similar to the problem here.


SANKARARAMAN, S., SRIDHAR, S., KIMMEL, G., HALPERIN, E. (2008). Estimating Local Ancestry in Admixed Populations. The American Journal of Human Genetics, 82(2), 290-303. DOI: 10.1016/j.ajhg.2007.09.022

More on worldwide and genomewide variation…

Saturday, February 23rd, 2008

Just to finish the trilogy (the three papers examining genome wide polymorphism in this week’s Nature and Science), I should mention Li et al.’s Science paper, covering essentially the same ground as the Jakobsson et al. paper I just reviewed.

Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation

Li et al.

Abstract

Human genetic diversity is shaped by both demographic and biological factors and has fundamental implications for understanding the genetic basis of diseases. We studied 938 unrelated individuals from 51 populations of the Human Genome Diversity Panel at 650,000 common single-nucleotide polymorphism loci. Individual ancestry and population substructure were detectable with very high resolution. The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. In addition, we observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions. This data set allows the most comprehensive characterization to date of human genetic variation.

The results do not differ that much from Jakobsson et al. but the analysis is different.

First, they use a maximum likelihood method to cluster the sampled individuals into K unknown “ancestral clusters” and consider the clustering obtained with different Ks. For increasing Ks, the individuals cluster into smaller and smaller groupings, indicating their relatedness compared to the whole sample.

Once K is high enough (K=7), the populations mainly cluster together, with most populations being derived from the same single cluster but with some populations (Middle Easterners and South/Central Asians) being a mix of the ancestral clusters.

They then construct a maximum likelihood phylogeny for the populations and find that it fits nicely with the Out of Africa model.

Considering haplotype heterozygosity, they observe that heterozygosity decreases with distance from East Africa, similar to what Jakobsson et al. report.
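For readers who have not met the statistic, haplotype heterozygosity is the probability that two haplotypes drawn at random from a population differ. A minimal sketch, assuming the usual sample-size corrected estimator, n / (n - 1) times one minus the sum of squared haplotype frequencies. Whether Li et al. compute it exactly this way I cannot say from the abstract, and the example haplotypes are made up.

```python
from collections import Counter


def haplotype_heterozygosity(haplotypes):
    """Sample-size corrected probability that two sampled haplotypes differ."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    homozygosity = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1 - homozygosity)


# Hypothetical samples from a diverse and a less diverse population.
diverse = ["ACG", "ACG", "ATG", "TTG"]
uniform = ["ACG", "ACG", "ACG", "ACG"]
print(haplotype_heterozygosity(diverse))
print(haplotype_heterozygosity(uniform))
```

A serial founder effect predicts that each founding step loses some haplotypes, so this number should drop the further a population is from the origin.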


Li, J.Z., Absher, D.M., Tang, H., Southwick, A.M., Casto, A.M., Ramachandran, S., Cann, H.M., Barsh, G.S., Feldman, M., Cavalli-Sforza, L.L., Myers, R.M. (2008). Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science, 319(5866), 1100-1104. DOI: 10.1126/science.1153717