Posts Tagged ‘population genetics’

Worldwide, genomewide patterns of variation

Saturday, February 23rd, 2008

ResearchBlogging.org

Another interesting paper in Wednesday’s Nature, by Jakobsson et al., concerns worldwide patterns of genetic variation. Again I refer to John Hawks’ blog for a human evolution perspective. Wired also has a nice discussion of the results (together with the Lohmueller et al. paper I just reviewed and a Science paper that I haven’t read yet).

Genotype, haplotype and copy-number variation in worldwide human populations

Jakobsson et al.

Nature 451, 998-1003

Abstract

Genome-wide patterns of variation across individuals provide a powerful source of data for uncovering the history of migration, range expansion, and adaptation of the human species. However, high-resolution surveys of variation in genotype, haplotype and copy number have generally focused on a small number of population groups. Here we report the analysis of high-quality genotypes at 525,910 single-nucleotide polymorphisms (SNPs) and 396 copy-number-variable loci in a worldwide sample of 29 populations. Analysis of SNP genotypes yields strongly supported fine-scale inferences about population structure. Increasing linkage disequilibrium is observed with increasing geographic distance from Africa, as expected under a serial founder effect for the out-of-Africa spread of human populations. New approaches for haplotype analysis produce inferences about population structure that complement results based on unphased SNPs. Despite a difference from SNPs in the frequency spectrum of the copy-number variants (CNVs) detected—including a comparatively large number of CNVs in previously unexamined populations from Oceania and the Americas—the global distribution of CNVs largely accords with population structure analyses for SNP data sets of similar size. Our results produce new inferences about inter-population variation, support the utility of CNVs in human population-genetic research, and serve as a genomic resource for human-genetic studies in diverse worldwide populations.

This paper uses ~500K single nucleotide polymorphism (SNP) markers and ~400 copy number variable (CNV) markers in 29 populations. From this, they construct neighbour-joining trees using SNP frequencies, inferred haplotypes or CNVs and compare the trees with the geographical location of the populations.

Considering differentiation (the FST statistic) between populations, they observe the expected increase in differentiation between East Africans and other populations as a function of geographical distance from East Africa (see Fig. 2 in the paper). From what we know from previous studies, there is very little surprise here.
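To make the FST statistic a bit more concrete, here is a minimal sketch of Wright's FST for a single biallelic SNP in two populations. The function name and the equal weighting of the two populations are my own assumptions for illustration; the estimator used in the paper may well differ (for example by correcting for sample size).

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Wright's F_ST for one biallelic SNP from the allele frequencies
    p1 and p2 in two populations (equally weighted, an assumption)."""
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if the two populations were pooled.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Average expected heterozygosity within the populations.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # the site is not variable at all
    return (h_total - h_within) / h_total


# Identical frequencies give no differentiation; the further apart the
# frequencies are, the closer F_ST gets to 1.
for p1, p2 in [(0.5, 0.5), (0.4, 0.6), (0.2, 0.8), (0.0, 1.0)]:
    print(p1, p2, fst_two_populations(p1, p2))
```

Averaged over many SNPs, a value like this is what gets plotted against geographical distance.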

They then consider linkage disequilibrium (LD) in some detail, based on both individual SNPs and inferred haplotypes (using an extension of the FastPHASE algorithm, as far as I understand the paper, but I haven’t checked the supplemental material), and show increased LD as a function of geographical distance from Africa, once again confirming the Out of Africa expansion of humans (see Fig. 2 in the paper).
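For readers who have not worked with LD before, here is a small sketch of one common pairwise measure, r squared, computed from phased haplotypes at two biallelic SNPs. The data layout (a list of 0/1 pairs) and the function name are assumptions made for the example; I have not checked which LD summaries the paper uses.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    `haplotypes` is a list of (a, b) pairs, one per chromosome, where a and
    b are 0 or 1 and give the allele carried at the first and second SNP.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    # D measures how far the haplotype frequency is from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        return 0.0  # at least one of the sites is monomorphic
    return d * d / denom


perfect_ld = [(1, 1), (1, 1), (0, 0), (0, 0)]
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r_squared(perfect_ld), ld_r_squared(no_ld))
```

A population that has recently been through a founder event carries fewer distinct haplotypes, which pushes values like this up.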

The only really surprising discovery in this paper is that CNV variation is higher in Oceanian and American populations, whereas variation in general decreases with distance from Africa (as the SNP analysis in this paper also confirms). I did not find an explanation for this in the paper, and I cannot think of a good explanation myself. We don’t really know that much about CNV polymorphism yet, at least not compared to SNP variation, so perhaps there are some interesting discoveries waiting for us here?


Jakobsson, M., Scholz, S.W., Scheet, P., Gibbs, J.R., VanLiere, J.M., Fung, H., Szpiech, Z.A., Degnan, J.H., Wang, K., Guerreiro, R., Bras, J.M., Schymick, J.C., Hernandez, D.G., Traynor, B.J., Simon-Sanchez, J., Matarin, M., Britton, A., van de Leemput, J., Rafferty, I., Bucan, M., Cann, H.M., Hardy, J.A., Rosenberg, N.A., Singleton, A.B. (2008). Genotype, haplotype and copy-number variation in worldwide human populations. Nature, 451(7181), 998-1003. DOI: 10.1038/nature06742

Harmful mutations in Europeans and Africans

Saturday, February 23rd, 2008


What I wanted to blog about yesterday, but didn’t get around to, as I explained in the previous post, was two letters in the latest issue of Nature on human variation and the distribution of deleterious mutations. I’ll split it into two posts; in this post I’ll discuss Lohmueller et al. Genetic Future beat me to it, so I suggest you also read the discussion there. The paper is also covered in the latest Nature Podcast and commented on at Nature. For a human evolution perspective, read John Hawks’ post on the topic.

Proportionally more deleterious genetic variation in European than in African populations

Lohmueller et al.

Abstract

Quantifying the number of deleterious mutations per diploid human genome is of crucial concern to both evolutionary and medical geneticists. Here we combine genome-wide polymorphism data from PCR-based exon resequencing, comparative genomic data across mammalian species, and protein structure predictions to estimate the number of functionally consequential single-nucleotide polymorphisms (SNPs) carried by each of 15 African American (AA) and 20 European American (EA) individuals. We find that AAs show significantly higher levels of nucleotide heterozygosity than do EAs for all categories of functional SNPs considered, including synonymous, non-synonymous, predicted ‘benign’, predicted ‘possibly damaging’ and predicted ‘probably damaging’ SNPs. This result is wholly consistent with previous work showing higher overall levels of nucleotide variation in African populations than in Europeans. EA individuals, in contrast, have significantly more genotypes homozygous for the derived allele at synonymous and non-synonymous SNPs and for the damaging allele at ‘probably damaging’ SNPs than AAs do. For SNPs segregating only in one population or the other, the proportion of non-synonymous SNPs is significantly higher in the EA sample (55.4%) than in the AA sample (47.0%; P < 2.3 × 10^-37). We observe a similar proportional excess of SNPs that are inferred to be ‘probably damaging’ (15.9% in EA; 12.1% in AA; P < 3.3 × 10^-11). Using extensive simulations, we show that this excess proportion of segregating damaging alleles in Europeans is probably a consequence of a bottleneck that Europeans experienced at about the time of the migration out of Africa.

In this paper, the authors compare the genetic variability in Americans of African descent and of European descent, classify the variants according to their estimated effect on fitness, and examine how the “fitness” of the variants differs between the two populations.

Classifying variations and comparing the populations

Using genome-wide exon re-sequencing, the authors identified SNP variation in the sample and compared with the chimpanzee genome to infer ancestral and derived alleles. Ignoring for a bit the effect of mutations, just from knowing the variations and which alleles are ancestral and derived, we can learn about the history of the populations.

First off, we can consider the variation within the populations. Are there more variable sites in one population than in the other? Is there more heterozygosity (meaning, are people more likely to carry two different alleles) in one population or the other?

The results in the paper confirm previous studies showing that there is more variability in individuals of African descent than in those of European descent, matching the Out of Africa hypothesis. If humans originated in Africa (which everything indicates, and I doubt anyone disagrees with it any more) and populations outside Africa are relatively recent, then we expect the variability in Africa to be greater than outside Africa. A small population branching off a larger one will only carry some of the variants with it, and it takes time for this to level out.

The SNPs can be classified into two categories: synonymous SNPs (those that do not change the amino acid the gene codes for) and non-synonymous SNPs (those that do). Roughly speaking, we expect the non-synonymous mutations to have an effect on fitness but not the synonymous ones. This is very rough, however, since synonymous mutations can have major effects on regulation, splicing, etc., but still…

Using bioinformatics methods, the authors classify non-synonymous mutations into deleterious and non-deleterious mutations based on protein structure and conservation. They then observe that the deleterious mutations are proportionally more frequent in individuals of European descent.

Why is this an expected result?

To understand why this is the case, we turn to population genetics.

We expect deleterious mutations to be removed, or at least kept down in frequency, by selection, but there is a certain stochasticity in this. The frequency of an allele varies somewhat randomly in a population. Offspring will inherit one allele or the other with equal probability and pass that allele on to their offspring with equal probability. With no selection acting on the allele, the frequency will shrink or grow randomly until the allele is either fixed in the population or lost completely. When selection is acting on the allele, the number of offspring will depend on the alleles an individual carries. There is still randomness, but the distribution of the number of offspring will change, more or less, depending on the strength of the selection.

How does this explain that there are more deleterious mutations in Europeans, then? This has to do with how stochastic the process really is.

Generally in stochastic processes, when we consider small numbers the variance in the process is larger than when we consider larger numbers. For very large numbers, a stochastic process can behave almost deterministically, while for very small numbers the process can appear completely random.

A consequence of this is that weak selection requires a large population to have any observable effect over the background randomness of the process. The weaker the selection, the larger the population needs to be for the selection to have any effect.

If a population goes through a bottleneck, as the non-African populations are thought to have done, the selection that would act on the African population would have little effect on the non-African populations. Mutations that are selected against in the African population will not have been selected against in the non-African populations, simply because the selection wasn’t strong enough to have any effect in the smaller populations.
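The argument can be illustrated with a toy simulation. The sketch below follows a weakly deleterious allele in a haploid Wright-Fisher population, once with a small and once with a larger population size. Everything here (the haploid model, the function names, the selection coefficient of 0.02, the population sizes of 50 and 500) is my own choice for illustration and has nothing to do with the simulations in the paper.

```python
import random
from statistics import mean, pvariance


def simulate_deleterious_allele(pop_size, s, p0, generations, rng):
    """Final frequency of an allele with fitness 1 - s (against 1 for the
    alternative allele) in a haploid Wright-Fisher population."""
    p = p0
    for _ in range(generations):
        if p in (0.0, 1.0):
            break  # the allele is lost or fixed
        # Deterministic change caused by selection.
        p_sel = p * (1.0 - s) / (p * (1.0 - s) + (1.0 - p))
        # Random sampling of the next generation (genetic drift).
        count = sum(1 for _ in range(pop_size) if rng.random() < p_sel)
        p = count / pop_size
    return p


def final_frequencies(pop_size, replicates=100, s=0.02, p0=0.2,
                      generations=100, seed=1):
    rng = random.Random(seed)
    return [simulate_deleterious_allele(pop_size, s, p0, generations, rng)
            for _ in range(replicates)]


small = final_frequencies(pop_size=50)
large = final_frequencies(pop_size=500)
print("N=50:  mean", mean(small), "variance", pvariance(small))
print("N=500: mean", mean(large), "variance", pvariance(large))
```

In the larger population the replicates stay close together and the allele is steadily pushed down in frequency, while in the small population the outcome is dominated by chance and the spread between replicates is much larger.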

The paper finishes with a simulation study showing that a bottleneck following the migration out of Africa, followed by a population expansion, gives the observed pattern of variation, nicely confirming this.


Lohmueller, K.E., Indap, A.R., Schmidt, S., Boyko, A.R., Hernandez, R.D., Hubisz, M.J., Sninsky, J.J., White, T.J., Sunyaev, S.R., Nielsen, R., Clark, A.G., Bustamante, C.D. (2008). Proportionally more deleterious genetic variation in European than in African populations. Nature, 451(7181), 994-997. DOI: 10.1038/nature06611

Estimating parameters of speciation models

Monday, February 18th, 2008

Another paper that addresses the speciation process in apes is:

A new approach to estimate parameters of speciation models with application to apes

Becquet and Przeworski

Genome Research 17:1505-1519

Abstract

How populations diverge and give rise to distinct species remains a fundamental question in evolutionary biology, with important implications for a wide range of fields, from conservation genetics to human evolution. A promising approach is to estimate parameters of simple speciation models using polymorphism data from multiple loci. Existing methods, however, make a number of assumptions that severely limit their applicability, notably, no gene flow after the populations split and no intralocus recombination. To overcome these limitations, we developed a new Markov chain Monte Carlo method to estimate parameters of an isolation-migration model. The approach uses summaries of polymorphism data at multiple loci surveyed in a pair of diverging populations or closely related species and, importantly, allows for intralocus recombination. To illustrate its potential, we applied it to extensive polymorphism data from populations and species of apes, whose demographic histories are largely unknown. The isolation-migration model appears to provide a reasonable fit to the data. It suggests that the two chimpanzee species became reproductively isolated in allopatry ~850 Kya, while Western and Central chimpanzee populations split ~440 Kya but continued to exchange migrants. Similarly, Eastern and Western gorillas and Sumatran and Bornean orangutans appear to have experienced gene flow since their splits ~90 and over 250 Kya, respectively.

In this paper they develop a method to infer the coalescence parameters in a model that is essentially a population split with migration.

The effective population sizes, the Ns, tell us something about the diversity of the species (where NA tells us about the ancestral species). The split time, T, gives us the speciation time, and the migration parameter, m, tells us something about the way the speciation occurred (an allopatric vs. parapatric model).

As usual for coalescence models, the full likelihood of the parameters is computationally demanding to compute, so the authors use summary statistics instead (somewhat like an Approximate Bayesian Computation (ABC) method, if you can call it that when you want to match the summaries exactly) and then develop a Markov Chain Monte Carlo (MCMC) method to sample from the likelihood function over the summary statistics.
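To show what "matching the summaries exactly" means, here is a toy sketch that uses plain rejection sampling rather than the MCMC the authors develop. The model is deliberately trivial and entirely made up for the example: the parameter is a success probability, the "data" are 20 trials, and the summary statistic is the number of successes. Proposals are drawn from a uniform prior and kept only if the simulated summary equals the observed one.

```python
import random


def simulate_summary(p, n_trials, rng):
    """Simulate data under parameter p and reduce it to one summary
    statistic: the number of successes in n_trials."""
    return sum(1 for _ in range(n_trials) if rng.random() < p)


def abc_exact_match(observed, n_trials, n_proposals=20000, seed=1):
    """Rejection sampling that keeps a proposed parameter only when the
    simulated summary matches the observed summary exactly."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        p = rng.random()  # draw from a uniform prior on [0, 1)
        if simulate_summary(p, n_trials, rng) == observed:
            accepted.append(p)
    return accepted


posterior_sample = abc_exact_match(observed=6, n_trials=20)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
print(len(posterior_sample), "accepted; posterior mean", posterior_mean)
```

For this toy model the exact posterior is known (a Beta distribution with mean 7/22), so the accepted sample can be checked against it. With many summaries from many loci, exact matches become far too rare for naive rejection to work, which is presumably why a cleverer sampling scheme is needed.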

Based on this model, they then estimate speciation times for sub-species of chimps, gorillas and orangutans.


Citation for Research Blogging: Becquet, C., Przeworski, M. (2007). A new approach to estimate parameters of speciation models with application to apes. Genome Research, 17(10), 1505-1519. DOI: 10.1101/gr.6409707

A map of recent selection in humans

Saturday, February 16th, 2008


I am currently involved in a study where we have a gene showing both disease association and high differentiation between Africans and Europeans/Asians (as far as we can see from HapMap data). Sorry, I cannot give more details right now.

Anyway, because of this study I finally got around to reading this paper:

A Map of Recent Positive Selection in the Human Genome

Voight BF, Kudaravalli S, Wen X, Pritchard JK.
PLoS Biology 2006 4(3): e72 doi:10.1371/journal.pbio.0040072

Abstract

The identification of signals of very recent positive selection provides information about the adaptation of modern humans to local conditions. We report here on a genome-wide scan for signals of very recent positive selection in favor of variants that have not yet reached fixation. We describe a new analytical method for scanning single nucleotide polymorphism (SNP) data for signals of recent selection, and apply this to data from the International HapMap Project. In all three continental groups we find widespread signals of recent positive selection. Most signals are region-specific, though a significant excess are shared across groups. Contrary to some earlier low resolution studies that suggested a paucity of recent selection in sub-Saharan Africans, we find that by some measures our strongest signals of selection are from the Yoruba population. Finally, since these signals indicate the existence of genetic variants that have substantially different fitnesses, they must indicate loci that are the source of significant phenotypic variation. Though the relevant phenotypes are generally not known, such loci should be of particular interest in mapping studies of complex traits. For this purpose we have developed a set of SNPs that can be used to tag the strongest ~250 signals of recent selection in each population.

I knew of the results already from a talk by Jonathan Pritchard that I attended this summer, but I hadn’t read the paper until now.

The idea is pretty neat: by looking at the haplotypes around a SNP, and how they break down with distance from the SNP, you can spot which SNPs have changed rapidly from low frequency to higher frequency and these SNPs are candidates for being under selection.

This is illustrated nicely in Figure 1 from the paper:

Break-down of haplotypes around a SNP

 

A) Decay of haplotypes in a single region in which a new selected allele (red, center column) is sweeping to fixation, replacing the ancestral allele (blue). Horizontal lines are haplotypes; SNP positions are marked below the haplotype plot using blue for SNPs with intermediate allele frequencies (minor allele >0.2), and red otherwise. For a given SNP, adjacent haplotypes with the same color carry identical genotypes everywhere between that SNP and the central (selected) site. The left- and right-hand sides are sorted separately. Haplotypes are no longer plotted beyond the points at which they become unique.

B) Decay of haplotype homozygosity for ten replicate simulations. When the core SNP is neutral (σ = 0; left side) the haplotype homozygosity decays at similar rates for both ancestral and derived alleles. When the derived alleles are favored (σ = 2Ns = 250; right side), the haplotype homozygosity decays much slower for the derived alleles than for the ancestral alleles. The discrepancy in the overall areas spanned by these two curves forms the basis of our test for selection (iHS).
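The decay of haplotype homozygosity is easy to compute on toy data. The sketch below is only meant to illustrate the idea behind the figure: haplotypes are strings of '0' (ancestral) and '1' (derived), the core SNP is the first position, and distance is counted in SNPs to the right of the core. The function names and the data are made up, and the published iHS statistic involves more than this (as far as I understand, genetic distances and a standardization step), which I leave out.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, extent):
    """Extended haplotype homozygosity: the probability that two distinct
    haplotypes carrying `allele` at position `core` are identical from the
    core out to `extent` SNPs to its right."""
    carriers = [h[core:core + extent + 1] for h in haplotypes
                if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return 0.0
    counts = Counter(carriers)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


def ehh_area(haplotypes, core, allele, max_extent):
    """Area under the EHH curve, with unit spacing between SNPs."""
    return sum(ehh(haplotypes, core, allele, extent)
               for extent in range(max_extent + 1))


# Derived-allele carriers share one long haplotype; ancestral ones do not.
haps = ["1010", "1010", "1010", "0110", "0001", "0100"]
print("derived area:  ", ehh_area(haps, 0, "1", 3))
print("ancestral area:", ehh_area(haps, 0, "0", 3))
```

A large difference between the two areas is the signal to look for: the derived allele has risen in frequency so quickly that recombination has not yet had time to break up the haplotype it sits on.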


The citation was (for the benefit of Research Blogging):
Voight, B., Kudaravalli, S., Wen, X., Pritchard, J. (2006). A Map of Recent Positive Selection in the Human Genome. PLoS Biology, 4(3), e72. DOI: 10.1371/journal.pbio.0040072

Mapping human genetic ancestry

Wednesday, January 30th, 2008

Yesterday I read the paper

Mapping human genetic ancestry

I. Ebersberger et al.

Molecular Biology and Evolution 2007 24(10):2266-2276

that addresses the same problem that we addressed in

Genomic relationships and speciation times of human, chimpanzee and gorilla inferred from a coalescent hidden Markov model

A. Hobolth et al.

PLoS Genetics 2007 3(2): e7 doi:10.1371/journal.pgen.0030007

although they take a different approach to the problem and use a lot more data.

Tracing the ancestry of the human genome

Species trees and gene trees

Humans’ closest living relatives are the chimps, and the closest relatives of humans and chimps are the gorillas, but the species are so closely related that not all of the genome follows the species genealogy.

The reason this happens is that as we trace the history of a piece of our DNA back in time, we will necessarily find the most recent common ancestor of humans and chimps further back in time than the speciation time of humans and chimps. If this time is so far back that it also precedes the speciation time of the human/chimp ancestor and the gorilla ancestor, then the most recent common ancestor of chimps and gorillas, or humans and gorillas, might be younger than the most recent common ancestor of all the species.

Looking at the DNA of the three species we can infer the average time in the past where the DNA splits into the different species, and using coalescent theory we can then infer the speciation times.

In Hobolth et al. we approximated the coalescent process using a hidden Markov model, which enabled us to efficiently analyse large alignments of DNA sequences and from this extract the parameters needed to infer speciation times and information about the diversity in ancestral species, and to annotate the alignments with the most likely genealogy, e.g. showing us in which parts of our genome we are more closely related to gorillas than to chimps.
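How often should we expect a piece of DNA to disagree with the species tree? Under the standard neutral coalescent for three species there is a simple textbook answer: the two lineages fail to find a common ancestor in the human/chimp ancestral population with probability exp(-t / 2Ne), where t is the time between the two speciation events in generations, and when that happens two of the three equally likely pairings give a gene tree that differs from the species tree. The sketch below just evaluates that formula; the parameter values are round numbers I picked for illustration, not estimates from either paper, and this is not the CoalHMM model itself.

```python
import math


def incongruence_probability(interval_years, generation_time, ne):
    """Probability that a gene tree differs from the species tree for three
    species, given the time between the two speciation events (in years),
    the generation time (in years) and the diploid effective population
    size of the ancestral population between the two events."""
    generations = interval_years / generation_time
    # Probability that the two lineages do not coalesce in the interval.
    p_no_coalescence = math.exp(-generations / (2.0 * ne))
    # Two of the three possible first coalescences give the "wrong" tree.
    return (2.0 / 3.0) * p_no_coalescence


# Illustrative values only: 2 million years between the speciation events,
# 20 years per generation, effective population size 50,000.
print(incongruence_probability(2_000_000, 20, 50_000))
```

The shorter the interval between the speciation events and the larger the ancestral population, the larger the fraction of the genome where the genealogy disagrees with the species tree, which is why that fraction carries information about both.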

CoalHMM

We applied this to five large alignments, but covering only a small fraction of the entire genome.

In Ebersberger et al. they construct a large number of (smaller) alignments covering the entire genome and consider the same problem in analysing this data.

The statistical model they use is slightly less sophisticated than what we did, but that is probably more than compensated for by the much larger data set. What they do is construct a single tree for each alignment by picking the most likely of the possible phylogenies, discarding alignments where there is no clear winner.

They then use coalescent theory to infer the diversity of the ancestral species, measured as the parameter Ne (effective population size), essentially doing the same as we did. As far as I understand, however, they equate DNA divergence time with speciation time, which strictly speaking is incorrect (I might be wrong here; I didn’t check in detail how they inferred the time interval between the human/chimp divergence and their divergence from the gorilla).

Diversity of the human-chimp ancestor along the human genome

A plot of diversity is shown in the bottom half of the figure.

Their estimates of Ne are pretty close to ours (65,000 ± 30,000). This is pretty good news, considering that the results come about using different methods (although based on the same underlying theory).

However, the assumptions we put into the analyses differ. To calibrate the molecular clock we both use the divergence time from the orangutan, but where we used 18 million years (Myr) ago they use 16 Myr ago. The generation time is also very important in estimating the divergence, and where we used 25 years as the average generation time they used 20 years. Our estimate of generation time is a bit on the high side (Ebersberger et al. call it unrealistically high), but we really had no idea what to use here when we did our analysis.

How much have these assumptions affected the results?

With help from Julien Dutheil, who has just re-written the entire CoalHMM software, I got the numbers our analysis would have obtained had we used the assumptions from Ebersberger et al. The human-chimp divergence we estimate is 5.1 Myr (as opposed to their 5.7) and the divergence from the gorilla we estimate at 8.4 Myr (as opposed to their 7.8). This is close enough to be considered the same. When we then estimate the speciation times, where the generation time assumption is important, we get 3.6 Myr for the human/chimp speciation and 5.7 Myr for the (human/chimp)/gorilla speciation. These look very recent to me, and I don’t fully trust them. I have seen numbers around 4 Myr for the closest distance between human and chimp, but the fossil record just doesn’t match that.

For the Ne estimate, the new assumptions give us a whopping 81,000 for the human/chimp ancestor. I’m not really sure why. Using their assumptions moves us further from their estimates. This is probably worth looking into.


Citations, for Research Blogging:
Ebersberger, I., Galgoczy, P., Taudien, S., Taenzer, S., Platzer, M., von Haeseler, A. (2007). Mapping Human Genetic Ancestry. Molecular Biology and Evolution, 24(10), 2266-2276.
Hobolth, A., Christensen, O.F., Mailund, T., Schierup, M.H. (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model. PLoS Genetics, 3(2), e7. DOI: 10.1371/journal.pgen.0030007