Estimating local ancestry
When two populations A and B meet and start to mix, the resulting population will -- for the first many generations, at least -- be a mix of the two original populations. It is not that each individual will belong to one of the original populations and that the mixed population will consist of such "original population" A or B individuals. At least not after a few generations. Instead, each individual will be a mix of population A and B.
For several generations following the merge of populations A and B, mutations will not change the genes in the mixed population much. It takes a long time for mutations to accumulate. Each gene in the population will essentially be unchanged compared to the gene in one (or both) of the ancestor populations. Here, by gene, I simply mean a "chunk" of DNA, not necessarily a functional bit, so don't read too much into it. In any case, the genes will not change much, but will look like genes from A or from B, where of course genes from A can differ significantly from genes from B.
Recombination will shuffle the genes from A and B around, however. If a "mainly A" individual mates with a "mainly B" individual, the offspring will inherit both A and B genes in some combination. As you scan along a chromosome, the ancestral population will change back and forth between A genes and B genes.
Just based on samples from the present-day chromosomes, can we infer the local ancestry, i.e. which chunks of each chromosome came from A and which came from B? In this month's issue of the American Journal of Human Genetics, there is a paper that addresses this exact problem:
Estimating Local Ancestry in Admixed Populations
Sankararaman et al.
The American Journal of Human Genetics 82(2) 290-303
Large-scale genotyping of SNPs has shown a great promise in identifying markers that could be linked to diseases. One of the major obstacles involved in performing these studies is that the underlying population substructure could produce spurious associations. Population substructure can be caused by the presence of two distinct subpopulations or a single pool of admixed individuals. In this work, we focus on the latter, which is significantly harder to detect in practice. New advances in this research direction are expected to play a key role in identifying loci that are different among different populations and are still associated with a disease. We evaluated current methods for inference of population substructure in such cases and show that they might be quite inaccurate even in relatively simple scenarios. We therefore introduce a new method, LAMP (Local Ancestry in adMixed Populations), which infers the ancestry of each individual at every single-nucleotide polymorphism (SNP). LAMP computes the ancestry structure for overlapping windows of contiguous SNPs and combines the results with a majority vote. Our empirical results show that LAMP is significantly more accurate and more efficient than existing methods for inferring locus-specific ancestries, enabling it to handle large-scale datasets. We further show that LAMP can be used to estimate the individual admixture of each individual. Our experimental evaluation indicates that this extension yields a considerably more accurate estimate of individual admixture than state-of-the-art methods such as STRUCTURE or EIGENSTRAT, which are frequently used for the correction of population stratification in association studies.
Inferring local ancestry
The method in this paper makes a few simplifying assumptions that make it computationally efficient to run.
They assume that the samples considered are from a mix of populations that contributed to the sample in known proportions -- e.g. that population A contributed 80% and B 20% -- a known number of generations ago. The method is not that sensitive to knowing the exact proportions or the number of generations -- and using other methods you can infer these parameters anyway -- but assuming that you know these parameters simplifies the mathematics of the method.
A more important assumption is that you can split the chromosomes into sliding windows where recombinations do not occur inside the windows. This is obviously not correct, but it helps the method a lot and is not as silly as it sounds.
If you can split the chromosomes into sliding windows of a fixed length and then infer which chromosomes belong to which ancestral population within each window, the inference problem is much easier to solve. If you then slide this window along the chromosomes and assign populations to the chromosomes in each window, each nucleotide is covered by several overlapping windows.
The same nucleotide can therefore be assigned to different populations depending on the window. Is this a problem? Yes and no.
For the windows that overlap a given nucleotide, the method takes a vote, and the majority decides which population the nucleotide "really" belongs to. That way you get a unique population per nucleotide.
This is a pretty good idea. This way you get the fast computation of the ancestral population inference and on average you assign the right population to the nucleotides.
You will not be able to accurately infer the break-points where the ancestry switches from one population to the other, but for most applications that is not that important in the first place. You want to assign the nucleotide to the right population on average, and this is what you achieve this way.
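The voting step can be sketched in a few lines of Python. This is only an illustration of the majority-vote idea, not the LAMP implementation: it assumes each window has already been given a single ancestry label for the chromosome in question (the per-window inference itself is not shown), and the function name, window layout and labels are made up for the example.

```python
from collections import Counter

def majority_vote(window_calls, n_sites):
    """Combine per-window ancestry calls into one call per site.

    window_calls: list of (start, end, label) tuples, end exclusive.
    Each window assigns its single label to every site it covers.
    (Hypothetical input format, chosen for this sketch.)
    """
    votes = [Counter() for _ in range(n_sites)]
    for start, end, label in window_calls:
        for site in range(start, min(end, n_sites)):
            votes[site][label] += 1
    # Sites covered by no window get None. Ties are not treated
    # specially here; the example below is built to avoid them.
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Made-up calls for windows of length 3, sliding one site at a time
# over a chromosome with 8 sites.
calls = [
    (0, 3, "A"), (1, 4, "A"), (2, 5, "A"),
    (3, 6, "B"), (4, 7, "B"), (5, 8, "B"),
]
per_site = majority_vote(calls, 8)
print("".join(per_site))  # AAAABBBB
```

In this toy example the switch from A to B lands wherever the majority flips. That is not necessarily where the true break-point is, which is the loss of resolution mentioned above.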
Relevance for association mapping
The motivation for the paper is association mapping, where you compare the frequency of alleles between cases and controls for a given disease, looking for markers where the frequency is different between cases and controls. Such markers are potential candidates for disease genes: if one allele is more frequent in cases than in controls, maybe it is more frequent because it increases the risk of the disease.
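As a rough illustration of what "comparing allele frequencies" means in practice, here is a minimal sketch of a Pearson chi-square statistic on a 2x2 table of allele counts. This is a textbook test, not something taken from the paper, and the counts are invented for the example.

```python
def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Each argument is a pair: (count of allele 1, count of allele 2).
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if allele and case/control status were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Invented counts: allele 1 is seen 60 times in cases, 40 times in controls.
stat = allele_chi_square((60, 40), (40, 60))
print(f"chi-square = {stat:.2f}")  # chi-square = 8.00
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05, so a marker like this one would be flagged as a candidate.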
If your samples are from a population that is a mix of different ancestral populations, there is a high risk of biases. There is a bias in the sampling: if the ancestral populations are sampled in different ratios for cases than controls, you will pick up differences in cases and controls just because of that.
There are obvious situations where this can happen. If you sample cancer patients from an expensive clinic (rich white Americans) and the controls from the ER (with a higher ratio of African Americans), for example, you get a different ratio of individuals of African descent and European descent in cases and controls.
If you do not correct for this, you are mapping "ancestral population" genes instead of disease genes.
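A small worked example makes the problem concrete. The numbers below are invented: a marker with no effect on the disease at all, whose allele frequency happens to be 0.8 in population A and 0.2 in population B, with cases drawn 90% from A and controls 50% from A.

```python
def expected_allele_freq(frac_from_a, freq_in_a, freq_in_b):
    """Allele frequency in a sample that is a mix of two populations."""
    return frac_from_a * freq_in_a + (1 - frac_from_a) * freq_in_b

# Invented numbers; the marker has no effect on disease risk.
FREQ_IN_A, FREQ_IN_B = 0.8, 0.2

cases = expected_allele_freq(0.9, FREQ_IN_A, FREQ_IN_B)     # 90% of cases from A
controls = expected_allele_freq(0.5, FREQ_IN_A, FREQ_IN_B)  # 50% of controls from A

print(f"cases: {cases:.2f}, controls: {controls:.2f}")  # cases: 0.74, controls: 0.50
```

The allele looks strongly associated with the disease (0.74 against 0.50), yet the whole difference comes from the sampling ratios. If cases and controls had the same ancestry mix, the two frequencies would be identical.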
Comparison to CoalHMMs
I didn't actually read the paper with association mapping in mind -- although the problem is extremely relevant for such studies. I do association mapping in Icelanders, so it is not that important for my own work, though.
I read it for a meeting in our "coalescent hidden Markov model" group.
With our CoalHMMs, we try to learn about speciation events. When there is lineage sorting in the speciation -- as for example between humans, chimps and gorillas -- the nearest neighbour species of a chromosome changes along the chromosome -- in some regions humans are more closely related to chimps, in others more closely related to gorillas.
This setting is different from the population mixing problem. For one thing, we are not dealing with different ancestral populations mixing, but rather populations splitting up to become separate species. Still, scanning along the chromosomes and inferring which phylogeny each nucleotide belongs to is similar to the problem here.
SANKARARAMAN, S., SRIDHAR, S., KIMMEL, G., HALPERIN, E. (2008). Estimating Local Ancestry in Admixed Populations. The American Journal of Human Genetics, 82(2), 290-303. DOI: 10.1016/j.ajhg.2007.09.022