Estimating admixture proportions

I am not entirely sure about this, but something seems wrong to me in a number of papers I have read recently.

A couple of them I even reviewed before they were published, so if I am right in my suspicion, I am partly responsible.

Anyway, it has to do with estimating the admixture proportions when one population, let’s call it X, is admixed between two other populations, A and B, say. Rather, two populations A’ and B’, A’ closely related to A and B’ closely related to B, admixed to create the population X’ ancestral to X. X’ was created with a proportion of α from A’ and β=1-α from B’.

We want to estimate α.

In Durand et al. (2011) we get a test for this. It is based on counting ABBA-BABA patterns — essentially the D statistics without normalisation — and comparing these for two selected quartets of populations. They call it the f^ estimator and it is described around equation (7) and (8).

First there is one version where, in terms of the populations I described above, you compare the quartet (A, X, B, O) with (A1, A2, B, O), using two samples A1 and A2 from A. The idea here is, as far as I understand, that A2 must be completely “A”, so we see how “A” X is in contrast to someone who is completely A.

There is nothing wrong with that, but it isn’t an estimate of the admixture proportions. It doesn’t take into account that “A-ness” has evolved since the admixture time — potentially for a long time if that event is far back in time — so we are seeing both the admixture and that evolution.

The second version takes another sequence related to A but that branched off before the admixture event. If we use that version we can actually get an estimate of the admixture proportions.

I will shortly explain how, but first let me mention what worries me: I see the first version being used to estimate the proportions, (generally) without acknowledging that this isn’t what it does. Worse, if you compare two populations to figure out how admixed they are and you ignore this problem, how do you know that it is the admixture proportions you are measuring and not the drift after the admixture event?

Okay, to the estimator.

I find it easier to think in terms of the f4 statistics from Patterson et al. (2012). In general the way of thinking about drift evolving along admixture graphs I find extremely elegant and easy to reason about, at least compared to counts of site patterns.

The f4 statistic, which is essentially the D statistic and so very similar to the Durand ABBA-BABA counts, captures the overlap between the “drift flow” between two pairs of populations. f4(A,B;C,O), for example, is the drift on the overlap of the path from A to B and the path from C to O. That is the overlap between the blue and the green line, or the drift on edge x: f4(A,B;C,O) = f4(C,O;A,B) = x.

When there is admixture, the drift from one population to another takes more than one path. For example, the drift from X to B takes two different routes: one over the edge close to A, with probability α, and one over the edge close to B, with probability β. For f4(C,O;X,B) we therefore again have the only overlap on edge x, but we only take that path with probability α (the path we take with probability β doesn’t overlap the path from C to O, so it doesn’t get counted). So f4(C,O;X,B) = αx.

Since f4(C,O;A,B) = x and f4(C,O;X,B) = αx we can estimate α as f4(C,O;X,B)/f4(C,O;A,B). This is called the f4 ratio estimator in Patterson et al. and is essentially the same as the second f^ estimator from Durand et al.
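As a rough illustration of the ratio, here is a minimal Python sketch. It assumes the allele-frequency definition of f4 from Patterson et al., the average over sites of (p1 - p2)(p3 - p4), and it uses made-up frequencies in which X is an exact α/β mixture of A and B with no drift after the admixture. The function names `f4`, `f4_ratio` and `drift`, the Gaussian noise used as a crude stand-in for drift, and all the numbers are assumptions for illustration, not anything taken from the papers.

```python
import random

def f4(p1, p2, p3, p4):
    """f4(P1,P2;P3,P4): mean over sites of (p1 - p2) * (p3 - p4)."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(p1, p2, p3, p4)]
    return sum(products) / len(products)

def f4_ratio(c, o, x, a, b):
    """Estimate alpha as f4(C,O;X,B) / f4(C,O;A,B)."""
    return f4(c, o, x, b) / f4(c, o, a, b)

def drift(freqs, sd, rng):
    """Crude stand-in for drift: Gaussian noise, clamped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sd))) for p in freqs]

# Hypothetical toy data: one allele frequency per site and population.
rng = random.Random(42)
root = [rng.uniform(0.2, 0.8) for _ in range(5000)]
pop_o = drift(root, 0.05, rng)
pop_b = drift(root, 0.05, rng)
shared = drift(root, 0.05, rng)    # plays the role of edge x, shared by A and C
pop_a = drift(shared, 0.05, rng)
pop_c = drift(shared, 0.05, rng)

# Assumption: X is an exact mixture of A and B, with no later drift.
alpha = 0.3
pop_x = [alpha * a + (1 - alpha) * b for a, b in zip(pop_a, pop_b)]

estimate = f4_ratio(pop_c, pop_o, pop_x, pop_a, pop_b)
print(round(estimate, 6))  # 0.3, since X is an exact mixture here
```

The ratio comes out as α exactly here only because of the linearity of f4 in the frequencies of X; with real data there is sampling noise and post-admixture drift, so the estimate only matches α in expectation and only under the right topology.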

When the admixture event (or at least the branching off of the population that will admix) is ancestral to both A and C, we have a different topology and the ratio is no longer equal to α. The path from A to B now overlaps the path from C to O on an extra edge y, so f4(C,O;A,B) = x + y, while f4(C,O;X,B) is still αx. Now we have f4(C,O;X,B)/f4(C,O;A,B) = αx / (x + y).

It is a lower bound for alpha, but how much below alpha you get depends on the length of branch y.
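To make the path-overlap argument concrete, here is a minimal Python sketch that computes expected f4 values as signed edge overlaps on a small tree. The node names (`r1`, `ac`, `a_side`, `b_side`), the edge lengths, and the helper functions are all made up for illustration; they are a reconstruction of the two topologies from the description in the text, not code from Patterson et al. or from any package.

```python
def ancestors(parent, node):
    """Chain of nodes from `node` up to the root."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]][0])
    return chain

def signed_path(parent, src, dst):
    """Edges (named by their child node) on the path src -> dst: +1 up, -1 down."""
    up, down = ancestors(parent, src), ancestors(parent, dst)
    common = next(n for n in up if n in down)  # lowest common ancestor
    path = {n: +1 for n in up[:up.index(common)]}
    path.update({n: -1 for n in down[:down.index(common)]})
    return path

def f4(parent, p1, p2, p3, p4):
    """Expected f4(P1,P2;P3,P4): signed drift on edges shared by both paths."""
    path12 = signed_path(parent, p1, p2)
    path34 = signed_path(parent, p3, p4)
    return sum(s * path34[e] * parent[e][1] for e, s in path12.items() if e in path34)

def expected_ratio(parent, alpha, a_source, b_source):
    """Expected f4(C,O;X,B) / f4(C,O;A,B) when X comes from a_source with
    probability alpha and from b_source with probability 1 - alpha."""
    via_a = dict(parent, X=(a_source, 0.0))  # X's private edge is never shared
    via_b = dict(parent, X=(b_source, 0.0))
    numerator = (alpha * f4(via_a, "C", "O", "X", "B")
                 + (1 - alpha) * f4(via_b, "C", "O", "X", "B"))
    return numerator / f4(parent, "C", "O", "A", "B")

x, y, alpha = 0.02, 0.01, 0.3  # hypothetical edge lengths and proportion

# Topology 1: A' branches off the A lineage after C has split off.
first = {"O": ("root", 0.1), "r1": ("root", 0.1),
         "b_side": ("r1", 0.1), "B": ("b_side", 0.1),
         "ac": ("r1", x), "C": ("ac", 0.1),
         "a_side": ("ac", 0.1), "A": ("a_side", 0.1)}

# Topology 2: A' branches off before A and C diverge, adding edge y.
second = {"O": ("root", 0.1), "r1": ("root", 0.1),
          "b_side": ("r1", 0.1), "B": ("b_side", 0.1),
          "a_side": ("r1", x), "ac": ("a_side", y),
          "C": ("ac", 0.1), "A": ("ac", 0.1)}

ratio_first = expected_ratio(first, alpha, "a_side", "b_side")
ratio_second = expected_ratio(second, alpha, "a_side", "b_side")
print(round(ratio_first, 6), round(ratio_second, 6))  # 0.3 0.2
```

With these made-up numbers the second topology gives 0.2 instead of the true 0.3, that is αx / (x + y) with y half as long as x; a longer y pushes the ratio further below α.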

Unless I am misunderstanding the f^ statistic and it is very different from the f4 ratio estimator, I think I am seeing several papers estimating α using the second topology. All those estimates are then too low.

Or am I missing something?

Durand, E.Y. et al., 2011. Testing for ancient admixture between closely related populations. Molecular Biology and Evolution, 28(8), pp.2239–2252.

Patterson, N. et al., 2012. Ancient admixture in human history. Genetics, 192(3), pp.1065–1093.

Author: Thomas Mailund

My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.

6 thoughts on “Estimating admixture proportions”

  1. Hi Thomas,

    There’s a lot of great stuff in there, and many things to agree with (especially how people are misusing f^). But in the last example, re-casting it in the ABBA-BABA framework, there should be no signal: if the admixture event happened before A and C diverged then the topology is not changed because A and C are equally related to B. So would anyone apply this method in these cases?


  2. Well, you are still looking at X in the second case. You are doing exactly the same as for the first topology, comparing f4(C,O;X,B) with f4(C,O;A,B). The statistic for the first will be the drift αx and the second will be x+y; they won’t be zero unless α is zero or x+y is zero.

    The estimates will actually look quite reasonable. They will just be too low because you are dividing by x+y instead of x.

    In Schubert et al. (2014) they compare two domesticated horses with a wild horse, where they also conclude that the admixture happened ancestral to the split between the domesticated horses, so that is exactly the second topology. They do acknowledge that it is an underestimate but I am not sure they explain that this is one of the reasons.

    In Cahill et al. (2013, 2014) they do it for bears, using two polar bears that are likely to have diverged after the admixture event. They don’t state that directly, but they do say that all polar bears are equally related to brown bears, so even if they didn’t assume it, they wouldn’t be guaranteed to use the right topology (they wouldn’t know which was A and which was C).

    The statistic in Martin et al. (2014) definitely has this problem; there they pick the same population for A and C.

    People definitely use the last topology for this…

    M. Schubert, H. Jónsson, D. Chang, C. Der Sarkissian, L. Ermini, A. Ginolhac, A. Albrechtsen, I. Dupanloup, A. Foucal, B. Petersen, M. Fumagalli, M. Raghavan, A. Seguin-Orlando, T. S. Korneliussen, A. M. V. Velazquez, J. Stenderup, C. A. Hoover, C.-J. Rubin, A. H. Alfarhan, S. A. Alquraishi, K. A. S. Al-Rasheid, D. E. MacHugh, T. Kalbfleisch, J. N. MacLeod, E. M. Rubin, T. Sicheritz-Ponten, L. Andersson, M. Hofreiter, T. Marquès-Bonet, M. T. P. Gilbert, R. Nielsen, L. Excoffier, E. Willerslev, B. Shapiro, and L. Orlando, “Prehistoric genomes reveal the genetic foundation and cost of horse domestication,” Proc Natl Acad Sci U S A, p. 201416991, Dec. 2014.

    J. A. Cahill, R. E. Green, T. L. Fulton, M. Stiller, F. Jay, N. Ovsyanikov, R. Salamzade, J. St John, I. Stirling, M. Slatkin, and B. Shapiro, “Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution.,” PLoS Genet, vol. 9, no. 3, pp. e1003345–e1003345, Mar. 2013.

    J. A. Cahill, I. Stirling, L. Kistler, R. Salamzade, E. Ersmark, T. L. Fulton, M. Stiller, R. E. Green, and B. Shapiro, “Genomic evidence of geographically widespread effect of gene flow from polar bears into brown bears,” Molecular Ecology, (to appear) 2014.

    S. H. Martin, J. W. Davey, and C. D. Jiggins, “Evaluating the Use of ABBA–BABA Statistics to Locate Introgressed Loci,” Mol Biol Evol, (to appear) 2014.

  3. Okay, I’ve read those papers and think I see it.

    It is horribly confusing that everyone assigns P1, P2, and P3 differently in all these papers. In Durand P3 is the donor/introgressor and P2 is the acceptor, but in Cahill P1 is the acceptor. In your example, transforming it into Durand’s notation, A=P3, X=P2, and B=P1. Correct?

  4. Well, yeah I guess. The thing is, the difference between the populations is really related to the admixture proportions, right? They usually draw the topology as ((P1,P2),P3) with P2 admixed, but if you flip alpha and beta you get ((P3,P2),P1). It makes sense to have the tree topology if P2 gets most of its genes from the population closest to P1 but if alpha is close to 50% it could really be either tree.

    The notation I have used above would, in Durand et al. be A=P1, X=P2 and B=P3 (or you can flip P1 and P3 if you want).

  5. Hi Thomas,
    Thanks a lot for the post, nice explanations. I have a question: do you think that recombination matters? I mean, it could be that some parts of the genome have the last ‘bad’ topology with x+y and other parts have the ‘good’ one with only x. If this is true, then it will be more difficult to interpret the results. Or am I missing something here?

  6. Recombination matters, but these statistics use genome-wide patterns, so it won’t influence them. Locally, recombination matters a lot.
