Back up again

Well, I guess I’m back…

When I wrote the post on admixture proportions the other day, I came back to this blog after having neglected it for a very long time. The WordPress dashboard was lit up with pending updates, and half of them required an update of the underlying software, such as MySQL and PHP.

I couldn’t do that myself, so I asked for it to be updated on the hosting server. That was done, but in the process the site moved to a different server, so it has been offline for a bit: first to fix some software issue, and then because it takes a little while for the various DNS servers to update their caches.

Anyway, from where I’m sitting now, the site is up and running again.

Estimating admixture proportions

I am not entirely sure about this, but something seems wrong to me in a number of papers I have read recently.

A couple of them I even reviewed before they were published so if I am right in my suspicion I am partly responsible.

Anyway, it has to do with estimating the admixture proportions when one population, let’s call it X, is admixed between two other populations, A and B, say. More precisely, two populations A’ and B’, with A’ closely related to A and B’ closely related to B, admixed to create the population X’ ancestral to X. X’ was created with a proportion α from A’ and β = 1 − α from B’.

We want to estimate α.

In Durand et al. (2011) we get a test for this. It is based on counting ABBA-BABA patterns (essentially the D statistic without normalisation) and comparing these for two selected quartets of populations. They call it the f^ estimator and it is described around equations (7) and (8).

First there is one version where, in terms of the populations I described above, you compare the quartet (A, X, B, O) with (A1, A2, B, O), using two samples A1 and A2 from A. The idea here is, as far as I understand, that A2 must be completely “A”, so we contrast how “A” X is with someone who is completely A.

There is nothing wrong with that, but it isn’t an estimate of the admixture proportions. It doesn’t take into account that “A-ness” has evolved since the admixture time — potentially for a long time if that event is far back in time — so we are seeing both the admixture and that evolution.

The second version takes another sequence related to A but that branched off before the admixture event. If we use that version we can actually get an estimate of the admixture proportions.

I will shortly explain how, but first let me mention that what worries me is that I see the first version being used to estimate the proportions without (generally) acknowledging that this isn’t what it is doing. Worse, if you compare two populations to figure out how admixed they are and you ignore this problem, how do you know that it is the admixture proportions you are measuring and not the drift after the admixture event?

Okay, to the estimator.

I find it easier to think in terms of the f4 statistics from Patterson et al. (2012). In general the way of thinking about drift evolving along admixture graphs I find extremely elegant and easy to reason about, at least compared to counts of site patterns.

The f4 statistic, which is essentially the D statistic and so very similar to the Durand ABBA-BABA counts, captures the overlap between the “drift flow” between two pairs of populations. f4(A,B;C,O), for example, is the drift on the overlap of the path from A to B and the path from C to O. That is the overlap between the blue and the green line, or the drift on edge x. So f4(A,B;C,O) = f4(C,O;A,B) = x.

When there is admixture, the drift from one population to another takes more than one path, so for example the drift from X to B takes two different routes, one over the edge close to A, with probability α, and one over the edge close to B, with probability β. For f4(C,O;X,B) we therefore again have the only overlap on edge x, but we only take that path with probability α (the path we take with probability β doesn’t overlap the path from C to O so it doesn’t get counted). So f4(C,O;X,B) = αx.

Since f4(C,O;A,B) = x and f4(C,O;X,B) = αx we can estimate α as f4(C,O;X,B)/f4(C,O;A,B). This is called the f4 ratio estimator in Patterson et al. and is essentially the same as the second f^ estimator from Durand et al.
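
To make the ratio concrete, here is a minimal Python sketch. It assumes the usual allele-frequency form of the sample statistic, f4(P1,P2;P3,P4) as the average over sites of (p1 − p2)(p3 − p4). The function names and the frequencies are made up for illustration, and X is built as an exact mixture of A and B with no drift after the admixture, which is an idealisation of the A’/B’ setup above. It is a toy, not how any of the papers compute their estimates.

```python
def f4(p1, p2, p3, p4):
    """Sample f4(P1,P2;P3,P4): mean over sites of (p1 - p2) * (p3 - p4)."""
    n = len(p1)
    return sum((a - b) * (c - d) for a, b, c, d in zip(p1, p2, p3, p4)) / n


def f4_ratio(c, o, x, a, b):
    """Estimate alpha as f4(C,O;X,B) / f4(C,O;A,B)."""
    return f4(c, o, x, b) / f4(c, o, a, b)


# Made-up allele frequencies at five sites (illustrative only, not real data).
freq_a = [0.10, 0.80, 0.35, 0.60, 0.05]
freq_b = [0.40, 0.20, 0.90, 0.55, 0.50]
freq_c = [0.15, 0.70, 0.30, 0.65, 0.10]
freq_o = [0.50, 0.50, 0.60, 0.40, 0.30]

# Idealised X: an exact alpha / (1 - alpha) mixture of A and B,
# with no drift after the admixture event (an assumption of this toy).
alpha = 0.3
freq_x = [alpha * a + (1 - alpha) * b for a, b in zip(freq_a, freq_b)]

estimate = f4_ratio(freq_c, freq_o, freq_x, freq_a, freq_b)
print(round(estimate, 6))
```

In this idealised case the ratio recovers α up to floating-point error, because at every site (c − o)(x − b) is exactly α times (c − o)(a − b). With real data the frequencies are noisy sample estimates, so you would also want a standard error on the ratio.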

When the admixture event — or at least the branching off of the population that will admix — is ancestral to both A and C we have a different topology so the ratio is not equal to alpha. f4(C,O;A,B) = x + y so now we have f4(C,O;X,B)/f4(C,O;A,B) = αx / (x + y).

It is a lower bound for alpha, but how much below alpha you get depends on the length of branch y.
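
To get a feel for how much the extra branch matters, here is a small Python sketch that just evaluates αx / (x + y) for a few values of y. The numbers for α, x and y are hypothetical drift values chosen only for illustration; nothing here is estimated from data.

```python
def f4_ratio_expected(alpha, x, y):
    """Expected f4 ratio alpha * x / (x + y) under the second topology."""
    return alpha * x / (x + y)


alpha = 0.3  # hypothetical true admixture proportion
x = 0.02     # hypothetical drift on the shared edge x

# y = 0 corresponds to the first topology, where the ratio equals alpha.
for y in (0.0, 0.005, 0.02, 0.06):
    ratio = f4_ratio_expected(alpha, x, y)
    print(f"y={y:<6} ratio={ratio:.3f} shortfall={alpha - ratio:.3f}")
```

When y = 0 the ratio equals α, and when y is as long as x the ratio is only half of α, so the underestimate can be substantial if the admixing lineage branched off well before A and C split.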

Unless I am misunderstanding the f^ statistics, and it is very different from the f4 ratio estimator, I think I am seeing several papers estimating alpha using the second topology. All those estimates are then too low.

Or am I missing something?

Durand, E.Y. et al., 2011. Testing for ancient admixture between closely related populations. Molecular Biology and Evolution, 28(8), pp.2239–2252.

Patterson, N. et al., 2012. Ancient admixture in human history. Genetics, 192(3), pp.1065–1093.