19 Aug

Strong selective sweeps on the X chromosome of the human-chimpanzee ancestor

We just published a new paper a few days ago title Strong Selective Sweeps on the X Chromosome in the Human-Chimpanzee Ancestor Explain Its Low Divergence


The human and chimpanzee X chromosomes are less divergent than expected based on autosomal divergence. We study incomplete lineage sorting patterns between humans, chimpanzees and gorillas to show that this low divergence can be entirely explained by megabase-sized regions comprising one-third of the X chromosome, where polymorphism in the human-chimpanzee ancestral species was severely reduced. We show that background selection can explain at most 10% of this reduction of diversity in the ancestor. Instead, we show that several strong selective sweeps in the ancestral species can explain it. We also report evidence of population specific sweeps in extant humans that overlap the regions of low diversity in the ancestral species. These regions further correspond to chromosomal sections shown to be devoid of Neanderthal introgression into modern humans. This suggests that the same X-linked regions that undergo selective sweeps are among the first to form reproductive barriers between diverging species. We hypothesize that meiotic drive is the underlying mechanism causing these two observations.

We’ve been working on the differences between the autosomes and the X chromosome for a little while now, but this work actually started earlier than this, while we were working on the gorilla genome analysis.

There was an interesting observation made by Patterson et al. (2006) that the divergence between humans and chimpanzees is much lower on the X chromosome than on the autosomes, even taking into account the smaller effective population size on the X chromosome in the common ancestor. They interpreted this, combined with seeing a large variation on coalescence times on the autosomes, as suggestive of strong population structure in the ancestral species, in particular as suggestive that the human ancestor was admixed between a lineage closely related to the chimpanzee ancestor and another lineage that diverged earlier from the chimpanzee.

Most of the patterns they observed could also be explained by just having a very large effective population size in the ancestral species, but the difference in divergence between the autosomes and the X chromosome was still a puzzle.

The X divergence is simply too small compared to the expectation from just having an effective population size of 3/4 on the X chromosome compared to the autosomes (there are only 3/4 as many X chromosomes as autosomes in a population (assuming the same number of males and females) and while different breeding success in males and females is expected to deviate from the 3/4 somewhat the divergence is really very small for that to be the explanation — especially considering that we don’t see that large differences elsewhere in great apes).

While working on the gorilla genome paper we used our coalescent hidden Markov model approach to examine this. The CoalHMM lets us look at incomplete lineage sorting along the genome which is a strong proxy for the diversity in the ancestral species. And while we clearly saw the reduced divergence on the X chromosome between humans and chimpanzees we also clearly saw a reduction in ILS — something also observed in the Patterson et al. paper — suggesting that the explanation is not a lower mutation rate but reduced diversity in the ancestral species.

Interestingly, we observed that this reduction in ILS was not uniform along the X chromosome. This is what we have studied in greater detail in the new paper.

What we see is that the ancestral diversity on the X chromosome looks bimodal. On some parts of the X chromosome we see the 3/4 diversity we would expect and in other parts of the chromsome it is much reduced compared to that.

Selection is expected to reduce diversity. Purifying selection removes diversity because deleterious mutations are purged from the population and positive selection creates a sweep where part of the genome has coalesced very recently.

Very wide regions with low diversity is hard to create by purifying selection — it depends a little on the math and the assumptions that goes into the models, but generally we don’t expect it based on the models we have and we don’t see it when we try to simulate the effect of such selection.

Positive selection — selective sweeps — can create it, but very wide regions as those we observe require either very strong sweeps or recurrent sweeps.

Whatever creates it, it is clearly there, and that is a bit of a puzzle.

We couldn’t really come up with any explanation for the pattern we see except for strong selection. That doesn’t mean that it is selection, of course, but we are completely blank on what it could be if it isn’t.

We don’t really have an explanation for it in the paper. We don’t really have an explanation, period, actually. We do provide a hypothesis for something it could be, but we don’t have any strong evidence for it, so it is something we are still working on figuring out.

Our results do not rule out the complex speciation scenario suggested by Patterson et al. Hybrid incompatibility could be the selection that we are observing, and very interestingly the regions we see the reduced ancestral divergence in are exactly the regions where we see no Neanderthal introgression in non-African modern humans which suggests it could be the same mechanism we see in the human-chimpanzee ancestor.

However, we also see reduced diversity in extant great apes in the same regions so unless there is hybridisation and barriers against gene flow in these regions everywhere in the great ape phylogeny, there is something more complicated — and more interesting — going on.

We are now trying to figure out what that could be.

24 Jun

PubChase and Papers

I wanted to try out pubchase since it sounds like a great way of getting updates on papers I would want to read. I have uploaded my Mendeley library there, but it will take a few days before it can give me recommendations, apparently. I look forward to seeing how it works.

I have been pretty happy with how ReadCube performs at finding recommendations, but while I love the Enhanced PDF feature in ReadCube I still prefer to use Papers for managing my papers and citations (mainly because I’ve had a lot of trouble using ReadCube for citations in a paper). Papers doesn’t do recommendations, though, and doesn’t have features that lets you track who have cited a paper and things like that. It’s been a long while since I used Mendeley.

I have set up a workflow for automatically importing papers I download in ReadCube into Papers. Is there any easy way of automatically sending papers from pubchase to Papers?

22 Jun

Admixture graph R package

The last couple of months I have worked on and off on an R package for modelling and testing admixture graphs.

You can download it from github or install it directly in R using:

I know, the github repository has an underscore and the package name does not. R packages can’t have an underscore in their name and I didn’t think about it when I made the repository, so that is how it is right now.

Building admixture graphs

I’m using the package in a couple of projects right now where I’m using it to fit graphs to data. The data I work with is D statistics — I don’t compute those in the package but use ADMIXTOOLS — and I use the package to extract equations for the expected values of these statistics and for fitting graph parameters (edge lengths and admixture proportions) to the data.

It is similar to the pqGraph tool from ADMIXTOOLS (that I have never managed to run) except that I don’t compute error bars on parameters yet. I still have to find a good way of doing that. I have some ideas, but it is a bit more complicated than you might think.

Anyway, the code for specifying graphs is a bit crude but pretty straightforward. The code below builds and plots a graph.

Admixture graph

Admixture graph

Fitting graphs

With a data frame with columns W, X, Y, Z, and D (the first four should be samples and D is then the D(W,X;Y,Z) statistics) you can then fit a graph to the data.

The interface works with magrittr or dplyr pipelines so you can write something like

to fit the graph parameters and plot the fit.


Fitted data

Fitted data

You can also extract the fitted parameters from the result of fit_graph() using the coefficients() function, get the fitted values using the fitted() function, and in general use the usual interface for fitted models in R.

Except for confidence intervals with confint(). As I wrote above, I haven’t quite figured out how to do that yet.

It is not terribly solid code yet — it is more likely to crash with a meaningless error message than a meaningful one — but I am working on improving that. If anyone can find a use for it, and give some feedback, I would much appreciate it.

22 Jun

An early modern human from Romania with a recent Neanderthal ancestor

New interesting paper out: An early modern human from Romania with a recent Neanderthal ancestor Fu et al.

Neanderthals are thought to have disappeared in Europe approximately 39,000–41,000 years ago but they have contributed 1–3% of the DNA of present-day people in Eurasia. Here we analyse DNA from a 37,000–42,000-year-old modern human from Pestera cu Oase, Romania. Although the specimen contains small amounts of human DNA, we use an enrichment strategy to isolate sites that are informative about its relationship to Neanderthals and present-day humans. We find that on the order of 6–9% of the genome of the Oase individual is derived from Neanderthals, more than any other modern human sequenced to date. Three chromosomal segments of Neanderthal ancestry are over 50 centimorgans in size, indicating that this individual had a Neanderthal ancestor as recently as four to six generations back. However, the Oase individual does not share more alleles with later Europeans than with East Asians, suggesting that the Oase population did not contribute substantially to later humans in Europe.

I actually heard about this back in May in Cold Spring Harbor, but it is great to be able to actually read the paper.

What especially excites me about this paper is that it is hinting at admixture between Neanderthals and modern humans was pretty common. For some values of “pretty” and “common” at least. What we know so far about Neanderthal admixture is that all non-Africans have a little but that Asians (and Native Americans) have a little more than Europeans — somewhat surprising — and that this most likely is because of a second admixture into Asians.

There was a suggestion that the higher level of Neanderthals in Asians was caused by less negative selection removing it from Asians but a couple of papers argues against that (see e.g. Vernon & Akey 2015).

So there was the admixture event ancestral to all Eurasians, another into Asians, and now a third admixture event.

This individual — and I guess this admixture event — didn’t leave much (if any) genes in extant populations, so it isn’t part of the admixture that we see the results of today, but it was an admixture event between modern humans and Neanderthals nevertheless.

There is a rule of thumb that says: “zero, one, or many” — either something never happens, it is rare enough that it only happens once, or it happens a lot. It is not a strong rule, but now that we have evidence for at least three admixture events it seems hard to argue that interbreeding between modern humans and Neanderthals was rare.

The strange thing now is how uniform the level of Neanderthals is in Eurasians after all. I think I would expect some geographic differences, but maybe such differences have just been evened out by later migration. I don’t know. I don’t really have a strong intuition for that.

Vernot, B., & Akey, J. M. (2015). Complex History of Admixture between Modern Humans and Neandertals. American Journal of Human Genetics.

28 May

Damn simple suffix array construction

As you may or may not know, suffix arrays are the key data structure for many string algorithms. Well, the key data structure is suffix trees but those take up way too much memory so we use suffix arrays instead because they can do the same things (with a little extra bookkeeping) while taking up a lot less memory.

We just finished our class on string algorithms this week and Storm (my co-teacher) and I were talking about simple suffix array constructions. There are many different algorithms for constructing suffix arrays. There are several algorithms for constructing suffix arrays in linear time — although in practise these are often slower than algorithms with worse big-O complexities — but we were just talking about very easy implementations of them.

In C, where strings are just pointers to to null-terminated byte buffers you can get a representation of all suffixes of a string — so all substrings that ends at the same point as the original string — by having a pointers into that string. So it is easy to represent the set of all suffixes as an array of pointers.

The suffix array is just an array that contains the rank (the order) of all suffixes, so if you can just soft such an array you have constructed the suffix array.

In C you can do that just by sorting the array of pointers.

The worst-case complexity of this is pretty bad — you have an O(n log n) complexity for sorting and each comparison could in principle take time n as well since you have to compare strings that on average n in length — but in practise you can compare strings a lot faster since they do not typically share long prefixes (so you can terminate the comparison quickly) and the n log n complexity doesn’t matter much when you are using an optimised search algorithm.

So you should be able to sort the suffixes in a few lines of code using qsort and strcmp.

You need a wrapper to make qsort work with strcmp, but other than that, it is really that simple.

Below is my implementation of that. It is a complete program that takes a string on the command line and outputs the suffix array for it.

I also made a version that reads a string from a file and on my laptop I can build a suffix array for a file that is around a hundred megabases in a minut or so (for a human chromosome) so it isn’t that bad in performance.

Here is my simple implementation: