24 Jun

PubChase and Papers

I wanted to try out PubChase since it sounds like a great way of getting updates on papers I would want to read. I have uploaded my Mendeley library there, but apparently it will take a few days before it can give me recommendations. I look forward to seeing how it works.

I have been pretty happy with how ReadCube performs at finding recommendations, but while I love the Enhanced PDF feature in ReadCube I still prefer to use Papers for managing my papers and citations (mainly because I’ve had a lot of trouble using ReadCube for citations in a paper). Papers doesn’t do recommendations, though, and doesn’t have features that let you track who has cited a paper and things like that. It’s been a long while since I used Mendeley.

I have set up a workflow for automatically importing papers I download in ReadCube into Papers. Is there an easy way of automatically sending papers from PubChase to Papers?

22 Jun

Admixture graph R package

The last couple of months I have worked on and off on an R package for modelling and testing admixture graphs.

You can download it from GitHub or install it directly in R using devtools; something like this should work:
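    # install straight from the GitHub repository (adjust the path if it has moved);
    # note the underscore in the repository name but not in the package name
    install.packages("devtools")   # if you don't have it already
    devtools::install_github("mailund/admixture_graph")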

I know, the GitHub repository has an underscore and the package name does not. R packages can’t have underscores in their names, and I didn’t think about that when I made the repository, so that is how it is right now.

Building admixture graphs

I’m using the package in a couple of projects right now to fit graphs to data. The data I work with are D statistics — I don’t compute those in the package but get them from ADMIXTOOLS — and I use the package to extract equations for the expected values of these statistics and to fit the graph parameters (edge lengths and admixture proportions) to the data.

It is similar to the qpGraph tool from ADMIXTOOLS (which I have never managed to run), except that I don’t compute error bars on the parameters yet. I still have to find a good way of doing that. I have some ideas, but it is a bit more complicated than you might think.

Anyway, the code for specifying graphs is a bit crude but pretty straightforward. The code below builds and plots a graph.
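Something along these lines, for a small four-population example (see the package documentation for the exact constructor functions):

    library(admixturegraph)

    # four leaves; C is admixed between a population related to B and one related to D
    leaves <- c("A", "B", "C", "D")
    inner_nodes <- c("R", "X", "Y", "Z", "M")

    edges <- parent_edges(c(edge("A", "X"),
                            edge("B", "Y"),
                            edge("C", "M"),
                            edge("D", "Z"),
                            edge("X", "R"),
                            edge("Y", "X"),
                            edge("Z", "R"),
                            admixture_edge("M", "Y", "Z")))

    # the admixture proportion on the edge into M is called alpha
    admixtures <- admixture_proportions(c(admix_props("M", "Y", "Z", "alpha")))

    graph <- agraph(leaves, inner_nodes, edges, admixtures)
    plot(graph)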

Admixture graph

Fitting graphs

Given a data frame with columns W, X, Y, Z, and D (the first four identify the samples and D is the corresponding D(W,X;Y,Z) statistic), you can then fit a graph to the data.

The interface works with magrittr or dplyr pipelines so you can write something like
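    # using the graph built above and a data frame of D statistics;
    # "d_stats.txt" is just a made-up file name, any data frame with
    # columns W, X, Y, Z, and D will do
    library(magrittr)
    d_stats <- read.table("d_stats.txt", header = TRUE)
    d_stats %>% fit_graph(graph) %>% plot()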

to fit the graph parameters and plot the fit.

 

Fitted data

You can also extract the fitted parameters from the result of fit_graph() using the coefficients() function, get the fitted values using the fitted() function, and in general use the usual interface for fitted models in R.

Except for confidence intervals with confint(). As I wrote above, I haven’t quite figured out how to do that yet.
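For example, with the data and graph from above:

    fit <- fit_graph(d_stats, graph)
    coefficients(fit)  # edge lengths and admixture proportions
    fitted(fit)        # expected statistics under the fitted graph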

It is not terribly solid code yet — it is more likely to crash with a meaningless error message than a meaningful one — but I am working on improving that. If anyone can find a use for it, and give some feedback, I would much appreciate it.

22 Jun

An early modern human from Romania with a recent Neanderthal ancestor

New interesting paper out: An early modern human from Romania with a recent Neanderthal ancestor by Fu et al.

Neanderthals are thought to have disappeared in Europe approximately 39,000–41,000 years ago but they have contributed 1–3% of the DNA of present-day people in Eurasia. Here we analyse DNA from a 37,000–42,000-year-old modern human from Pestera cu Oase, Romania. Although the specimen contains small amounts of human DNA, we use an enrichment strategy to isolate sites that are informative about its relationship to Neanderthals and present-day humans. We find that on the order of 6–9% of the genome of the Oase individual is derived from Neanderthals, more than any other modern human sequenced to date. Three chromosomal segments of Neanderthal ancestry are over 50 centimorgans in size, indicating that this individual had a Neanderthal ancestor as recently as four to six generations back. However, the Oase individual does not share more alleles with later Europeans than with East Asians, suggesting that the Oase population did not contribute substantially to later humans in Europe.

I actually heard about this back in May in Cold Spring Harbor, but it is great to be able to actually read the paper.

What especially excites me about this paper is that it hints that admixture between Neanderthals and modern humans was pretty common. For some values of “pretty” and “common”, at least. What we know so far about Neanderthal admixture is that all non-Africans have a little, but that Asians (and Native Americans) have a little more than Europeans — somewhat surprising — and that this is most likely because of a second admixture into Asians.

There was a suggestion that the higher level of Neanderthal ancestry in Asians was caused by weaker negative selection removing it in Asians, but a couple of papers argue against that (see e.g. Vernot & Akey 2015).

So there was the admixture event ancestral to all Eurasians, another into Asians, and now a third admixture event.

This individual — and I guess this admixture event — didn’t leave many (if any) genes in extant populations, so it isn’t part of the admixture whose results we see today, but it was an admixture event between modern humans and Neanderthals nevertheless.

There is a rule of thumb that says: “zero, one, or many” — either something never happens, it is rare enough that it only happens once, or it happens a lot. It is not a strong rule, but now that we have evidence for at least three admixture events it seems hard to argue that interbreeding between modern humans and Neanderthals was rare.

The strange thing now is how uniform the level of Neanderthal ancestry is in Eurasians after all. I think I would expect some geographic differences, but maybe such differences have just been evened out by later migration. I don’t know. I don’t really have a strong intuition for that.

Vernot, B., & Akey, J. M. (2015). Complex History of Admixture between Modern Humans and Neandertals. American Journal of Human Genetics.


28 May

Damn simple suffix array construction

As you may or may not know, suffix arrays are the key data structure for many string algorithms. Well, the key data structure is really the suffix tree, but suffix trees take up way too much memory, so we use suffix arrays instead: they can do the same things (with a little extra bookkeeping) while taking up a lot less memory.

We just finished our class on string algorithms this week, and Storm (my co-teacher) and I were talking about simple suffix array constructions. There are many different algorithms for constructing suffix arrays, several of them in linear time — although in practice these are often slower than algorithms with worse big-O complexities — but we were just talking about very easy implementations.

In C, where strings are just pointers to null-terminated byte buffers, you can get a representation of all suffixes of a string — that is, all substrings that end at the same point as the original string — simply by having pointers into that string. So it is easy to represent the set of all suffixes as an array of pointers.

The suffix array is just an array that gives the rank (the sorted order) of all the suffixes, so if you can sort such an array of suffixes you have constructed the suffix array.

In C you can do that just by sorting the array of pointers.

The worst-case complexity of this is pretty bad — you have O(n log n) comparisons for the sorting, and each comparison could in principle take time O(n) as well, since you are comparing strings that are on the order of n in length — but in practice you can compare strings a lot faster, since they typically do not share long prefixes (so the comparison terminates quickly), and the n log n factor doesn’t matter much when you are using an optimised sorting algorithm.

So you should be able to sort the suffixes in a few lines of code using qsort and strcmp.

You need a wrapper to make qsort work with strcmp, but other than that, it is really that simple.

Below is my implementation of that. It is a complete program that takes a string on the command line and outputs the suffix array for it.

I also made a version that reads the string from a file, and on my laptop I can build the suffix array for a file of around a hundred megabases (a human chromosome) in a minute or so, so the performance isn’t that bad.

Here is my simple implementation:
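    /* A minimal version of the approach described above: represent every
     * suffix as a pointer into the input string, sort the pointers with
     * qsort() using strcmp() for the comparisons, and print the resulting
     * suffix array. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* qsort() hands the comparison function pointers to the array elements,
     * i.e. pointers to char pointers, so we need this small wrapper. */
    static int compare_suffixes(const void *a, const void *b)
    {
        const char *const *sa = a;
        const char *const *sb = b;
        return strcmp(*sa, *sb);
    }

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "Usage: %s string\n", argv[0]);
            return EXIT_FAILURE;
        }

        const char *s = argv[1];
        size_t n = strlen(s);
        if (n == 0)
            return EXIT_SUCCESS; /* nothing to do for the empty string */

        /* one pointer per suffix: suffixes[i] points at s + i */
        const char **suffixes = malloc(n * sizeof *suffixes);
        if (!suffixes)
            return EXIT_FAILURE;
        for (size_t i = 0; i < n; i++)
            suffixes[i] = s + i;

        qsort(suffixes, n, sizeof *suffixes, compare_suffixes);

        /* the first column is the suffix array: the starting index of each
         * suffix in lexicographical order */
        for (size_t i = 0; i < n; i++)
            printf("%zu\t%s\n", (size_t)(suffixes[i] - s), suffixes[i]);

        free(suffixes);
        return EXIT_SUCCESS;
    }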

05 Jan

Admixture thoughts

OK, this is just me admitting that I have been rather stupid in my way of thinking about this until recently. I’m just sharing it so you don’t make the mistakes that I have been making.

For a little while I was working on a coalescent model for admixture where I was thinking in terms of tracing lineages back in an admixed population until those lineages moved into source populations where they could then coalesce with lineages there.

In my mind, the scenario looked like this:

Simple admixture scenario

It is a nice model: easy to work with and easy to reason about. It is just very likely to be wrong.

If you have three populations, where one is admixed between two of them, how likely is it that the admixed population directly obtained genes from those two? Not bloody likely is what it is.

Much more likely is a scenario like this:

More realistic admixture scenario

The “source” populations are not directly the source populations; they are merely related to the source populations. And they are not necessarily equally related to them.

If you trace the lineages from population C back to the admixture time, you won’t be able to coalesce with lineages from A or B at that time. You can only coalesce further back in time, when the actual source populations merge with populations A and B — and that can happen a lot further back in time, and at very different time points, for A and B.

It isn’t that much harder to model. You need to model that lineages from population C at some point get separated into two different populations that cannot coalesce with each other, and then have to wait a bit further back before they can coalesce with lineages from A or B, but it isn’t that hard.
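Just to make the bookkeeping concrete, here is a toy sketch of that last point (the function, the parameter names, and the numbers are all made up for illustration):

    # trace a single lineage from the admixed population C back in time and
    # record the earliest time at which it can coalesce with lineages from A
    # and from B under the second scenario above
    earliest_meeting_times <- function(alpha, t_admix, t_merge_A, t_merge_B, t_root) {
      stopifnot(t_admix <= t_merge_A, t_admix <= t_merge_B)
      # at the admixture event the lineage moves into the (unsampled) population
      # related to A with probability alpha, otherwise into the one related to B
      if (runif(1) < alpha) {
        # it can meet A lineages once its population merges with A,
        # but B lineages only at the root
        c(A = t_merge_A, B = t_root)
      } else {
        c(A = t_root, B = t_merge_B)
      }
    }

    # e.g. 40% ancestry from the A side, admixture 100 generations ago, merges
    # with A and B 1,000 and 5,000 generations ago, everything joined at 10,000
    set.seed(1)
    earliest_meeting_times(alpha = 0.4, t_admix = 100,
                           t_merge_A = 1000, t_merge_B = 5000, t_root = 10000)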

I just didn’t think about that, and now I feel really stupid.

Realising the full complexity of what you have to work with makes it all a lot more interesting, though.