Fast admixture analysis and population tree estimation for SNP and NGS data

Jade Yu Cheng, Thomas Mailund, and Rasmus Nielsen

Bioinformatics (2017)

Motivation: Structure methods are highly used population genetic methods for classifying individuals in a sample fractionally into discrete ancestry components.
Contribution: We introduce a new optimization algorithm for the classical STRUCTURE model in a maximum likelihood framework. Using analyses of real data we show that the new method finds solutions with higher likelihoods than the state-of-the-art method in the same computational time. The optimization algorithm is also applicable to models based on genotype likelihoods, that can account for the uncertainty in genotype-calling associated with Next Generation Sequencing (NGS) data. We also present a new method for estimating population trees from ancestry components using a Gaussian approximation. Using coalescence simulations of diverging populations, we explore the adequacy of the STRUCTURE-style models and the Gaussian assumption for identifying ancestry components correctly and for inferring the correct tree. In most cases, ancestry components are inferred correctly, although sample sizes and times since admixture can influence the results. We show that the popular Gaussian approximation tends to perform poorly under extreme divergence scenarios e.g. with very long branch lengths, but the topologies of the population trees are accurately inferred in all scenarios explored. The new methods are implemented together with appropriate visualization tools in the software package Ohana.

KMP for comparison

For comparison with the previous post, this is how KMP would look like:

static void kmp_search(char * key, char * buffer)
    unsigned long n = strlen(buffer);
    unsigned long m = strlen(key);
    unsigned long ba[m];
    ba[0] = 0;
    for (unsigned long i = 1; i < m; ++i) {
        unsigned long b = ba[i-1];
        while (b > 0 && key[i] != key[b])
            b = ba[b-1];
        ba[i] = (key[i] == key[b]) ? b + 1 : 0;
    unsigned long i, j; i = j = 0;
    while (i < n - m + j) {
        while (buffer[i] == key[j] && j < m) {
            ++i; ++j;
        if (j == m) printf("%lu\n", i - m);
        if (j == 0) ++i;
        else j = ba[j-1];

This algorithm is doing essentially the same thing as the border-array based algorithm, but the two loops look very different and you need to get both of the right. The previous algorithm also has two loops, but they are doing the same thing, so if you get one of the right you will very easily get the other right as well.

Beginner’s Guide to Todoist

Back over Easter Amir Salihefendic and I wrote a guide to Todoist, the todo list manager that his company makes. It took a little longer for us to finally make the guide public—Amir had some of his designers make a better cover than I had hacked together with my quite limited skills and had one of his guys copyedit it first—but now you can get it on Amazon and if you get it before June 4 you can even get it for free.

Preorder Advanced Object-Oriented Programming in R

You can now preorder Advanced Object-Oriented Programming in R: Statistical Programming for Data Science, Analysis, and Finance on Amazon.

I only just got through the technical reviews this week and it won’t be out until the autumn, but you can preorder it already.

I don’t actually write the book descriptions they put on Amazon, Apress does that, but this is what it says:

Learn how to write object-oriented programs in R and how to construct classes and class hierarchies in the three object-oriented systems available in R. This book gives an introduction to object-oriented programming in the R programming language and shows you how to use and apply R in an object-oriented manner. You will then be able to use this powerful programming style in your own statistical programming projects to write flexible and extendable software. After reading Advanced Object-Oriented Programming in R, you’ll come away with a practical project that you can reuse in your own analytics coding endeavors. You’ll then be able to visualize your data as objects that have state and then manipulate those objects with polymorphic or generic methods. Your projects will benefit from the high degree of flexibility provided by polymorphism, where the choice of concrete method to execute depends on the type of data being manipulated.
It is a short book, but it describes the different OO systems available in R, and while it doesn’t really show any really advanced use of objects and classes, as such, it does have examples of how to use object-orientation to write your own statistical models and it does have some advanced examples of using operator overloading to extend the language.

Meta-programming in R update

I sent the manuscript for Meta-programming in R to Apress a few days ago. Usually, there is a technical review and then I have to fix a few things before it goes into production. This time around, there were no changes required so it went directly to production. I am looking forward to seeing it in print. Because it is sold now, you cannot get it any longer on Amazon; you just have to wait.