Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genotype Data

Yesterday, Bioinformatics accepted this paper of mine:

Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genotype Data

Z. Ding, T. Mailund and Y.S. Song


Motivation: Recent advances in genotyping technology have made data acquisition for whole-genome association studies cost-effective, and an active area of current research is developing efficient methods to analyze such large-scale data sets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods, and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this paper, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological data sets.

It is an extension of our Blossoc method.  Blossoc is an association mapping method that constructs perfect phylogenies (trees compatible with the genotypes) along the genome and tests whether these trees tend to cluster cases and controls.  If they do, we take it as evidence of local genotypes associated with the disease.
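Blossoc's tree construction itself is not shown here, but the notion of compatibility can be illustrated with the classic four-gamete test: for binary haplotype data under the infinite-sites assumption, a perfect phylogeny exists exactly when every pair of sites exhibits at most three of the four possible gametes.  A small sketch of that test (my own illustration, not Blossoc code):

```python
from itertools import combinations

def four_gamete_compatible(haplotypes, i, j):
    """Two binary sites are compatible with a perfect phylogeny
    iff at most three of the four gametes 00, 01, 10, 11 occur."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4

def admits_perfect_phylogeny(haplotypes):
    """A binary haplotype matrix admits a perfect phylogeny iff
    every pair of sites passes the four-gamete test."""
    n_sites = len(haplotypes[0])
    return all(four_gamete_compatible(haplotypes, i, j)
               for i, j in combinations(range(n_sites), 2))

# Toy data: rows are haplotypes, columns are binary SNP sites.
compatible = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
incompatible = [(0, 0), (0, 1), (1, 0), (1, 1)]  # all four gametes occur
print(admits_perfect_phylogeny(compatible))    # True
print(admits_perfect_phylogeny(incompatible))  # False
```

Note that this test needs phased haplotypes, which is exactly the complication with genotype data that the paper addresses.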

In the original Blossoc paper we used a very simple algorithm for inferring the local phylogenies. This algorithm only works for phased data, however, and real data is never phased.  The phase has to be inferred, and that is more time-consuming than the actual association test.

In the new paper we use an algorithm that can construct the phylogenies directly from unphased data.

One problem that we haven’t quite solved yet is that we are working with perfect phylogenies, trees that exactly match the genotypes, and such trees can only be constructed for very short intervals in genotype data, at least with genome-wide association study data covering thousands of individuals.

We still have some hacks to get around that, but those only work with phased data, so even with our new method we fall back on inferring the phase information, although only locally, before we build some of the trees.

This is, by far, the most time-consuming part of the algorithm right now, so I hope we can improve on this in the future.  Maybe construct “nearly perfect phylogenies” or something similar directly from unphased data.  I’m not sure how to get there, but I think it is worth researching…

The science of tapping beer cans

I’m linking to a Danish page here, sorry to those of you who do not quite master that language yet.

It concerns the age-old question of whether you can prevent a shaken beer can from spilling over when you open it by first tapping the top of the can.  A lot of people do this, but does it have any effect?

Klaus Seiersen, incidentally an old drinking buddy of mine from the physics department, did an experiment with 30 cans and didn’t see any correlation between tapping and the amount spilled.

You spill your beer when it foams out of the can, and the foam is caused by the CO2 in the beer.  When you shake the can, the CO2 gets mixed with the beer, and the beer foams. To prevent this, all you have to do is wait until the CO2 settles at the top of the can again…

NGS whitepaper from CLC bio … maybe

According to this press release, CLC bio has released a whitepaper describing their assembly algorithm.  I’m very interested in reading it.  The speed sounds impressive, and assembly is an interesting algorithmic problem.  There’s just one problem: the link to the paper isn’t a link to a paper at all!  It’s a web form that lets you apply for it … and after filling out the form they kindly tell you that they’ll get back to you.

Sorry guys, this is too lame!  Just give me the damn paper!

How deterministic is neighbour-joining anyway?

Whenever we try to publish an algorithm for speeding up neighbour-joining, at least one reviewer will ask us if we have checked that our new algorithm generates the same tree as the “canonical” neighbour-joining algorithm does.

This sounds reasonable, but it isn’t really.  The neighbour-joining method is not as deterministic as you’d think.

Well, it is a deterministic algorithm in the sense that the same input will always produce the same output, but equivalent input will not necessarily produce the same output.

The neighbour-joining algorithm

The neighbour-joining algorithm is a phylogenetic inference method that takes as input a matrix of pairwise distances between taxa and outputs a tree connecting the taxa. It is a heuristic for producing “minimum evolution” trees, that is, trees where the total branch length is as small as possible (while still, of course, reflecting the distances between the input taxa).  A heuristic is needed, since optimising the minimum evolution criterion is computationally intractable, and neighbour-joining strikes a good balance between computational efficiency and inference accuracy.

The heuristic is not a “heuristic optimisation algorithm” in the sense that simulated annealing or evolutionary algorithms are.  It is not randomised. It is a greedy algorithm that in each step makes a locally optimal choice, which hopefully leads to a good solution in the end (although locally optimal choices are not guaranteed to lead to a globally optimal one).

A problem with greedy algorithms is that the local choices can drastically change the final result.  Not all greedy algorithms have this problem, of course.  Some are guaranteed to reach a global optimum, and if there is only a single global optimum, the local choices will not affect the final outcome (but if there is more than one global optimum, the local choices will still influence the final result).

In the neighbour-joining algorithm, the greedy step consists of joining the two clusters that minimise a certain criterion.  If two or more pairs minimise the criterion, a choice must be made, and that choice will affect the following steps.

Equivalent distance matrices do not mean identical input

Unless the choice is resolved randomly (and I’m assuming that it isn’t), the algorithm is deterministic.  It could, for example, always choose the first or the last pair minimising the criterion.  How this choice should be resolved is not specified in the algorithm, so anything goes, really.

If the choice is resolved by picking the first or last pair, then the exact form of the input matters.

The distance matrix neighbour-joining uses as input gives us all the pairwise distances between our taxa, but to represent it we need to order the taxa to assign row and column numbers.  This ordering is completely arbitrary, and any ordering gives us an equivalent distance matrix, though not an identical one.

If we resolve the local choice in a way that depends on the row and column order, then two equivalent input matrices might produce very different output trees.
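To make this concrete, here is a small sketch (my own illustration, not code from our papers) showing that reordering the taxa yields an equivalent matrix on which an index-based tie-break picks a different pair:

```python
def permute(D, names, perm):
    """Reorder taxa by perm: an equivalent, but not identical, matrix."""
    return ([[D[a][b] for b in perm] for a in perm],
            [names[k] for k in perm])

def first_min_pair(D):
    """First pair (in row-major order) minimising the NJ Q criterion."""
    n = len(D)
    r = [sum(row) for row in D]
    best, best_q = None, float('inf')
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - r[i] - r[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best

# A matrix where the pairs (a, b) and (c, d) both minimise Q.
D = [[0, 5, 9, 9], [5, 0, 10, 10], [9, 10, 0, 8], [9, 10, 8, 0]]
names = ['a', 'b', 'c', 'd']
i, j = first_min_pair(D)
print(names[i], names[j])                     # a b
Dp, namesp = permute(D, names, [2, 3, 0, 1])  # list c and d first instead
i, j = first_min_pair(Dp)
print(namesp[i], namesp[j])                   # c d
```

Same distances, same taxa, different first join, purely because of the row and column order.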

An experiment

To see how much this matters in practice, I made a small experiment.  I took 37 distance matrices based on Pfam sequences, with between 100 and 200 taxa.  You can see the matrices here.  There are 39 of them, but I had to throw two of them away, AIRC and Acyl_CoA_thio, since they had taxa with the same name, and with that I cannot compare the resulting trees.  These are matrices I created for this paper, but I don’t remember the details… they are not important for the experiment, though.

[Figure: RF distance for “identical” NJ trees]

For each matrix I constructed ten equivalent distance matrices by randomly permuting the order of the taxa. I then used these ten matrices to construct ten neighbour-joining trees, and finally computed the Robinson-Foulds distance between all pairs of these trees.
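The Robinson-Foulds distance simply counts the splits (bipartitions of the taxa) found in one tree but not the other.  A self-contained sketch for trees represented as nested tuples, again illustrative rather than the code used for the experiment:

```python
def leaves(t):
    """Set of leaf labels under a (possibly nested) tuple tree."""
    if isinstance(t, tuple):
        return frozenset().union(*(leaves(c) for c in t))
    return frozenset([t])

def splits(tree):
    """Nontrivial bipartitions induced by the internal edges of the
    tree, each represented by a canonical side (the lexically smaller
    one) so that the two orientations of a split compare equal."""
    all_leaves = leaves(tree)
    out = set()
    def walk(t):
        if not isinstance(t, tuple):
            return
        side = leaves(t)
        if 1 < len(side) < len(all_leaves) - 1:
            out.add(min(side, all_leaves - side, key=sorted))
        for c in t:
            walk(c)
    walk(tree)
    return out

def rf_distance(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))

t1 = (('a', 'b'), ('c', 'd'))
t2 = (('a', 'c'), ('b', 'd'))
print(rf_distance(t1, t1))  # 0
print(rf_distance(t1, t2))  # 2
```

Two trees over the same taxa get distance zero exactly when they encode the same set of splits, i.e. the same unrooted topology.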

To see a plot of the result, click on the figure on the left.

As you can see, you can get quite different trees out of equivalent distance matrices, and on realistic data too.

I didn’t check whether the differences between the trees are correlated with how much support the splits have in the data.  Looking at bootstrap values for the trees might tell us something about that.  Anyway, that is not the take-home message I want to give here.  The message is that it doesn’t necessarily make sense to talk about “the” neighbour-joining tree for a distance matrix.

Neighbour-joining is less deterministic than you might think!