Crappy review report

We just got reviews back from a paper we submitted a few months ago on an algorithmic improvement on the neighbour joining method.  One of them was this:

For large-scale phylogeny reconstruction parsimony and likelihood are the preferred methods. Both are more accurate than neighbor joining (particularly large datasets). It is not clear to me if RapidDiskNJ is a sufficient advance to publish in this journal.

That’s it. The complete review.

Not a single comment on the actual contents of the paper.  Just a blank rejection of the neighbour joining method in general.

And this is an algorithmic improvement.  The paper is about speed and about handling really large data sets, with tens or hundreds of thousands of taxa.  Parsimony and likelihood methods simply do not scale to those data sizes.  At least not unless you throw very large clusters at the problem we solve on a desktop computer.

We improve the speed by several orders of magnitude compared to other implementations of the method, and in comparison with likelihood methods there really is no competition at all.

With reviews like this, peer reviewing just fails!


Rapid NJ

Our new neighbour-joining paper (I wrote about it here, but it was called “Accelerated Neighbour-Joining” then) just came online:

Rapid Neighbour-Joining

Martin Simonsen, Thomas Mailund  and Christian N. S. Pedersen

Abstract
The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The criterion for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criteria used for selecting the next pair to merge and we show empirically that the new search strategy is superior to other state-of-the-art neighbour-joining implementations.

I have put the source code plus binaries for Linux (Debian and RedHat), Windows and Mac OS X (i386) on this web page.

How deterministic is neighbour-joining anyway?

Whenever we try to publish an algorithm for speeding up neighbour-joining, at least one reviewer will ask us if we have checked that our new algorithm generates the same tree as the “canonical” neighbour-joining algorithm does.

This sounds reasonable, but it isn’t really.  The neighbour-joining method is not as deterministic as you’d think.

Well, it is a deterministic algorithm in the sense that the same input will always produce the same output, but equivalent input will not necessarily produce the same output.

The neighbour-joining algorithm

The neighbour-joining algorithm is a phylogenetic inference method that takes as input a matrix of pairwise distances between taxa and outputs a tree connecting the taxa.  It is a heuristic for producing “minimum evolution” trees, that is, trees where the total branch length is as small as possible (while still, of course, reflecting the distances between the input taxa).  A heuristic is needed, since optimising the minimum evolution criterion is computationally intractable, and the neighbour-joining algorithm is a heuristic that strikes a good balance between computational efficiency and inference accuracy.

The heuristic is not a “heuristic optimisation algorithm” in the sense that simulated annealing or evolutionary algorithms are.  It is not randomised.  It is a greedy algorithm that in each step makes a locally optimal choice, which hopefully eventually leads to a good solution (although locally optimal choices are not guaranteed to lead to a globally optimal one).

A problem with greedy algorithms is that local choices can drastically change the final result.  Not all greedy algorithms, of course.  Some are guaranteed to reach a global optimum, and if there is only a single global optimum then the local choices will not affect the final outcome (but if there is more than one global optimum, the local choices will still influence the final result).

In the neighbour-joining algorithm, the greedy step consists of joining the two clusters that minimise a certain criterion.  If two or more pairs minimise the criterion, a choice must be made, and that choice will affect the following steps.
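To make the greedy step concrete, here is a minimal sketch of the canonical algorithm in Python.  It is a toy rendering for illustration, not the implementation from any of our papers: branch lengths are omitted, trees come out as nested tuples, and, importantly, ties are broken by taking the first minimising pair in row-major order, which is exactly the kind of arbitrary choice discussed below.

```python
# A minimal sketch of canonical neighbour-joining (illustration only, not
# the paper's implementation).  Branch lengths are omitted; the tree is
# returned as nested tuples of taxon names.

def neighbour_joining(D, names):
    """D: symmetric distance matrix (list of lists); names: taxon labels."""
    D = [row[:] for row in D]          # work on a copy
    nodes = list(names)                # each cluster starts out as a leaf
    while len(nodes) > 2:
        n = len(nodes)
        row_sums = [sum(row) for row in D]
        # The NJ criterion: Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        best, best_q = None, None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
                if best_q is None or q < best_q:   # ties: first pair wins!
                    best, best_q = (i, j), q
        i, j = best
        # Distances from the new internal node to the remaining clusters.
        new_row = [0.5 * (D[i][k] + D[j][k] - D[i][j])
                   for k in range(n) if k != i and k != j]
        # Replace clusters i and j with the joined cluster.
        nodes = ([nodes[k] for k in range(n) if k != i and k != j]
                 + [(nodes[i], nodes[j])])
        D = [[D[a][b] for b in range(n) if b != i and b != j]
             for a in range(n) if a != i and a != j]
        for row, d in zip(D, new_row):
            row.append(d)
        D.append(new_row + [0.0])
    return (nodes[0], nodes[1])
```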

Equivalent distance matrices do not mean identical input

Unless the choice is resolved randomly, and I’m assuming that it isn’t, the algorithm is deterministic.  It could, for example, always choose the first or the last pair minimising the criterion.  How this choice should be resolved is not specified in the algorithm, so anything goes, really.

If the choice is resolved by picking the first or the last pair, then the exact form of the input matters.

The distance matrix neighbour-joining uses as input gives us all the pair-wise distances between our taxa, but to represent it we need to order the taxa to get row and column numbers.  This ordering is completely arbitrary, and any ordering gives us an equivalent distance matrix. Not an identical one, though.

If we resolve the local choice in a way that depends on the row and column order, then two equivalent input matrices might produce very different output trees.
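To see this in action, here is a contrived example of my own (not from any of the papers), using the neighbour_joining sketch above.  With four taxa at equal pairwise distances, every pair minimises the criterion, so the “first pair wins” rule joins whichever pair the input order happens to put first, and permuting the matrix changes the resulting topology.

```python
# Permuting the taxon order gives an equivalent, but not identical, input
# matrix.  With first-pair tie-breaking, equivalent inputs can produce
# different topologies.  (Toy example; uses neighbour_joining from above.)

def permuted(D, names, order):
    """Reorder a distance matrix and its labels: an equivalent input."""
    return ([[D[i][j] for j in order] for i in order],
            [names[i] for i in order])

D = [[0, 1, 1, 1],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]
names = ["a", "b", "c", "d"]

print(neighbour_joining(D, names))
# (('a', 'b'), ('c', 'd')) -- a and b are joined first
print(neighbour_joining(*permuted(D, names, [0, 2, 1, 3])))
# (('a', 'c'), ('b', 'd')) -- a and c are joined first
```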

An experiment

To see how much this matters in practice, I made a small experiment.  I took 37 distance matrices based on Pfam sequences, with between 100 and 200 taxa.  You can see the matrices here.  There are 39 of them, but I had to throw two of them away — AIRC and Acyl_CoA_thio — since they had taxa with the same name, and then I cannot compare the resulting trees.  They are matrices I created for this paper, but I don’t remember the details… they are not important for the experiment, though.

For each matrix I constructed ten equivalent distance matrices by randomly permuting the order of the taxa.  Then I used these ten matrices to construct ten neighbour-joining trees, and finally computed the Robinson-Foulds distance between all pairs of these trees.
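In code, the experiment looks roughly like the sketch below, continuing the toy example above (the toy D stands in for the Pfam matrices, and the split-based RF helper is my own, not the tool I actually used):

```python
# Build trees from randomly permuted (equivalent) matrices and count how
# many splits differ between each pair of trees: the Robinson-Foulds
# distance.  Uses neighbour_joining, permuted, D and names from above.
import random
from itertools import combinations

def splits(tree, all_taxa):
    """Non-trivial bipartitions of the tree, each canonicalised as the
    side that does not contain a fixed reference taxon."""
    ref = min(all_taxa)
    found = set()
    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = leaves(node[0]) | leaves(node[1])
        side = below if ref not in below else all_taxa - below
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return below
    leaves(tree)
    return found

def rf_distance(t1, t2, all_taxa):
    """Number of splits found in one tree but not the other."""
    return len(splits(t1, all_taxa) ^ splits(t2, all_taxa))

taxa = frozenset(names)
trees = []
for _ in range(10):
    order = random.sample(range(len(names)), len(names))
    trees.append(neighbour_joining(*permuted(D, names, order)))
print([rf_distance(s, t, taxa) for s, t in combinations(trees, 2)])
```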

[Figure: RF distances for “identical” NJ trees]

As you can see, you can get quite different trees out of equivalent distance matrices and on realistic data too.

I didn’t check if the differences between the trees are correlated with how much support the splits have in the data.  Looking at bootstrap values for the trees might tell us something about that.  Anyway, that is not the take-home message I want to give here.  The message is that it doesn’t necessarily make sense to talk about the neighbour-joining tree for a distance matrix.

Neighbour-joining is less deterministic than you might think!

It is not all bad news

Okay, yeah, so I broke my iMac today, but there is also good news.  We just got another paper accepted, this time a conference paper at this year’s WABI.

Since it is on neighbour-joining, we weren’t that optimistic.  We’ve had problems publishing on this before, but this time it was very well received.

Accelerated neighbour-joining

M. Simonsen, T. Mailund and C.N.S. Pedersen

Abstract

The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The criterion for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criteria used for selecting the next pair to merge and we show empirically that the new search strategy is superior to other state-of-the-art neighbour-joining implementations.

It’s really Martin Simonsen’s work.  He is a Master’s student at BiRC, and in one of our algorithmics courses the students were asked to implement neighbour-joining and try to speed it up.  Usually, they come up with some clever ideas, but they had never before managed to beat my own version, QuickJoin.

Martin did come up with a faster approach.  Well, pretty close, anyway.  With a bit of help from Christian Storm and me, he fixed a few things here and there and sped his approach up to one that not only beats QuickJoin but also all the other methods we could get our hands on.

QuickJoin uses a lot of tricks to speed up the search for which nodes to join, but its data structures make it slow on small data sets and also rather memory hungry.  Martin’s approach is much simpler, which helps it a lot on the small data sets and doesn’t seem to hurt it on the larger ones.

As with QuickJoin, the trick is to only look at pairs of nodes that can potentially be joined, and to avoid looking at pairs that we can rule out as the next to be joined.

Instead of using quad-trees and various functions to rule out pairs, Martin simply sorts the nodes so that the most likely pairs are considered first, and so that we can recognize when the remaining pairs cannot be better than those we have already seen.  Read the actual paper — it is quite easy to understand the algorithm from there — if you want the details.
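For a flavour of the idea, here is a simplified rendering of the search step in Python.  It is my own sketch of how I read the method, not Martin’s actual code: each row of the distance matrix is kept sorted by distance, and the scan of a row stops as soon as a lower bound on the criterion exceeds the best value seen so far.  (A real implementation would of course maintain the sorted rows across iterations instead of rebuilding them for every search.)

```python
# Sketch of a bounded search for the minimising pair (my reading of the
# approach; details differ from the actual implementation).  Uses the
# rescaled criterion q(i, j) = d(i, j) - u(i) - u(j), u(i) = r(i)/(n - 2),
# which has the same minimiser as the usual Q.  Assumes n > 2.

def find_min_pair(D):
    n = len(D)
    u = [sum(row) / (n - 2) for row in D]
    u_max = max(u)
    # Each row's entries sorted by increasing distance: best pairs first.
    sorted_rows = [sorted((D[i][j], j) for j in range(n) if j != i)
                   for i in range(n)]
    best_q, best = float("inf"), None
    for i in range(n):
        for d, j in sorted_rows[i]:
            # Since u(j) <= u_max, we know q(i, j) >= d - u(i) - u_max, and
            # later entries in the row only have larger d.  Once the bound
            # is no better than best_q, the rest of the row cannot win.
            if d - u[i] - u_max >= best_q:
                break
            q = d - u[i] - u[j]
            if q < best_q:
                best_q, best = q, (i, j)
    return best
```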


Simonsen, M., Mailund, T., and Pedersen, C.N.S. Accelerated neighbour-joining. Proceedings of WABI 2008.

You know, people do use neighbour joining!

Over the last couple of years, I have done a little work on phylogeny inference, including a few papers on neighbour joining.  One thing that consistently happens when you submit a paper on this — and I bring it up because I have just gotten reviewer reports back on such a paper — is that at least one reviewer will tell you that neighbour joining is not interesting and that one should focus on maximum likelihood or Bayesian trees instead.

Sorry to say it, but people do use neighbour joining — I am willing to bet that there are ten times as many people using neighbour joining to infer trees as there are people using the statistical approaches — so algorithmic improvements here do matter!

The statistical approaches are usually more accurate, and they are better at capturing the uncertainty in the inference and such, but they are slow!  Not slow as in “I’ll go get a cup of coffee while the program finishes”, but slow as in “I’ll look at the tree when I am back from my vacation”.

Sure, they are fast enough for tens of leaves, but some people infer trees with thousands of leaves.  I recently got an email from a guy who tried with tens of thousands of leaves and ran out of memory using one of my tools — it needed more than 4 GB, so it choked on the problem (but a student in our lab has now come up with a new algorithm that is less memory hungry, so that should solve that problem).

For large trees, forget about ML or Bayesian approaches.  They do not scale (yet).

People do use neighbour joining, so shut up and review the paper for what it is, not what you want it to be. Grrr!