It is not all bad news
Okay, yeah, so I broke my iMac today, but there is also good news. We just got another paper accepted, this time a conference paper at this year's WABI.
Since it is on neighbour-joining, we weren't that optimistic. We've had problems publishing on this before, but this time it was very well received.
M. Simonsen, T. Mailund and C.N.S. Pedersen
Abstract
The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The criteria for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criteria used for selecting the next pair to merge and we show empirically that the new search strategy is superior to other state-of-the-art neighbour- joining implementations.
It's really Martin SImonsen's work. He is a Mater's student at BiRC and in one of our algorithmics courses the students were asked to implement neighbour-joining and try to speed it up. Usually, they come up with some clever ideas, but they never before managed to beat my own version, QuickJoin.
Martin did come up with a faster approach. Well, pretty close to, anyway. With mine and Christian Storm's help, we managed to fix a few things here and there, and speed his approach up to one that not only beats QuickJoin but also all other methods we could get our hands on.
QuickJoin uses a lot of tricks to speed up the search for nodes to join in the algorithm, but the data structures makes it slow on small data sets and also rather memory hungry. Martin's approach is much simpler and this helps it a lot in the small data sets and doesn't seem to hurt it on the larger data sets.
As for QuickJoin, the trick is to only look at pairs of nodes that can potentially be joined and avoid looking at nodes that we can rule out as the next pair to be joined.
Instead of using quad-trees and various functions to rule out pairs, Simon simply sorts nodes in a way where most likely pairs are considered first, and such that we can recognize when new pairs will not be better than those we have already seen. Read the actual paper -- it is quite easy to understand the algorithm from there -- if you want the details.
Simonsen, M., Mailund, T., Pedersen, C.N. Accelerated neighbour-joining. Proceedings of WABI 2008
June 19th, 2008 at 6:39 pm
Interesting. I use QuickTree as part of a pipeline for looking at metagenomic data, but this looks far faster -- I'll have to check it out. I hadn't realized that there was much more to be done with optimizing neighbor-joining.
June 19th, 2008 at 6:47 pm
I didn't think so either. There are lots of ways of approximating the search procedure to get something that is faster (and probably just as good as NJ), but actually getting the "real" NJ trees is tricky if you want to run faster than O(n3).
Actually, the main problem with QuickTree when running on data with very many taxa is that it is too memory hungry. It uses about four times as much memory as the simple NJ algorithm. It is only a constant, but for >10.000 taxa, that means you start swapping on a typical machine, and then it is painfully slow.
We were looking at ways to make the search IO efficient when Martin came up with his algorithm, but now we won't bother with that much.
Let me know if you want the code for Martin's NJ program. We'll put it on the web soon, but I don't think you can find it anywhere yet.
June 19th, 2008 at 6:51 pm
Sorry, when I wrote QuickTree above I meant QuickJoin. QuickTree is a fast implementation of the vanilla NJ algorithm. It is not that hard to beat with a bit of algorithmic engineering, though.
June 20th, 2008 at 7:37 am
The source code for our new program can be found here: http://www.daimi.au.dk/~kejseren/webpage/php/sortjoin.php