New paper on computing the quartet distance

We just got another paper accepted:

A quadratic time algorithm for computing the quartet distance between two general trees T. Mailund, J. Nielsen, and C.N.S. Pedersen

To be presented at the Workshop on Bioinformatics Algorithms at IJCBS 2009 August 2009, Shanghai, China

Abstract

We derive a quadratic time and space algorithm for computing the quartet distance between a pair of general trees, i.e. trees where inner nodes can have any degree ≥ 3. The time and space complexity of our algorithm is quadratic in the number of leaves and does not depend on the degree of the inner nodes. This makes it the fastest algorithm for computing the quartet distance between general trees independent of the degree of the inner nodes.

The quartet distance

The quartet distance is a metric for comparing two phylogenetic trees over the same set of taxa.

It works as follows: For each sub set of taxa of size four (a quartet), \{a,b,c,d\} you consider the tree topology in the two trees.  The tree topology for \{a,b,c,d\} can be one of the four shown below.  Now, for each such subset you count how often the two phylogenetic trees agree on the quartet topology and how often they do not, and the number of topologies where they differ is the quartet distance between the two trees.

There are n \choose 4 quartets in a tree with n leaves, so explicitly enumerating the topologies and comparing them this way would take at least time in O(n^4) but through various tricks this can be reduced.

Algorithms for computing the quartet distance

The first algorithms for computing the quartet distance focused on binary trees.  For binary trees, the star topologi, topology (d) above, is not possible, and that simplifies the problem a bit.

The fastest algorithm for binary trees so far is time O(n\log n).  Incidentally, my very first bioinformatics paper was an application note about an implementation of pretty much that algorithm.  Not quite, my implementation was of an O(n\log^2n) algorithm, but the last \log n comes at a considerable code complexity, so I didn’t bother with it.

People didn’t seem to bother with non-binary trees for some reason.  I guess we are just used to trees being binary in bioinformatics that we don’t think about it much (even when we should, and when we shouldn’t resolve inner nodes when we are really not sure how to resolve them).

Anyway, here at BiRC we got interested in it because we had three Master’s students who wanted to do some algorithmics in bioinformatics and we asked them to look at the quartet distance and see what they could do we general trees.

That turned out extremely successfully.  They worked very well together, and in short time produced several algorithms and got them written down in papers:

The first paper describes two algorithms for computing the quartet distance between general degree trees.  One independent of the maximal degree, running in time O(n^3), the other dependant on it, running in time O(n^2d^2) where d is the maximal degree.  Both algorithms are dynamic programming algorithms.

The second paper extends the O(n\log{}n) algorithm for binary trees to general trees, but at a running time of O(d^9n\log n) — so sub quadratic in n but with a rather high complexity in d^9 (to say nothing of the “mental complexity” of the algorithm.  I would certainly not like to have to implement it).

The third paper uses a lot of combinatorical tricks to reduce the O(n^2d^2) algorithm from the first paper to O(n^2d).

Our new algorithm

Our new algorithm isn’t really that new at all.  It is an algorithm I came up with on the train back from a workshop where I had presented an earlier version of the algorithms in the first paper above.

I couldn’t get the complexity analysis right, though, and concluded that it was probably O(n^2d^2) so no improvement on what we already had.  Thus I just put it in a drawer and forgot about it.

At least until Christian forwarded a call-for-papers to me and asked me if I had any old results lying around.  I digged through my old abandoned ideas and found this one.  Since it is some rather simple combinatorics, similar to the dynamic programming algorithms we had in the very first paper, it might be possible come up with a simpler O(n^2d), algorithm than the one we had, I thought, or maybe just consider a sub-set of trees, like all d-ary (not necessarily binary) trees.

I didn’t have much time to put into it, but Jesper, a PhD student here at BiRC did, so I gave it to him.  He basically just re-did the complexity analysis and concluded that the running time was O(n^2) so without further ado we sent in the paper, and now we (well, most likely Jesper) will present it in Shanghai this summer.

94-116=-22

Tags: ,

2 Responses to “New paper on computing the quartet distance”

  1. Jonathan Badger Says:

    Hey, you actually cite a paper (Bryant et al, 2000) from the group (Ming Li’s) where I did my postdoc! Quartets — very nostalgic for me — we were really into them at the time. Nice to see people are still working on the subject.

  2. Thomas Mailund Says:

    The new algorithm is actually an extension of theirs.

    It is a dynamic programming algorithm where you use that you can count the number of leaves in the intersection of sub-trees from the two input trees. The problem with this approach is that to pick pairs of trees around an inner node, you have to pick (d choose 2) trees when the node has degree d. The trick is in avoiding that this takes too much time.

Leave a Reply