Today we published a journal version of a computer science paper from APBC 2006 (in a special issue volume of the best papers selected from the conference). We (the same group of authors) had two papers at the conference, both related to calculating something called the quartet distance between trees.
The paper selected for the journal was:
M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal, and R. Fagerberg
Journal of Bioinformatics and Computational Biology 6(1) 37–50 2008.
We present two algorithms for calculating the quartet distance between all pairs of trees in a set of binary evolutionary trees on a common set of species. The algorithms exploit common substructure among the trees to speed up the pairwise distance calculations thus performing significantly better on large sets of trees compared to performing distinct pairwise distance calculations, as we illustrate experimentally, where we see a speedup factor of around 130 in the best case.
The other paper was
M. Stissing, C.N.S. Pedersen, T. Mailund, G.S. Brodal, and R. Fagerberg
Proceedings of the Asia-Pacific Bioinformatics Conference (APBC) 2007, pp. 101–110, Series on Advances in Bioinformatics and Computational Biology, vol. 5, Imperial College Press.
We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded degree on a common set of n species. The previous best algorithm has running time O(d2n2) when considering trees, where no node is of more than degree d. The algorithm developed herein has running time O(d9n log n) which makes it the first algorithm for computing the quartet distance between non-binary trees which has a sub-quadratic worst case running time.
a much more technical paper that I won't describe in details here, but you can see my presentation from the conference (of both papers) here:
What is the quartet distance?
The quartet distance is a way of measuring the similarity (or distance) between two trees. There are various different measures, with different properties, and this is one of them.
The idea is that whenever you select a subset of the leaves of the tree, a subset of four leaves (a quartet) you can consider the tree you would get if you removed every other leaf from the tree. The topology of this sub-tree is the quartet topology, and the four possible quartet topologies are shown on the left. For two trees, and a given quartet, you can consider the quartet topologies in the two trees and compare them to see if they are the same or if they are different. If you do this for all quartets, and count the number of topologies that differ, what you get is the quartet distance between the two trees.
Calculating the quartet distance
There are n choose four quartets in a tree with n leaves, so if you enumerate them all from both trees and (somehow) figure out their topology in constant time, you can sort the list (using radix sort) and compare them in time proportional to the number of quartets, all in all giving you a O(n4) algorithm. Much too slow for any reasonably large n.
If we only consider binary trees, we can do better. (The case of non-binary trees is a bit tricky, but we have a couple of papers about that problem that I might come back to in a later post).
The trick is to build a table with an entry for each pair of sub-trees, one from either of the two input trees. The table will have n2 entries (there are n sub-trees in each of the two input trees) and will contain the number of leaves shared by the two sub-trees. Using a dynamic programming algorithm (and using that the trees are binary) it possible to fill out this table in time O(n2) time, and from this table we can count the number of shared quartets between the two input trees (see the expression in the figure on the left). We can then get the quartet distance by subtracting this number from the total number of quartets -- n choose four. Viola, an O(n2) algorithm.
Using a colouring scheme, a lots of tricks and clever data structures, Brodal et al. improved on this to get a O(n log n) algorithm, but there is a large overhead to each comparison in that, so the simpler algorithm is usually faster. Both algorithms are implemented in my software QDist, and a comparison of them, for growing n can be seen on the left. (Actually, the implementation in QDist runs in O(n log2 n), but the extra factor of log n doesn't pay off in any way in the real running time).
As you can see on the plot, you need n above 2000 to get any benefit of the more complex algorithm.
Calculating the quartet distance between all pairs of a set of trees
Now, if you have a set of k trees and you want to know the quartet distance between each pair, you would need O(k2n2) using the O(n2) algorithm.
This, however, ignores that you will know something about the trees you are comparing when you have seen them in comparisons with other trees from the set. Whenever they share sub-trees, the quartets from those sub-trees will always be shared, so running through the trees and counting them again is really a waste of effort.
To exploit this, all you need is a data structure that captures shared trees.
The trick we used was to root the trees (in an arbitrary leaf) after which we could merge them into a directed acyclic graph (DAG). It was then a small matter of modifying the two algorithms mentioned above to work on the DAG instead of trees.
The speedup depends a lot on how much the DAG compresses the original trees, but for trees that are reasonably similar (the case where calculating distances between them is most interesting), the compression turns out to be pretty good, and we could get very significant improvements. Improvements of more than 100 in running time, for the cases where we ran our experiments.
Stissing, M., Mailund, T., Pedersen, C.N., Brodal, G.S., Fagerberg, R. (2008). Computing the all-pairs quartet distance on a set of evolutionary trees. Journal of Bioinformatics and Computational Biology, 6(1), 37-50.