I’m starting a journal club for the computer scientists at BiRC. Since it is me starting it up, I have egoistically chosen a subject I personally would like to learn more about. Algorithms for Next Generation Sequencing assembly.
Now I don’t even know much about the “plain old” sequence assembly algorithms, so I want to start out with a few papers on that and then move to newer algorithms.
The only problem is that I don’t know which papers are important and which are not.
So if anyone have any ideas, please please let me know!
Over the last couple of years, I have done a little work on phylogeny inference, including a few papers on neighbour joining. One thing that consistently happens when you submit a paper on this — and I bring it up because I have just gotten back reviewer reports on such a paper — is that at least one reviewer will tell you that neighbour joining is not interesting and one should focus on maximum likelihood / Bayesian trees instead.
Sorry to say it, but people do use neighbour joining — I am willing to bet that there are ten times as many people using neighbour joining to infer trees than there are people using the statistical approaches — so algorithmical improvements here do matter!
The statistical approaches are usually more accurate, and they are better at capturing the uncertainty in the inference and such, but they are slow! Not slow as in, “I’ll go get a cup of coffee while the program finish”, but slow as in “I’ll look at the tree when I am back from my vacation”.
Sure, they are fast enough for tens of leaves, but some people infer trees with thousands of leaves. I recently got an email from a guy who tried with tens of thousands of leaves and ran out of memory using one of my tools — it needed more than 4G so it chocked on the problem (but a student in our lap has now come up with a new algorithm that is less memory expensive so that should solve that problem).
For large trees, forget about ML or Bayesian approaches. They do not scale (yet).
People do use neighbour joining, so shut up and review the paper for what it is, not what you want it to be. Grrr!
Today we published a journal version of a computer science paper from APBC 2006 (in a special issue volume of the best papers selected from the conference). We (the same group of authors) had two papers at the conference, both related to calculating something called the quartet distance between trees.
M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal, and R. Fagerberg
Journal of Bioinformatics and Computational Biology 6(1) 37–50 2008.
Abstract
We present two algorithms for calculating the quartet distance between all pairs of trees in a set of binary evolutionary trees on a common set of species. The algorithms exploit common substructure among the trees to speed up the pairwise distance calculations thus performing significantly better on large sets of trees compared to performing distinct pairwise distance calculations, as we illustrate experimentally, where we see a speedup factor of around 130 in the best case.
M. Stissing, C.N.S. Pedersen, T. Mailund, G.S. Brodal, and R. Fagerberg
Proceedings of the Asia-Pacific Bioinformatics Conference (APBC) 2007, pp. 101–110, Series on Advances in Bioinformatics and Computational Biology, vol. 5, Imperial College Press.
Abstract
We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded degree on a common set of n species. The previous best algorithm has running time O(d2n2) when considering trees, where no node is of more than degree d. The algorithm developed herein has running time O(d9n log n) which makes it the first algorithm for computing the quartet distance between non-binary trees which has a sub-quadratic worst case running time.
a much more technical paper that I won’t describe in details here, but you can see my presentation from the conference (of both papers) here:
What is the quartet distance?
The quartet distance is a way of measuring the similarity (or distance) between two trees. There are various different measures, with different properties, and this is one of them.
The idea is that whenever you select a subset of the leaves of the tree, a subset of four leaves (a quartet) you can consider the tree you would get if you removed every other leaf from the tree. The topology of this sub-tree is the quartet topology, and the four possible quartet topologies are shown on the left. For two trees, and a given quartet, you can consider the quartet topologies in the two trees and compare them to see if they are the same or if they are different. If you do this for all quartets, and count the number of topologies that differ, what you get is the quartet distance between the two trees.
Calculating the quartet distance
There are n choose four quartets in a tree with n leaves, so if you enumerate them all from both trees and (somehow) figure out their topology in constant time, you can sort the list (using radix sort) and compare them in time proportional to the number of quartets, all in all giving you a O(n4) algorithm. Much too slow for any reasonably large n.
If we only consider binary trees, we can do better. (The case of non-binary trees is a bit tricky, but we have a couple of papers about that problem that I might come back to in a later post).
The trick is to build a table with an entry for each pair of sub-trees, one from either of the two input trees. The table will have n2 entries (there are n sub-trees in each of the two input trees) and will contain the number of leaves shared by the two sub-trees. Using a dynamic programming algorithm (and using that the trees are binary) it possible to fill out this table in time O(n2) time, and from this table we can count the number of shared quartets between the two input trees (see the expression in the figure on the left). We can then get the quartet distance by subtracting this number from the total number of quartets — n choose four. Viola, an O(n2) algorithm.
Using a colouring scheme, a lots of tricks and clever data structures, Brodal et al. improved on this to get a O(n log n) algorithm, but there is a large overhead to each comparison in that, so the simpler algorithm is usually faster. Both algorithms are implemented in my software QDist, and a comparison of them, for growing n can be seen on the left. (Actually, the implementation in QDist runs in O(n log2n), but the extra factor of log n doesn’t pay off in any way in the real running time).
As you can see on the plot, you need n above 2000 to get any benefit of the more complex algorithm.
Calculating the quartet distance between all pairs of a set of trees
Now, if you have a set of k trees and you want to know the quartet distance between each pair, you would need O(k2n2) using the O(n2) algorithm.
This, however, ignores that you will know something about the trees you are comparing when you have seen them in comparisons with other trees from the set. Whenever they share sub-trees, the quartets from those sub-trees will always be shared, so running through the trees and counting them again is really a waste of effort.
To exploit this, all you need is a data structure that captures shared trees.
The trick we used was to root the trees (in an arbitrary leaf) after which we could merge them into a directed acyclic graph (DAG). It was then a small matter of modifying the two algorithms mentioned above to work on the DAG instead of trees.
The speedup depends a lot on how much the DAG compresses the original trees, but for trees that are reasonably similar (the case where calculating distances between them is most interesting), the compression turns out to be pretty good, and we could get very significant improvements. Improvements of more than 100 in running time, for the cases where we ran our experiments.
Stissing, M., Mailund, T., Pedersen, C.N., Brodal, G.S., Fagerberg, R. (2008). Computing the all-pairs quartet distance on a set of evolutionary trees. Journal of Bioinformatics and Computational Biology, 6(1), 37-50.