New paper on computing the quartet distance

We just got another paper accepted:

A quadratic time algorithm for computing the quartet distance between two general trees T. Mailund, J. Nielsen, and C.N.S. Pedersen

To be presented at the Workshop on Bioinformatics Algorithms at IJCBS 2009 August 2009, Shanghai, China


We derive a quadratic time and space algorithm for computing the quartet distance between a pair of general trees, i.e. trees where inner nodes can have any degree ≥ 3. The time and space complexity of our algorithm is quadratic in the number of leaves and does not depend on the degree of the inner nodes. This makes it the fastest algorithm for computing the quartet distance between general trees independent of the degree of the inner nodes.

The quartet distance

The quartet distance is a metric for comparing two phylogenetic trees over the same set of taxa.

It works as follows: For each sub set of taxa of size four (a quartet), $$\{a,b,c,d\}$$ you consider the tree topology in the two trees.  The tree topology for $$\{a,b,c,d\}$$ can be one of the four shown below.  Now, for each such subset you count how often the two phylogenetic trees agree on the quartet topology and how often they do not, and the number of topologies where they differ is the quartet distance between the two trees.

There are $$n \choose 4$$ quartets in a tree with $$n$$ leaves, so explicitly enumerating the topologies and comparing them this way would take at least time in $$O(n^4)$$ but through various tricks this can be reduced.

Algorithms for computing the quartet distance

The first algorithms for computing the quartet distance focused on binary trees.  For binary trees, the star topologi, topology (d) above, is not possible, and that simplifies the problem a bit.

The fastest algorithm for binary trees so far is time $$O(n\log n)$$.  Incidentally, my very first bioinformatics paper was an application note about an implementation of pretty much that algorithm.  Not quite, my implementation was of an $$O(n\log^2n)$$ algorithm, but the last $$\log n$$ comes at a considerable code complexity, so I didn’t bother with it.

People didn’t seem to bother with non-binary trees for some reason.  I guess we are just used to trees being binary in bioinformatics that we don’t think about it much (even when we should, and when we shouldn’t resolve inner nodes when we are really not sure how to resolve them).

Anyway, here at BiRC we got interested in it because we had three Master’s students who wanted to do some algorithmics in bioinformatics and we asked them to look at the quartet distance and see what they could do we general trees.

That turned out extremely successfully.  They worked very well together, and in short time produced several algorithms and got them written down in papers:

The first paper describes two algorithms for computing the quartet distance between general degree trees.  One independent of the maximal degree, running in time $$O(n^3)$$, the other dependant on it, running in time $$O(n^2d^2)$$ where $$d$$ is the maximal degree.  Both algorithms are dynamic programming algorithms.

The second paper extends the $$O(n\log{}n)$$ algorithm for binary trees to general trees, but at a running time of $$O(d^9n\log n)$$ — so sub quadratic in $$n$$ but with a rather high complexity in $$d^9$$ (to say nothing of the “mental complexity” of the algorithm.  I would certainly not like to have to implement it).

The third paper uses a lot of combinatorical tricks to reduce the $$O(n^2d^2)$$ algorithm from the first paper to $$O(n^2d)$$.

Our new algorithm

Our new algorithm isn’t really that new at all.  It is an algorithm I came up with on the train back from a workshop where I had presented an earlier version of the algorithms in the first paper above.

I couldn’t get the complexity analysis right, though, and concluded that it was probably $$O(n^2d^2)$$ so no improvement on what we already had.  Thus I just put it in a drawer and forgot about it.

At least until Christian forwarded a call-for-papers to me and asked me if I had any old results lying around.  I digged through my old abandoned ideas and found this one.  Since it is some rather simple combinatorics, similar to the dynamic programming algorithms we had in the very first paper, it might be possible come up with a simpler $$O(n^2d)$$, algorithm than the one we had, I thought, or maybe just consider a sub-set of trees, like all d-ary (not necessarily binary) trees.

I didn’t have much time to put into it, but Jesper, a PhD student here at BiRC did, so I gave it to him.  He basically just re-did the complexity analysis and concluded that the running time was $$O(n^2)$$ so without further ado we sent in the paper, and now we (well, most likely Jesper) will present it in Shanghai this summer.


Coolest paper in a while, and I feel left out…

This paper seems to be the coolest population genetics out in a while:

I have downloaded it, but I just haven’t had time to read it yet, and probably won’t have time to read it until the weekend.  I’m swamped with work, and I have three journal clubs Thursday and Friday, so the reading I do have time for goes to the papers for those.

I just can’t believe that I am postponing reading this paper to read about fungi genomics, of all things…


Figures and Word

I’m used to writing my papers in (La)TeX.  In mathematics and computer science, that is the tradition, and quite frankly it is the only choice if you include a lot of math in your documents.

Recently, though, I’m working with people used to writing in Word, and rather than forcing them to learn LaTeX I’ve decided to use Word as well.  Only when the math content is low, though.  Writing math in Word is too painful to even contemplate.

Anyway, for simple text editing I have no complains with Word.  It does what it is supposed to, and I find the grammar checker a great help (I do have a grammar checker in the editor I use for LaTeX, but it isn’t quite as good).

I run into problems when I try to do anything but simple text editing.

Like today, I am trying to insert a figure in a document.  I have a plot in PDF and I want to insert that together with a caption.

I can insert the figure and write the caption in a text box.  No problem.  It looks something like this:

Now, to make the editing easier, I want to group the figure with the caption, so I can move them around as a unit if I need to move them later on.

If I do that, it looks like this:

WTF? Why did it suddenly turn my (vector graphic) PDF into (a very pixilated) bitmap image?

This is so not what I expected, and certainly not what I wanted…

Rapid NJ

Our new neighbour-joining paper (I wrote about it here, but it was called “Accelerated Neighbour-Joining” then) just came online:

Rapid Neighbour-Joining

Martin Simonsen, Thomas Mailund  and Christian N. S. Pedersen

The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The criterion for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criteria used for selecting the next pair to merge and we show empirically that the new search strategy is superior to other state-of-the-art neighbour-joining implementations.

I have put the source code plus binaries for Linux (Debian and RedHat), Windows and Mac OS X (i386) on this web page.

Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genotype Data

Yesterday, Bioinformatics accepted this paper of mine:

Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genotype Data

Z. Ding, T. Mailund and Y.S. Song


Motivation: Recent advances in genotyping technology has made data acquisition for whole-genome association study cost effective, and a current active area of research is developing efficient methods to analyze such large-scale data sets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this paper, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological data sets.

It is an extension of our Blossoc method.  Blossoc is an association mapping method that constructs perfect phylogenies (trees compatible with the genotypes) along the genome and tests if these trees tend to cluster cases and controls.  If they do, we see it as evidence for local genotypes associated with the disease.

In the original Blossoc paper we used a very simple algorithm for inferring the local phylogenies. This algorithm only works for phased data, however, and real data is never phased.  The phase has to be inferred, and that is more time consuming than the actual association test.

In the new paper we use an algorithm that can construct the phylogenies directly from unphased data.

One problem, that we haven’t quite solved yet, is that we are working with perfect phylogenies — trees that exactly matches the genotypes — and such trees are only possible to construct for very short intervals in genotype data.  At least with genome wide association study data with thousands of individuals.

We still have some hacks to get around that, but those only work with phased data again, so even with our new method we fall back on inferring the phased information — although only locally — before we build some of the trees.

This is, by far, the most time consuming part of the algorithm right now, so I hope in the future we can improve on this.  Maybe construct “nearly perfect phylogenies” or something from unphased data.  I’m not sure how to get there, but I think it is worth researching…