Posts Tagged ‘Research’

How deterministic is neighbour-joining anyway?

Wednesday, July 23rd, 2008

Whenever we try to publish an algorithm for speeding up neighbour-joining, at least one reviewer will ask us if we have checked that our new algorithm generates the same tree as the “canonical” neighbour-joining algorithm does.

This sounds reasonable, but it isn’t really.  The neighbour-joining method is not as deterministic as you’d think.

Well, it is a deterministic algorithm in the sense that the same input will always produce the same output, but equivalent input will not necessarily produce the same output.

The neighbour-joining algorithm

The neighbour-joining algorithm is phylogenetic inference method that takes as input a matrix of pairwise distances between taxa and outputs a tree connecting the taxa. It is a heuristic for producing “minimum evolution” trees, that is trees where the total branch length is as small as possible (while still, of course, reflecting the distances between the input taxa).  A heuristic is needed, since optimising the minimal evolution criteria is computationally intractable, and the neighbour-joining algorithm is a heuristic that strikes a good balance between computational efficiency and inference accuracy.

The heuristic is not a “heuristic optimisation algorithm” in the sense that simulated annealing or evolutionary algorithms is.  It is not randomised. It is a greedy algorithm, that in each step makes an optimal choice, which will hopefully eventually leads to a good solution (although locally optimial choices are not guaranteed to lead to a global optimal choice).

A problem with a greedy algorithm is that local choices can drastically change the final result.  Not all greedy algorithms, of course.  Some are guaranteed to reach a global optimum and if there is only a single global optimum then the local choices will not affect the final outcome (but if there is more than one global optimum the local choices will also influence the final result).

In the neighbour-joining algorithm, the greedy step consists of joining two clusters that minimise a certain criteria.  If two or more pairs minimise the criteria, a choice must be made, and that choice will affect the following steps.

Equivalent distance matrices  does not mean identical input

Unless the choice is resolved randomly, and I’m assuming that it isn’t, then the algorithm is deterministic.  It could, for example, always choose the first or last pair minimising the criteria.  How this choice should be resolved is not specified in the algorithm, so anything goes, really.

If the choice is resolved by picking the first or last pair, then exact form of the input matters.

The distance matrix neighbour-joining uses as input gives us all the pair-wise distances between our taxa, but to represent it we need to order the taxa to get row and column numbers.  This ordering is completely arbitrary, and any ordering gives us an equivalent distance matrix. Not an identical one, though.

If we resolve the local choice in a way that depend on the row and column order, then two equivalent input matrices might produce very different output trees.

An experiment

To see how much this matters in practice, I made a small experiment.  I took 37 distance matrices based on Pfam sequences, with between 100 and 200 taxa.  You can see the matrices here.  There are 39 of them, but I had to throw two of them away –  AIRC and Acyl_CoA_thio — since they had taxons with the same name, and with that I cannot compare the resulting trees.  It’s matrices I created for this paper, but I don’t remember the details… it is not important for the experiment, though.

RF distance for “identical” NJ treesFor each matrix I constructed ten equivalent distance matrices by randomly permuting the order of taxa. Then I used these ten matrices to construct ten neighbour-joining trees and finally computed the Robinson-Foulds between all pairs of these trees.

To see a plot of the result, click on the figure on the left.

As you can see, you can get quite different trees out of equivalent distance matrices and on realistic data too.

I didn’t check if the differences between the trees are correlated with how much support the splits have in the data.  Looking at bootstrap values for the trees might tell us something about that.  Anyway, that is not the take home message I want to give here.  That is that it doesn’t necessarily make sense to talk the neighbour-joining tree for a distance matrix.

Neighbour-joining is less deterministic than you might think!

How do you calibrate the molecular clock?

Thursday, May 29th, 2008

How do you calibrate the molecular clock — where you need a few known sequence divergence times — when you only know a few speciation times?

Yesterday at a meeting (I’m not sure I can tell you which meeting; I’m not sure how open it is supposed to be :-/) we discussed the divergence time of human-orangutan and human-macaque. We need the sequence divergence time to calibrate a CoalHMM model for figuring out some speciation and population genetics parameters of ancestral species.

No definitive answer came up at the meeting, but there was a short discussion by email after the meeting. This paper was sent around, where the divergence times were estimated to 25MYA and 13MYA, respectively, although the last of those numbers is actually the calibration point used in the analysis, so it is an assumption more than an estimate.

The problem is, the 13MYA used for the calibration is based on fossil evidence, and as far as I can see, that would make it an estimate for the speciation time between human and orangutan. We need the sequence divergence time. Speciation time and divergence time can vary with millions of years (if the effective population size is large enough).

If 13MYA is the divergence time between human and orangutan, we get a speciation time that is unrealistically recent.  If the divergence time is 18MYA instead, as we assumed in this paper, we would get a speciation time around 12MYA which would match the MBE paper.

But how do you figure out the divergence time needed to calibrate the clock?  Is there any way to get it, rather than the speciation time, from fossil evidence?

For our purposes, I suppose we can just as well work with speciation times for our calibration, but not everyone is using CoalHMMs for their analysis, so how do you deal with this problem?

New “research” web-pages at BiRC

Tuesday, May 6th, 2008

Yesterday I updated the “Research” page at the BiRC website. We have been wanting to update the description of our research for a while. It is after all an important part of the website of a research institution, and ours was hopelessly outdated and a mess of small and large projects, some we were actively working on and some that were little more than areas we were interested in.

A year ago, we had a big meeting and discussed, all of BiRC, for an entire afternoon how to update the pages. It really goes without saying that nothing could be decided in such a large group. Instead, around Christmas, Storm, our director, called a meeting inviting those interested in updating our web-pages and willing to put in the work to implement the changes. Requiring that people should back up their ideas with actual work reduced the group significantly. We were only three, myself, Enette and Storm.

I got the task of coming up with a draft of the research pages and present it to the rest of BiRC.

What is a research project and what goes on a research center’s homepage?

The first thing I had to decide was what I wanted on the page. The form of the content is usually easier to figure out once you know what the content is.

On a webpage describing the research going on at BiRC, I wanted it to be the actual research going on at BiRC. This sounds obvious, but I have seen many research institutions listing their research interests with lists I find completely unbelievable. You might be interested in a lot of different problems, but there is a limit to how many fields you can make an active contribution to. I find it dishonest if you claim to be doing research in field where, really, all you have done is publish a paper related to the field five years ago, and you occasionally read a paper about the topic. Too often, that is what I have noticed.

There is nothing wrong with focusing on a few research areas at a research institution. Sure, it is fun to do a few projects outside your main research area from time to time, but if people read the web pages to find out what you are doing at the institution, then they should find the area where you are spending the majority of your time, not something you think about every second year. If a post doc wants to come and work at BiRC, he should know what kind of work we actually do, not what kind of work we like to read about.

Anyway, this is turning into a bit of a rant, but that is the kind of thoughts I was having at the time.

So, I wanted the page to contain our active research areas, and I wanted to back up the “we are actively working on this stuff” claim we are implicitly making when describing research on our web pages.

How do you prove that you are active? You put up the money or the work! Ideally, you want to prove that you are doing research in an area, by showing that you are actively publishing in it. If you do not publish, are you really doing research there? If you are, is it any good? I wanted each project to be able to show at least two-three papers a year to be considered active. I made an exception if there was funding for a project, but it hadn’t really produced any results yet… I am still not sure this is a good idea, but my arm was twisted a bit to allow for this…

Designing the web pages

After these considerations, I started to design the pages. I made an overview page, where people can scan through all the research areas at BiRC very quickly, and with links to sub-pages — one for each research area — that can contain more information.
On the main page I wanted a short description of each research area and a list of the most recent papers published in it (at least three, is my initial choice, but more if there are more papers published within the last year). This works as a sort of sanity check for whether we are describing an active research area or not, and also tells the world what kind of papers we publish in each area.

Putting it together

This design I presented to BiRC and no one objected. Good. Then I asked people to send me the projects (description and publication lists) they wanted on the page. Deadline March 15, which would give me time to set up the sub-pages and such, and then the new page could be put on the on May 1.

Naturally, but March 16 I hadn’t received anything and by May 1 I had a few half-baked descriptions.  Deadlines are rarely taken that serious when they are internal deadlines like this.  I am pretty bad at meeting them myself, so I cannot really complain.

With a lot of threats I got the material I needed over the weekend — well, enough to make descriptions of the main research areas at BiRC — and I put the pages together.  I’m personally maintaining more of them than I had planned.  I really only wanted to be responsible for the Association Mapping page.  This is my own main area, and the one where I consider myself the local expert.  I ended up maintaining virtually all the pages where I am contributing anything.  I plan to push the responsibility to someone else when I get the chance, but for now this is how it is.

Software roadmap

Sunday, December 30th, 2007

I’ve put a roadmap for our association mapping software up on my BiRC homepage. It is a bit of a mix of my old homepage design and some php to synchronize it with our bug database. I really don’t know php, so I’m not sure it is an ideal design. It is only php because we use Mantis for our bug database. I really don’t want to write my own bug database, so that is the way it is.