## It is not all bad news

Okay, yeah, so I broke my iMac today, but there is also good news.  We just got another paper accepted, this time a conference paper at this year’s WABI.

Since it is on neighbour-joining, we weren’t that optimistic.  We’ve had problems publishing on this before, but this time it was very well received.

Accelerated neighbour-joining

M. Simonsen, T. Mailund and C.N.S. Pedersen

Abstract

The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The criteria for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criteria used for selecting the next pair to merge and we show empirically that the new search strategy is superior to other state-of-the-art neighbour- joining implementations.

It’s really Martin SImonsen’s work. He is a Mater’s student at BiRC and in one of our algorithmics courses the students were asked to implement neighbour-joining and try to speed it up.  Usually, they come up with some clever ideas, but they never before managed to beat my own version, QuickJoin.

Martin did come up with a faster approach.  Well, pretty close to, anyway.  With mine and Christian Storm’s help, we managed to fix a few things here and there, and speed his approach up to one that not only beats QuickJoin but also all other methods we could get our hands on.

QuickJoin uses a lot of tricks to speed up the search for nodes to join in the algorithm, but the data structures makes it slow on small data sets and also rather memory hungry.  Martin’s approach is much simpler and this helps it a lot in the small data sets and doesn’t seem to hurt it on the larger data sets.

As for QuickJoin, the trick is to only look at pairs of nodes that can potentially be joined and avoid looking at nodes that we can rule out as the next pair to be joined.

Instead of using quad-trees and various functions to rule out pairs, Simon simply sorts nodes in a way where most likely pairs are considered first, and such that we can recognize when new pairs will not be better than those we have already seen.  Read the actual paper — it is quite easy to understand the algorithm from there — if you want the details.

Simonsen, M., Mailund, T., Pedersen, C.N. Accelerated neighbour-joining. Proceedings of WABI 2008

## Death of a Mac…

My new iMac died on me today.  It crashed unexpectedly three times — and it has never crashed on me before — and after the last crash it didn’t come back up.  I cannot even turn it on anymore, so I guess it is a hardware problem.

I guess the warranty is still good — I bought it this year and there should be one year as I recall — so I’ll send it back to have it fixed.  It still annoys the hell out of me, ’cause I was just getting used to using a Mac instead of Linux.

Oh well, at least I get to test if my Time Machine  backup is working.

## Investigating Selection on Viruses: A Statistical Alignment Approach

Woohoo, we just got a paper accepted. Although it is at BMC Bioinformatics, it isn’t one of the papers I’ve been bitching about — this one we got very helpful reviews on.

It is work from when I was in Oxford. Saskia de Groot did analysis of virus genomes for her PhD (see papers here and here) but for viruses that are relatively far divergent, getting good alignments is a bit of a problem, so I suggested we took a statistical alignment approach to integrate over the uncertainty. So we got together with Gerton Lunter — who does work with this — and came up with this:

Investigating Selection on Viruses: A Statistical Alignment Approach

To appear in BMC Bioinformatics

Abstract

Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.

I’ll add a link to the paper as soon as it is up at the journal.

### What’s the problem?

We were trying to figure out selection in viruses where genes can have overlapping reading frames. In such cases, figuring out the neutral substitution rate is a bit of a problem, ’cause a synonymous substitution in one gene can be a non-synonymous substitution in an overlapping gene. Using dN/dS to figure out selection won’t work.

Instead we took and extended a method by Hein and Støvlbæk to explicitly model substitutions with selection in overlapping reading frames. We ought to consider the neighbour dependent substitutions you get when you are modelling codon changes (which again is complicated by overlapping genes), but methods for that can be very slow and won’t scale to whole genomes. Even virus genomes. Pedersen and Jensen tried that in an MCMC approach. Hobolth’s recent approach might have worked — it is the paper I blogged about a little back — but we didn’t know about it at the time.

Anyway, we essentially have a method for modelling the evolution over overlapping genes, but we cannot trust the alignment of viruses because they are too divergent, and if we infer an optimal alignment it is almost certainly wrong. An optimal alignment will often have too few substitutions compared to the real alignment.

### What did we do?

Since we cannot trust a single alignment, we instead sum over all possible alignments. Using hidden Markov models, we can do that, and at the same time calculate the probability of any single one of them.

We can then consider the substitutions in each of the alignments and weight the observed substitutions with the probability of the alignment. That way, the more likely alignment weigh in more when we consider substitutions than less likely.

It is similar to what Rahul Satija, Lior Pachter and Jotun Hein were doing for phylogenetic footprinting in the neighbour office at the samme time…

Using this approach, we show that we alleviate a systematic bias in using optimal alignments and get better estimates of selection factors.

We only handle pair-wise alignments but hack our way out of using more sequences to get better estimates still. It isn’t really the best approach and we should probably try a Gibbs sampler to handle multiple sequence alignments, but that is left for future work…

de Groot, S., Mailund, T., Lunter, G.A., Hein, J. (2008). Investigating Selection on Viruses: A Statistical Alignment Approach . BMC Bioinformatics

## Computer Science Day and yet another talk…

The coming Friday we have our annual Computer Science Day at Aarhus University.  This is essentially the same as the Biology Day a few weeks back where I also gave a talk, but now at the Dept. of Computer Science, of course.

The point of these days is to present our groups and our work to the other groups in the department (and to students and anyone interested in general).  So it is a short presentation of the group followed by a presentation of one or two of our recent projects.

Working in Bioinformatics, I guess both biology and computer science considers it part of their research, although really BiRC is a separate research center.  Just with people from both backgrounds (and others).

I am going to talk about the usual association mapping stuff I always talk about these days.

I’ve put a lot of pictures in my slides, so it is a large download, but you can get it as PDF, PowerPoint or Keynote if you want to download the presentation. A Falsh version is here:

At least this is the last talk in a while, so now I can finally get some actual work done again…