Final version of two BMC Bioinformatics papers

My two latest BMC Bioinformatics papers are now available in their final form (before they were just there as preliminary PDFs).
The first is on the file format that I’ve blogged about earlier, while the second is an algorithmic paper from APBC’09.
J. Nielsen and T. Mailund
BMC Bioinformatics 2008, 9:526; doi:10.1186/1471-2105-9-526
S. Besenbacher, C.N.S. Pedersen and T. Mailund
BMC Bioinformatics, 10(Suppl 1): S74 doi:10.1186/1471-2105-10-S1-S74


I feel both very stupid and very clever this morning.

In our CoalHMM methodology we look for patterns of incomplete lineage sorting from which we can infer population genetics parameters from ancestral species and information about speciation times (as opposed to divergence times).

For a while now, more than a year, we’ve been working on a different parameterisation of the model that directly lets us work on the parameters of interest.  In the former model (the PLoS Genetics paper I linked to above) we extract the parameters in post processing.  In the new parameterisation we work on them directly and can easier put constraint on them to match prior knowledge to the model, such as present day effective population sizes or speciation times from paleontological data.

One of our post docs has completely re-implemented the model plus made it computationally tractable to scan complete genomes with it.  It’s been a lot of work, but we now have 3-4 papers in the pipeline; a couple of which within weeks of submission.  Which is good, since we haven’t published anything on our work for this long while.

Anyway, there is an issue in the mathematical model that has bothered me the whole while.  I won’t go into details here until the paper is out and you can check it yourself, but it has to do with some of the assumptions made to make the math tractable.  After all, we are reducing the rather complex process of coalescence with recombination to a Markov model along the sequence, which it isn’t reall, so some simplifications are necessary.

The process is a Markov chain when running in time, but when running along the sequence it is not since the history of each nucleotide depends on all the others to some degree.  It is just that this relationship drops off with distance, so we can approximate it somewhat with a Markov chain.

One of the simplifications we have to model the process along the sequence as a Markov chain is to assume that the past histories of two nucleotides are independent after they have been unlinked by a recombination.  This isn’t unreasonable for short time scales, but for longer time scales it is not quite right either; after a recombination two neighbouring nucleotides can coalesce back together again and the result when seen today is exactly as if they never recombined in the first place.

This way we get an invisible occurence, and if we ignore those we will underestimate recombination rate (because we observe fewer recombinations than really occur).

We know that this happenes, of course, but we sort of hack our way out of it.  It does mean that we will still underestimate recombination for long branches in the species phylogeny (as opposed to the coalescence genealogy where we have other problems as well).

I did some simulations that showed that it wasn’t much of a problem on the parameters we use.  We do see a bias that we have to correct for, but based on the simulations we decided that it probably wasn’t because of this effect and left it at that.

It still bugs me, so I’ve been thinking about dealing with the problem but got stuck.

Until yesterday, where I discussed it with Carsten Wiuf who has done a lot of work on coalescence theory.  After I explained the problem to him, he asked if I wasn’t just modelling a continuous time Markov chain.  Doh!

Yes, that is exactly what it is, and I knew that. I have even used that approach in a similar project.  I just didn’t think about it in this case.

You can easily model the process as such, and though it gets complicated with more lineages it is not too hard to do (as long as you still keep the “Markov along the sequence” assumption).

With just this one insight, I’ve managed last night to reformulate the problem plus make the math much simpler.  There are still a lot of details to get right, of course, but I am very optimistic about it.

It is just too bad that it is this late in the process.  We don’t have time to re-do a years work and still contribute to the data analysis projects we are involved with, but I hope that the new approach can improve the CoalHMM methodology in the future.


Matrix exponentiation in R

Why is there no matrix exponentiation in R?

I need it all the time, for dealing with continuous time Markov chains, but it just isn’t there!

Well, it is probably found in a lot of packages (maybe even some I have installed), but I feel it should be there as a built in function.

I usually end up implementing it myself using eigen value decomposition

mexp <- function(Q,t) {
  ## calculates matrix exponential using a simple eigen-value decomposition
  x <- eigen(Q)
  U <- x$vectors
  L <- diag(exp(t*x$values))
  return (U %*% L %*% solve(U))

but that isn’t that numerically stable, so it is a sub-optimal solution.

I guess it isn’t trivial to get right, but damn I miss it…


Collaborating using Google Docs

When collaborating on writing a paper, I often use a version control system like subversion.

It lets you keep track of changes and also lets different people work on different sections in the paper concurrently, since the system can merge changes when they are committed.

This works great for plain text documents.  So great for LaTeX documents.  Not so well for Word documents.  Not that I write that many papers in Word, but I’m collaborating more and more with people who do, and that means sending copies of Word documents around by email (with the obvious problems of merging changes manually and such).

On a current project we are using Google Docs instead.

It is much better.  There is a master copy, so everyone works on the same document (no manual mergin), we can work on it concurrently, and it keeps track of changes.

Plus, as I just found out today, you can even work on documents when offline, if you have Google Gears installed.

It also works great for sharing small additional notes and thoughts on the project.  Normally I would write my thoughts on my blog, but for some projects I cannot really share all our work until we publish.  Having the ideas on Google Docs is a nice alternative then, especially since my collaborators can then add to them and make them even better.

There are only a few features I really miss.

Like a LaTeX plugin for writing math.  With just plain text, math is very hard to get right.  Yes, you can also share LaTeX documents, but you have to pull the code down in a text file to process it, and that kind of defeats the purpose.

I would also love to see proper handling of references.  Like an integration with Zotero or something like that.  That would be sweet.