Eureka!

I feel both very stupid and very clever this morning.

In our CoalHMM methodology we look for patterns of incomplete lineage sorting from which we can infer population genetics parameters from ancestral species and information about speciation times (as opposed to divergence times).

For a while now, more than a year, we’ve been working on a different parameterisation of the model that directly lets us work on the parameters of interest.  In the former model (the PLoS Genetics paper I linked to above) we extract the parameters in post processing.  In the new parameterisation we work on them directly and can easier put constraint on them to match prior knowledge to the model, such as present day effective population sizes or speciation times from paleontological data.

One of our post docs has completely re-implemented the model plus made it computationally tractable to scan complete genomes with it.  It’s been a lot of work, but we now have 3-4 papers in the pipeline; a couple of which within weeks of submission.  Which is good, since we haven’t published anything on our work for this long while.

Anyway, there is an issue in the mathematical model that has bothered me the whole while.  I won’t go into details here until the paper is out and you can check it yourself, but it has to do with some of the assumptions made to make the math tractable.  After all, we are reducing the rather complex process of coalescence with recombination to a Markov model along the sequence, which it isn’t reall, so some simplifications are necessary.

The process is a Markov chain when running in time, but when running along the sequence it is not since the history of each nucleotide depends on all the others to some degree.  It is just that this relationship drops off with distance, so we can approximate it somewhat with a Markov chain.

One of the simplifications we have to model the process along the sequence as a Markov chain is to assume that the past histories of two nucleotides are independent after they have been unlinked by a recombination.  This isn’t unreasonable for short time scales, but for longer time scales it is not quite right either; after a recombination two neighbouring nucleotides can coalesce back together again and the result when seen today is exactly as if they never recombined in the first place.

This way we get an invisible occurence, and if we ignore those we will underestimate recombination rate (because we observe fewer recombinations than really occur).

We know that this happenes, of course, but we sort of hack our way out of it.  It does mean that we will still underestimate recombination for long branches in the species phylogeny (as opposed to the coalescence genealogy where we have other problems as well).

I did some simulations that showed that it wasn’t much of a problem on the parameters we use.  We do see a bias that we have to correct for, but based on the simulations we decided that it probably wasn’t because of this effect and left it at that.

It still bugs me, so I’ve been thinking about dealing with the problem but got stuck.

Until yesterday, where I discussed it with Carsten Wiuf who has done a lot of work on coalescence theory.  After I explained the problem to him, he asked if I wasn’t just modelling a continuous time Markov chain.  Doh!

Yes, that is exactly what it is, and I knew that. I have even used that approach in a similar project.  I just didn’t think about it in this case.

You can easily model the process as such, and though it gets complicated with more lineages it is not too hard to do (as long as you still keep the “Markov along the sequence” assumption).

With just this one insight, I’ve managed last night to reformulate the problem plus make the math much simpler.  There are still a lot of details to get right, of course, but I am very optimistic about it.

It is just too bad that it is this late in the process.  We don’t have time to re-do a years work and still contribute to the data analysis projects we are involved with, but I hope that the new approach can improve the CoalHMM methodology in the future.

30-49=-19