I feel both very stupid and very clever this morning.

In our CoalHMM methodology we look for patterns of incomplete lineage sorting from which we can infer population genetics parameters from ancestral species and information about speciation times (as opposed to divergence times).

For a while now, more than a year, we’ve been working on a different parameterisation of the model that directly lets us work on the parameters of interest. In the former model (the PLoS Genetics paper I linked to above) we extract the parameters in post processing. In the new parameterisation we work on them directly and can easier put constraint on them to match prior knowledge to the model, such as present day effective population sizes or speciation times from paleontological data.

One of our post docs has completely re-implemented the model plus made it computationally tractable to scan complete genomes with it. It’s been a lot of work, but we now have 3-4 papers in the pipeline; a couple of which within weeks of submission. Which is good, since we haven’t published anything on our work for this long while.

Anyway, there is an issue in the mathematical model that has bothered me the whole while. I won’t go into details here until the paper is out and you can check it yourself, but it has to do with some of the assumptions made to make the math tractable. After all, we are reducing the rather complex process of coalescence with recombination to a Markov model along the sequence, which it isn’t reall, so some simplifications are necessary.

The process is a Markov chain when running in time, but when running along the sequence it is not since the history of each nucleotide depends on all the others to some degree. It is just that this relationship drops off with distance, so we can approximate it somewhat with a Markov chain.

One of the simplifications we have to model the process along the sequence as a Markov chain is to assume that the past histories of two nucleotides are independent after they have been unlinked by a recombination. This isn’t unreasonable for short time scales, but for longer time scales it is not quite right either; after a recombination two neighbouring nucleotides can coalesce back together again and the result when seen today is exactly as if they never recombined in the first place.

This way we get an invisible occurence, and if we ignore those we will underestimate recombination rate (because we observe fewer recombinations than really occur).

We know that this happenes, of course, but we sort of hack our way out of it. It *does* mean that we will still underestimate recombination for long branches in the species phylogeny (as opposed to the coalescence genealogy where we have other problems as well).

I did some simulations that showed that it wasn’t much of a problem on the parameters we use. We do see a bias that we have to correct for, but based on the simulations we decided that it probably wasn’t because of this effect and left it at that.

It still bugs me, so I’ve been thinking about dealing with the problem but got stuck.

Until yesterday, where I discussed it with Carsten Wiuf who has done a lot of work on coalescence theory. After I explained the problem to him, he asked if I wasn’t just modelling a continuous time Markov chain. Doh!

Yes, that is exactly what it is, and I *knew* that. I have even used that approach in a similar project. I just didn’t think about it in this case.

You can easily model the process as such, and though it gets complicated with more lineages it is not too hard to do (as long as you still keep the “Markov along the sequence” assumption).

With just this one insight, I’ve managed last night to reformulate the problem plus make the math much simpler. There are still a lot of details to get right, of course, but I am very optimistic about it.

It is just too bad that it is this late in the process. We don’t have time to re-do a years work and still contribute to the data analysis projects we are involved with, but I hope that the new approach can improve the CoalHMM methodology in the future.

—

30-49=-19

So you have two markov chains?

One along a sequence and one for the coalescence?

Yes, exactly. The one for the coalescence is actually a Markov chain. The past only depends on the present not on what happens later than the present. Yeah, that sounds weird, but that is how coalescence theory works :)

The one along the sequence is an approximation to dealing with the ancestral recombination graph. This is not really a Markov chain, ’cause the nucleotide to the right of your current position will have a past that depends on all the nucleotides to the left of you. The two-nucleotide model I described above really tells you why: The number of lineages (carrying ancestral material) at any point in time depends on how many have coalesced, and that means the entire ancestral recombination graph.

but as you know from LD, the dependency shrinks with distance, so you can approximate it with a Markov chain.

The idea I wrote about in this post improves on the approximation along the sequence, by more accurately capturing the dependency of the right nucleotide of a pair on the left of the pair. It is still a crude approximation on the true process, but it deals better with “back coalescence” events.

There is actually another Markov chain running in the setup. Mutations are also put on the local genealogies using a continuous time Markov chain. The coalescence/recombination chain should really handle the mutations as well, but it is easier to deal with them separately.

That gives us another bias, though, and I haven’t figured out how to deal with that yet. Asger did some calculations that showed that the effect was negligible but we have now pretty much convinced ourselves that they are not and introduce a bias in how we estimate speciation times. We correct that one the same way we correct the bias on recombination (using a Monte Carlo strategy), but in future work we hope to get the calculations right.

I wasn’t optimistic about that last week. It is dealing with a number of multiple integrals. But perhaps the new formulation as Markov chains for the transitions might also solve this problem. I haven’t figured that part out yet, but I’m working on it :)