I’ve put a roadmap for our association mapping software up on my BiRC homepage. It is a bit of a mix of my old homepage design and some php to synchronize it with our bug database. I really don’t know php, so I’m not sure it is an ideal design. It is only php because we use Mantis for our bug database. I really don’t want to write my own bug database, so that is the way it is.
We just published a new paper. The paper concerns a problem that Jotun’s been working on since he and Carsten Wiuf published some results on the distribution of ancestral material of a present day sample back in time. Jo Davies worked on it as a summer student project years back, and last year we returned to the problem when Frank Simancik did a summer student project.
On Recombination Induced Multiple and Simultaneous Coalescent Events
J. Davies, F. Simancik, R. Lyngsø, T. Mailund, and J. Hein
Genetics 177: 2151–2160 (2007). doi:10.1534/genetics.107.071126
Abstract: Coalescent Theory is almost ubiquitous in contemporary molecular population genetics. Inherent in most applications is a continuous time approximation that assumes sample size is small relative to the actual population size. This assumption in effect precludes simultaneous and multiple coalescent events, which can constitute an arbitrarily large component when sample size is sufficiently large. In most situations this is justifiably ignored as a large sample size will only have few ancestors a couple of generations back and then the assumption is valid. However, in tracing the evolutionary history of large chromosomal segments, a large recombination rate will consistently keep the number of ancestors large such that multiple and simultaneous coalescent events cannot be ignored. This can create a major disparity between discrete time and continuous time models and we here show its importance illustrated with parameters typical of the human genome. The presence of gene conversion only aggravates its importance. This could seriously undermine the application of coalescent theory to complete genomes. However, it can be shown that multiple and simultaneous coalescent events influences global quantities, such as total number of ancestors, but has negligible effect on local quantities, such as linkage disequilibrium or similarities of close local trees. Reassuringly the majority of applications of coalescent models with recombination are based on local quantities for purposes such as association mapping.
What is the problem?
If you sample DNA from present day individuals and then trace its history back in time you will see coalescent events and recombination events. Coalescent events occur when two lines, as we trace them back in time, join (or coalesce). Considered forward in time, this corresponds to a cell dividing and eventually producing two siblings who are both ancestors of individuals in our present day sample. Recombination events occur when a single line, moving back in time, splits into two. Considered forward in time, this corresponds to two lines combining in a chromosomal recombination.
This process can be modelled mathematically, and the theory for this is called coalescent theory. A nice introduction can be found in the book by Jotun, Mikkel and Carsten: Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory ISBN 978-0198529965.
The mathematical process is, of course, an approximation to the real process. The real process is probably too complex to model mathematically, and if it is possible to model it, the mathematics would be too complex to give any insight into the process in any case.
However, one approximation made in the mathematical model is potentially problematic. The model assumes that coalescent events and recombination events occur so rarely, on the time scale considered, that two essentially never occur at the same time.
For a small sample in a large population, this assumption is justified. The probability of multiple events at the same time is essentially zero. When the sample size is on the same order as the population size, however, the assumption is no longer valid.
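To get a feeling for how quickly the assumption fails, here is a small sketch in Python. It assumes a haploid Wright-Fisher model, where each of n lines picks its parent uniformly at random among N individuals in the previous generation. That model, and the function name `coalescence_probs`, are assumptions made for illustration and are not taken from the paper. The function computes the probability of no coalescence, of exactly one pairwise coalescence, and of anything more than that (multiple or simultaneous events) in a single generation.

```python
from math import comb


def coalescence_probs(n, N):
    """Return (p_none, p_one, p_multi) for one generation back in time.

    Assumes a haploid Wright-Fisher model (an illustrative assumption):
    each of the n lines picks one of N parents uniformly at random.
    """
    # No coalescence: all n lines pick distinct parents.
    p_none = 1.0
    for i in range(n):
        p_none *= (N - i) / N

    # Exactly one pairwise coalescence: choose the pair that shares a
    # parent, then the n - 1 resulting lines all have distinct parents.
    p_distinct = 1.0
    for i in range(n - 1):
        p_distinct *= (N - i) / N
    p_one = comb(n, 2) * p_distinct / N

    # Everything else: several pairs coalescing in the same generation,
    # or three or more lines sharing one parent.
    p_multi = 1.0 - p_none - p_one
    return p_none, p_one, p_multi


if __name__ == "__main__":
    for n in (10, 100, 1000, 5000):
        p_none, p_one, p_multi = coalescence_probs(n, 10_000)
        print(n, p_none, p_one, p_multi)
```

With N = 10,000 the last probability is negligible for n = 10 but essentially one for n = 5,000, which is the regime where the continuous time approximation is no longer justified.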
This hasn’t really been a major issue, since even for large samples, the time it takes for a large sample to coalesce into a small sample is very short compared to the time it takes for the entire process to run.
That is, if we ignore recombination events!
Recombination events produce new lines, as we move back in time, just as coalescent events remove lines. If we consider ancestral material, which is the DNA we sampled at present day, the coalescent events will eventually win and reduce the material such that each nucleotide is only found in a single line. If we also consider non-ancestral material, DNA that belonged to an ancestor of our sample but that did not get passed on to the present day sample, then we reach an equilibrium between coalescent events and recombination events that keeps several lines moving back in time.
It is this situation that Carsten and Jotun considered in their paper
The ancestry of a sample of sequences subject to recombination [pdf]
C. Wiuf and J. Hein
Genetics 151: 1217-1228 (1999).
and as it turns out, it is possible for the number of lines to remain large, compared to the population size, if only the recombination rate is sufficiently high. In fact, the number of lines can be larger than the population size!
This sounds like a major problem with the theory, but it isn’t really. It is just applying the theory to a part of parameter space where essential assumptions are no longer valid.
If we consider single genes, as the theory intended, the recombination rate is low and there is no problem with the theory. If we start considering entire chromosomes, however, we enter the parameter space where the theory breaks down!
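The equilibrium mentioned above can be sketched with a simple birth-death calculation. Assume the usual continuous time scaling: time is measured in units of 2N generations, k lines coalesce at rate k(k-1)/2, and each line recombines at rate ρ/2 with ρ = 4Nr, where r is the expected number of crossovers per generation on the segment. This is the textbook scaling and not necessarily the exact setup of either paper. Under these assumptions detailed balance gives a zero-truncated Poisson distribution for the number of lines, with mean ρ/(1 - exp(-ρ)), which is roughly ρ when ρ is large. The function names and the parameter values in the example are hypothetical.

```python
import math


def equilibrium_mean_lines(rho):
    """Mean number of lines at equilibrium: rho / (1 - exp(-rho)).

    Assumes coalescence rate k(k-1)/2 and recombination rate k*rho/2
    for k lines (continuous time approximation).
    """
    return rho / -math.expm1(-rho)


def stationary_distribution(rho, k_max):
    """Stationary distribution for k = 1..k_max lines, via detailed balance.

    Detailed balance gives pi[k+1] = pi[k] * rho / (k + 1).
    Entry i of the returned list is the probability of k = i + 1 lines.
    Truncating at k_max is only sensible when k_max is well above rho.
    """
    weights = [1.0]  # unnormalised weight for k = 1
    for k in range(1, k_max):
        weights.append(weights[-1] * rho / (k + 1))
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    N = 10_000   # diploid population size (hypothetical value)
    r = 1.0      # about one Morgan, i.e. one crossover per generation
    rho = 4 * N * r
    print("mean lines:", equilibrium_mean_lines(rho), "chromosomes:", 2 * N)
```

With these hypothetical values ρ = 40,000, so the approximation predicts about twice as many lines as the 2N = 20,000 chromosomes in the population.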
What is the result?
We considered this problem and simulated the process both when allowing multiple events (using a simpler, but computationally slower, method) and when assuming that they do not occur.
The model that allows multiple events changes the equilibrium behaviour of the system. The number of lines, as we trace them back in time, changes, and we no longer end up in the strange situation of having more lines than individuals in the population.
Local properties, however, such as the phylogenies at individual nucleotides and the linkage disequilibrium (statistical relatedness of nucleotides), are not affected by allowing multiple events. This is the good news. It means that the models we have used when developing association mapping tools are just as valid as they have always been.
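Why the number of lines stays bounded when multiple events are allowed can be illustrated with a toy simulation. This is a minimal sketch under strong simplifying assumptions, and not the method used in the paper: lines are just counted (no tracking of which material they carry), each line splits by recombination with probability `r` per generation, and every resulting line then picks a parent chromosome uniformly at random among `n_chrom` chromosomes. Lines that pick the same parent coalesce, however many there are. All names and parameter values are hypothetical.

```python
import random


def step(k, n_chrom, r, rng):
    """Move k lines one generation back in time (toy model)."""
    # Recombination: each line splits into two with probability r.
    k_after = k + sum(1 for _ in range(k) if rng.random() < r)
    # Coalescence: every line picks a parent; lines sharing a parent merge,
    # so any number of simultaneous and multiple mergers is allowed.
    parents = {rng.randrange(n_chrom) for _ in range(k_after)}
    return len(parents)


def simulate(k0, n_chrom, r, generations, seed=0):
    """Return the number of lines in each generation, starting from k0."""
    rng = random.Random(seed)
    k = k0
    trace = [k]
    for _ in range(generations):
        k = step(k, n_chrom, r, rng)
        trace.append(k)
    return trace


if __name__ == "__main__":
    n_chrom, r = 200, 0.9
    trace = simulate(10, n_chrom, r, 500)
    print("max lines:", max(trace), "of", n_chrom, "chromosomes")
    # Continuous time approximation with the same per-generation rates:
    print("approximate equilibrium mean:", 2 * n_chrom * r)
```

In this toy model the number of lines can never exceed `n_chrom`, since lines are counted as distinct parents. A continuous time approximation with the same per-generation rates has an equilibrium mean of about `2 * n_chrom * r`, which is larger than `n_chrom` whenever `r` is above 0.5.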
I just got back from Christmas celebrations with my family. I didn’t bring my laptop this year, and it was great getting away from work for a couple of days. Sure, I brought a few books, but reading up on numerical methods for ODEs can hardly be called real work — it belongs in the relaxation category.
In any case, it is very different from the last two Christmases, where I’ve had to prepare tutorials for PSB. Going to Hawaii just after New Year is great and all, but it sort of ruins the holiday that I have to work through it.
Anyway, now I am back in Aarhus and will head off to the office in a little while. I’ll only work a few hours, though, and not too seriously. I have a few pet projects that I haven’t had time to look at before now. The days between Christmas and New Year’s Eve are perfect for those.
Surfing around before bedtime I stumbled upon this blog: Math for Programmers.
I liked it a lot and agree on most points in it.
Then, I started thinking about my own introduction to statistics. I had the mandatory classes on probability and statistics while doing my comp. science degree and pretty much hated the stats. part of it (less so the probability, ’cause I have a soft spot for pure math). It wasn’t until I really needed to know stats. for my own work that I started getting into it, and then I found that it was actually pretty easy. I guess most things get a lot easier once you are motivated for it…
A very cool and very thought provoking video: