These days I'm working a lot on my HapCluster association mapping tool. I'm hoping to make a version that scales to genome wide studies.
My "back of an envelope" calculations based on an older version of HapCluster told me it should be possible to analyse a whole genome data set in about a CPU week, which is roughly the time it takes to phase it using the BEAGLE tool, which is the fastest I know about. Since HapCluster doesn't need phasing, it would then be just as fast as analysing the data with Blossoc after phasing (which is much faster than using Blossoc on data that isn't phased).
Unfortunately, some recent changes to HapCluster have made it much slower...
How did I get started with HapCluster?
HapCluster is a method developed by Ed Waldron, David Balding and John Whittaker, all from Imperial College, London (although Ed is now working for a pharma company, as far as I know). The method is described in the paper
Fine Mapping of Disease Genes via Haplotype Clustering.
E.R.B. Waldron, J.C. Whittaker and D.J. Balding. Genetic Epidemiology 30:170–179 (2006).
I got involved with the method after attending a talk by Ed in Oxford (and discussing it with him in the pub afterwards -- take that, recent Beer study!). He had implemented the method in R, and I suggested re-implementing it in C++, mainly because I wanted a fair comparison with Blossoc: Blossoc was slightly better at localising disease variants, but a runtime comparison against an R implementation would be a bit unfair.
So anyway, he sent me the R code and I translated it into C++, but along the way I got a few ideas for further improvements and started working on those. Later I met up with Ed and David a few times to continue the work. I'm going to visit David for about a week next month as well.
How does HapCluster work?
The method is quite simple, which is probably also why it is reasonably fast. It clusters sequences according to their local similarity around a putative disease locus, and then assigns a likelihood to the clustering based on how cases and controls are distributed among the clusters.
The disease locus, the clustering, and everything else are explored using a Markov chain Monte Carlo (MCMC) method. From this, the posterior probability for the location of the disease locus can be extracted, and association can be tested through the Bayes factor comparing the HapCluster model with a null model that assumes cases and controls are distributed independently of the genotype.
It is the latter, the Bayes factor, that I plan to use in a genome-wide setting. I will run the method in sliding windows along the genome and extract the Bayes factor for each window.
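As a sketch of what a Bayes factor for a clustering could look like, here is a toy version assuming, purely for illustration, a beta-binomial model for the case counts in each cluster; HapCluster's actual likelihood differs, and the cluster assignments here are taken as given rather than sampled by the MCMC:

```python
from math import lgamma

def log_beta(a, b):
    # Logarithm of the Beta function.
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(cases, totals, a=1.0, b=1.0):
    # Log marginal likelihood of the per-cluster case counts under a
    # Beta(a, b) prior on each cluster's case frequency; binomial
    # coefficients are omitted since they cancel in the Bayes factor.
    return sum(log_beta(a + c, b + n - c) - log_beta(a, b)
               for c, n in zip(cases, totals))

def log_bayes_factor(cases, totals):
    # Alternative model: each cluster has its own case frequency.
    log_alt = log_marginal(cases, totals)
    # Null model: one shared case frequency, i.e. phenotype
    # independent of the clustering.
    log_null = log_marginal([sum(cases)], [sum(totals)])
    return log_alt - log_null

# Cases concentrated in one cluster give a positive log Bayes factor:
print(log_bayes_factor([9, 1], [10, 10]))  # positive
```

In the genome-wide setting, the idea is simply to report this kind of quantity once per window.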
How did I slow it down?
With the help of two students, I refactored the code to make the method easier to extend. I had made an unholy mess of template code to mix different likelihood methods and different types of data without paying a runtime penalty, but that code turned out to be too hard to modify, so we rewrote it to use virtual function dispatch (taking care to limit how often we call virtual functions, so we wouldn't slow the code down too much).
It did slow the code down a little, but not enough to be a major problem. But then I added code to automatically detect convergence of the Markov chain and to adjust the number of iterations so the chain always yields the same effective sample size. This is necessary if I want to deal with genome-wide data, where I cannot manually examine the output of each run, but now I have a time problem: on some data sets, it takes ages for the diagnostics to accept that the chain has converged.
How do I speed it up?
It seems to me that I have two ways to speed up the method: Make each iteration in the Markov chain faster, or take fewer steps.
Speeding up each step is not as easy as it sounds. Around 80% of the total runtime is spent in a single function: the one that compares the local sequences. If I could speed that function up I would gain a lot, but the code there is pretty tight, so I don't see any obvious way to do it. I am playing with the idea of rewriting the data structure that holds the sequences, so I can exploit the parallel (SIMD) instructions on newer chips, but that is a major change to the tool that might not gain me much in the long run, so I will save that idea for a thesis student some day and not bother with it myself :)
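The kind of data-structure rewrite I have in mind can be illustrated with bit packing: store each local sequence as machine words, one bit per biallelic marker, so comparing two sequences becomes an XOR plus a population count instead of a marker-by-marker loop. A sketch of the general trick in Python (not the actual HapCluster data structure):

```python
def pack(alleles):
    # Pack a 0/1 allele sequence into one integer, one bit per marker.
    word = 0
    for i, a in enumerate(alleles):
        word |= a << i
    return word

def mismatches(word_a, word_b):
    # Differing markers are exactly the set bits of the XOR; counting
    # them is a population count over the packed words.
    return bin(word_a ^ word_b).count("1")

a = pack([0, 1, 1, 0, 1, 0, 0, 1])
b = pack([0, 1, 0, 0, 1, 1, 0, 1])
print(mismatches(a, b))  # 2
```

In C++, the same layout would let the compiler use hardware popcount and vector instructions over whole words of markers at a time.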
Taking fewer steps is not an obvious option either. I need the chain to have converged before I can start sampling from it, and I really do need the right number of effective samples to get a reasonable result. If the problem is that the chain takes longer to converge on some data, there is not much I can do about it.
But I suspect that the chain really has converged and that it is just my diagnostics that get confused. The way I determine convergence is somewhat ad hoc, as all such methods are: I sample the numerical parameters in the model and compare the first half of the samples with the second half of a chain run for a fixed number of iterations. If a t-test says the two halves have different means, I assume the chain hasn't converged; if it says they have the same mean, I accept convergence.
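In sketch form (hypothetical names; using Welch's t statistic with a fixed 1.96 cutoff rather than a proper p-value), the diagnostic looks something like:

```python
from math import sqrt

def mean_var(xs):
    # Sample mean and (unbiased) sample variance.
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def halves_agree(trace, cutoff=1.96):
    # Welch's t statistic comparing the first and second half of a
    # parameter trace; accept convergence if the means agree.
    half = len(trace) // 2
    first, second = trace[:half], trace[half:]
    m1, v1 = mean_var(first)
    m2, v2 = mean_var(second)
    t = (m1 - m2) / sqrt(v1 / len(first) + v2 / len(second))
    return abs(t) < cutoff
```

Note that a test like this implicitly assumes roughly independent samples, which is exactly where it can get into trouble.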
Now, I have noticed that when the chain needs more iterations before the diagnostics accept convergence, the number of samples needed to reach a fixed effective sample size is also very high. In other words, in the cases where the burn-in takes a long time, the autocorrelation between samples is higher.
If the chain moves slowly through the state space, then even if it has converged I will probably see different means between the first and second half of a run.
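The effective sample size can be estimated from the trace's autocorrelations with the standard N / (1 + 2·Σρ) formula, truncating the sum at the first non-positive autocorrelation. A rough sketch (illustrative, not HapCluster's code):

```python
def autocorr(xs, lag):
    # Sample autocorrelation of the trace at the given lag.
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(xs):
    # ESS = N / (1 + 2 * sum of autocorrelations), truncating the sum
    # at the first lag whose autocorrelation is zero or negative.
    n = len(xs)
    rho_sum = 0.0
    for lag in range(1, n // 2):
        rho = autocorr(xs, lag)
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)
```

A slowly mixing chain has large positive autocorrelations, so the same number of iterations buys far fewer effective samples.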
My idea now is to increase the thinning between samples during burn-in, based on estimates of the effective sample size. This will increase the number of iterations in some cases, but if my theory is correct, I will avoid the situation where the chain has converged but my test keeps rejecting it.
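A minimal sketch of that adjustment, using only the lag-1 autocorrelation and an AR(1) approximation to the effective sample size (the names and the exact rule are illustrative, not necessarily what HapCluster will do):

```python
from math import ceil

def lag1_autocorr(xs):
    # Sample autocorrelation of the trace at lag 1.
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    cov = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1)) / n
    return cov / var

def thinning_interval(trace):
    # Under an AR(1) approximation, ESS is about N * (1 - rho) / (1 + rho),
    # so keeping every k-th sample with k = (1 + rho) / (1 - rho) gives
    # roughly independent draws; rho is capped to keep k finite.
    rho = min(max(lag1_autocorr(trace), 0.0), 0.99)
    return max(1, ceil((1 + rho) / (1 - rho)))

# During burn-in, re-estimate k on the current trace and then only
# record every k-th iteration: thinned = trace[::k]
```

With nearly independent samples the interval stays at 1; with a sticky chain it grows, so the split-half test sees samples that are closer to independent.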
Let's see how it works...