Posts Tagged ‘MCMC’

Mixing problems

Friday, May 1st, 2009

No, not mixing a lot of different problems, but problems with mixing.  In an MCMC.

So, I’ve implemented this quantitative trait model in HapCluster this week, and I am now testing how it works.  The first few tests when pretty well, but now I’m running into problems with some other data sets.

HapCluster is a haplotype based fine mapping tool that essentially has a single parameter of interest, x, that is the position of a trait marker (either a risk marker for a disease or in the new model a marker affecting a quantitative trait).  The model has a lot of nuisance parameters as well, used for comparing haplotypes and clustering them locally around x.

Only three of the parameters are continuous real values, x, \delta and \gamma.  These are easy to work with (compared to clustering of haplotypes or inferred ancestral haplotypes and parameters like that), so I use these three to check for convergence and to get an idea about the effective sample size.

A run with HapCluster starts out with two “burn in” runs, used to test convergence and estimate the effective sample size.  I do these to runs, and then test if the distribution of the three parameters is the same in the two runs.  If they are, I conclude that the chain has converged, otherwise I do a third run, compare it with the second run, and so forth.

After each run, I increase the thinning – that is the number of parameter points I skip between each sample.  If the chain is mixing poorly, I need more data points to get an estimate of the true distribution, and I can either get that by making longer runs, or by increasing the thinning to reduce the auto-correlation.  It is the latter I use in the burn-in period.

If, after a number of burn in iterations, I conclude that the chain has converged, I estimate the effective sample size (which is smaller than the actual sample size due to the auto-correlation) and then figure out how many samples I need to make to get the desired effective sample size, and then I run a final chain with that many iterations (using the thinning from the last burn in iteration).

All this works pretty well for case/control data, except that the \delta and \gamma parameters some times are mixing poorly (have a high auto-correlation).  Not too bad, though, and usually the increased thinning in a burn in iteration or three solves the problem.  In any case, they are nuisance parameters and do not seem to affect x much anyway.

With the quantitative traits model, though, it looks like it is x that has problems with mixing, and worse, it gets stuck far from the trait position in some cases.

In the plot here, the dashed red line is the position of the trait marker and the different colours in the parameter estimates corresponds to different burn in iterations.

I’m not sure exactly how to deal with this.  Mixing is always causing me problems when I work with MCMCs…

121-136=-15

Phylogenetic inference under recombination using Bayesian stochastic topology selection

Friday, January 16th, 2009

There’s an interesting paper in the current issue of Bioinformatics that I’ve just finished reading:

Phylogenetic inference under recombination using Bayesian stochastic topology selection

Webb et al. Bioinformatics 25(2) 197-203

Abstract

Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths.

Results: We demonstrate our approach on simulated data and also to the genome of a suspected HIV recombinant strain as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.

The paper presents a new method for analysing sequences that have undergone recombination.

When sequences have not undergone recombination, a nice methodology for analysing them is the PhyloHMM (PDF). With this method, you have a hidden Markov model where the emission probability is determined by a phylogeny, and usually computed using Felsenstein’s pruning algorithm.

When there is recombination, the problem is that there are more than one topology for the underlying phylogeny, and if you do not know the topologies you cannot immediately calculate the emission probabilities.

You can instead model the unknown topologies as hidden states. This approach was taken by Husmeier and McGuire (2003) and is also the approach we take in our CoalHMM method (Hobolth et al 2007).

This approach doesn’t scale, however, since the number of possible toplogies grows super-exponential with the number of species.

In this paper the solve the problem by using only a few topologies as states in the HMM, but sampling over all possible topologies to be used, in an MCMC approach.  Ideally the number of topologies should be variable, but that requires a reversible jump MCMC and they haven’t implemented that.  Still, it seems to work very well.

I remember discussing the problem with both Alex and Chris when I was last in Oxford, but back then it didn’t work so well, so I am happy to read that they’ve solved the problems. Properly handling recombination and changing topologies is important for accurate sequence analysis.


A. Webb, J. M. Hancock, C. C. Holmes (2008). Phylogenetic inference under recombination using Bayesian stochastic topology selection Bioinformatics, 25 (2), 197-203 DOI: 10.1093/bioinformatics/btn607
16-29=-13

Bayesian interaction mapping

Thursday, April 3rd, 2008

ResearchBlogging.org
If you want to test for an association between a genetic marker and a disease phenotype, and you want to test this genome wide, you need to test a lot of markers. To separate random association from true association, you need to correct for the number of tests you are doing, and the more tests you are doing the more extreme your test statistic needs to be before you consider it a signal rather than random noise. For a genome wide test, you will need about half a million tests, and that is a lot of tests. It is nothing, however, compared to the number of tests you need if you also want to take interaction between markers into account. The number of pairs of markers grows as the number of markers squared, the number of triples as the number of markers cubed, and so on. If you want to test all pairs of 500,000 markers, you need 124,999,750,000 — almost 125 billion — tests. You need extremely low p-values to call anything significant with that many tests.

Wouldn’t it be great if you could get out of the multiple testing problem? Test everything in a single test and, as if by magic, highlight the probable marginal or interacting signals?

A Bayesian approach

Theoretically, that is possible through Bayesian statistics. There you can build a model that, instead of needing a lot of independent tests, directly tells you the probability that each marker, or combination of markers, contributes to the disease risk.

In this paper, they develop such a model:

Bayesian inference of epistatic interactions in case-control studies

Yu Zhang and Jun S Liu

Nature Genetics 39, 1167 – 1173 (2007)

 

Abstract

Epistatic interactions among multiple genetic variants in the human genome may be important in determining individual susceptibility to common diseases. Although some existing computational methods for identifying genetic interactions have been effective for small-scale studies, we here propose a method, denoted ‘bayesian epistasis association mapping’ (BEAM), for genome-wide case-control studies. BEAM treats the disease-associated markers and their interactions via a bayesian partitioning model and computes, via Markov chain Monte Carlo, the posterior probability that each marker set is associated with the disease. Testing this on an age-related macular degeneration genome-wide association data set, we demonstrate that the method is significantly more powerful than existing approaches and that genome-wide case-control epistasis mapping with many thousands of markers is both computationally and statistically feasible.

The model is very simple and is based on spotting differences in genotype frequencies in cases and controls. It splits the set of all markers into three sets: those with no association with the phenotype, those with marginal (no interaction) association with the disease, and those interacting. For those with no disease association, the model says that the genotype frequencies should be the same for cases and controls. For those with marginal effects, cases and controls have different distributions, but there is no interaction effects, and the last class has frequencies that are 1) different between cases and controls and 2) frequencies for, say, pairs of markers are not just the product of the marginal frequencies.

This isn’t different from any old frequentist approach, but there you would maximise the likelihood under different models and compare them. Alternatively, which is the Bayesian approach, you can give the frequencies a prior distribution (in a setting like this you would use the Dirichlet distribution which is the conjugate prior) and then integrate over all possible frequencies. This gives you the likelihood of the model, without the parameters (the frequencies).

Of course, you still need to figure out how to partition the markers into the three classes. The approach they take to this is to construct an MCMC that can explore the space of possible splits in a way that lets you sample classifications from the distribution you want: the classifications are seen in the proportion that matches their probability of being correct.

Does it work, then?

The model, constructed this way, is very simple. It is all based on simple frequencies so there is nothing complicated going on there, and because of the conjugate Dirichlet priors, it is computationally efficient to compute the different configuration likelihoods, so you can very efficiently move through the state space.

I am still somewhat sceptical, though.

The state space is very large. I mean very very very large. Moving beyond astronomical numbers. Even economical numbers. Big! State! Space! How many ways can you group n elements in three groups? 3n. That’s 3 to the power of n. If n is 100, that is about 5×1047. If n is 500,000 it is as close to infinity as you could possibly want. It’s a pretty large state space to explore, right?

Just because the state space is large doesn’t mean that we cannot explore the important parts of it, of course. We are no going to see most of the points in the state space, so we are going to conclude that a lot of states have zero probability, even if they probably do not, but that is not much of a problem if the probabilities are very small anyway. We only have a problem if we say that at state that should have a high (or moderate) probability actually has zero probability.

The cool thing about MCMCs is that the sometimes can explore the important parts of a state space, even if the state space is extremely large. The only reason they have any kind of success in this paper is exactly because MCMCs are cool this way.

The catch is that the state space needs to have some kind of structure to it for this to work.  There needs to be some kind of “locality” (in lack of a better word) such that if you are close to a high likelihood state you are more likely to move into a more likely one that to an arbitrary one.  If the states have probabilities essentially independent of their neighbours, then the MCMC is essentially just guessing at random, and if you guess at random you are not going to find, say, the two interacting markers out of the 125 billion pairs.  You are not better off with an MCMC than with a deck of Tarot cards.

I think the approach described in the paper is very cool, but in all honesty I don’t think it will work.  I do not believe that there is sufficient structure in the data that it is possible to search the state space in this way.  I’ll be very happy to be proven wrong, though, ’cause it is methods like this that could get us out of “multiple testing hell”.


Zhang, Y., Liu, J.S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39(9), 1167-1173. DOI: 10.1038/ng2110

Painting “go faster” stripes on HapCluster

Friday, March 21st, 2008

These days I’m working a lot on my HapCluster association mapping tool. I’m hoping to make a version that scales to genome wide studies.

My “back of an envelope” calculations based on an older version of HapCluster told me it should be possible to analyse a whole genome data set in about a CPU week, which is roughly the time it takes to phase it using the BEAGLE tool, which is the fastest I know about. Since HapCluster doesn’t need phasing, it would then be just as fast as analysing the data with Blossoc after phasing (which is much faster than using Blossoc on data that isn’t phased).

Unfortunately, some recent changes to HapCluster has made it much slower…

How did I get started with HapCluster?

HapCluster is a method developed by Ed Waldron, David Balding and John Whittaker, all from Imperial College, London (although Ed is now working for a pharma company, as far as I know). The method is described in the paper

Fine Mapping of Disease Genes via Haplotype Clustering.
E.R.B. Waldron, J.C. Whitaker and D.J. Balding. Genetic Epidemiology. 30: 170–179. (2006)

I got involved with the method after attending a talk by Ed in Oxford (and discussing it with him in the pub afterwards — take that, recent Beer study!). He had implemented the method in R and I suggested to re-implement it in C++, mainly because I wanted to make a fair comparison with Blossoc — Blossoc was slightly better at localising disease variants, but a runtime comparison with an R implementation would be a bit unfair.

So anyway, he sent me the R code and I translated it into C++, but along the way got a few ideas to further improvements and started working with those. Later I met up with Ed and David a few times to continue the work. I’m going to visit David for about a week next month as well.

How does HapCluster work?

The method is quite simple, which is probably also why it is reasonably fast. It is based on clustering sequences, based on their local similarity around a possible disease locus, and then assign a likelihood to the clustering based on the way cases and controls are distributed in the clusters.

The disease locus, the clustering, and everything else, is explored using a Markov Chain Monte Carlo (MCMC) method. From this, the posterior probability for the location of the disease locus can be extracted, and association can be tested through the Bayes factor when comparing the HapCluster model with a null model that assumes that cases and controls are distributed independently from the genotype.

It is the latter, the Bayes factor, I plan to use in a genome wide setting. I will run the method in sliding windows along the genome, and for each window extract the Bayes factor.

How did I slow it down?

With the help of two students, I refactored the code to make it easier to extend the method. I had made an unholy mess of template programs to mix different likelihood methods and different types of data without paying a runtime penalty, but that code just turned out too hard to modify, so we re-wrote the code to use dynamic virtual function dispatching (taking care to limit how often we called virtual functions so we wouldn’t slow the code down too much).

It did slow the code down a little bit, but not enough to be a major problem. But then I added code to automatically detect convergence of the Markov chain, and to adjust the number of iterations of the chain to always get the same effective sample size. This is necessary if I want to deal with genome wide data, where I cannot manually examine the output of each run, but now I have a time problem. In some data sets, it takes ages for the diagnostics to accept that the chain has converged.

How do I speed it up?

It seems to me that I have two ways to speed up the method: Make each iteration in the Markov chain faster, or take fewer steps.

Speeding each step up is not as easy as it sounds. I am spending around 80% of the total runtime in the same function, which is the function that compares the local sequences. If I could just speed that function up, I would gain a lot, but the code there is pretty tight so I don’t see any obvious ways of doing it. I am playing with the idea of re-writing the data structure I have the sequences in, so I can exploit parallel instructions on the newer chips, but it is a major change to the tool that might not give me much in the long run, so I will keep that idea for a thesis student some day and not bother with it myself :)

Taking fewer steps is not immediately a good idea. I need the chain to have converged before I can start sampling from it, and I really do need to get the right number of effective samples to get a reasonable result. If the problem is that on some data, the chain takes longer to converge, there is really not much I can do about it.

But I have a suspicion that the chain really has converged and it is just my diagnostics that gets confused. The way I determine convergence is somewhat ad hoc as all such methods are. I sample the numerical parameters I have in the model and compare the first half of the samples with the second half of a chain run for a fixed number of iterations. If a t-test says they have different means, I assume that the chain didn’t converge, but if the t-test says they have the same mean, then I accept convergence.

Now, I have noticed that when the chain takes a larger number of iterations before the diagnostics accept convergence, then the number of samples needed to get a fixed effective number of samples is very high. This means that in the cases where the burn in takes a long time, I have a higher auto-correlation between samples.

If the chain moves slowly through the state space, then even if it has converged I will probably see different means between the first and second half of a run.

My idea now is to increase the thinning between samples, based on estimates of the effective samples size, during burn in. It means that I will increase the number of iterations in some cases, but if my theory is correct, I will not run into the problems where I have converged but my test keeps rejecting this.

Let’s see how it works…

Estimating parameters of speciation models

Monday, February 18th, 2008

Another paper that addresses the speciation process in apes is:

A new approach to estimate parameters of speciation models with application to apes

Becquet and Przeworski

Genome Research 17:1505-1519

Abstract

How populations diverge and give rise to distinct species remains a fundamental question in evolutionary biology, with important implications for a wide range of fields, from conservation genetics to human evolution. A promising approach is to estimate parameters of simple speciation models using polymorphism data from multiple loci. Existing methods, however, make a number of assumptions that severely limit their applicability, notably, no gene flow after the populations split and no intralocus recombination. To overcome these limitations, we developed a new Markov chain Monte Carlo method to estimate parameters of an isolation-migration model. The approach uses summaries of polymorphism data at multiple loci surveyed in a pair of diverging populations or closely related species and, importantly, allows for intralocus recombination. To illustrate its potential, we applied it to extensive polymorphism data from populations and species of apes, whose demographic histories are largely unknown. The isolation-migration model appears to provide a reasonable fit to the data. It suggests that the two chimpanzee species became reproductively isolated in allopatry ~850 Kya, while Western and Central chimpanzee populations split ~440 Kya but continued to exchange migrants. Similarly, Eastern and Western gorillas and Sumatran and Bornean orangutans appear to have experienced gene flow since their splits ~90 and over 250 Kya, respectively.

becquet-przeworski-fig1.pngIn this they develop a method to infer the coalescence parameters in a model that is essentially a population split with migration (click on the figure for details).

The effective population sizes, the Ns, tells us something about the diversity of the species (where NA tells us about the ancestral species). The split time, T, gives us the speciation time, and the migration parameter, m, tells us something about the way the speciation occured (an allopatric vs parapatric model).

As usual for coalescence models, the full likelihood of the parameters is computational demanding to compute, so the authors use summary statistics instead — somewhat like an Approximate Bayesian Computation (ABC) method if you can call it that when you want to match the summaries exactly — and then develop a Markov Chain Monte Carlo (MCMC) method to sample from the likelihood function over the summary statistics.

Based on this model, they then estimate speciation times for sub-species of chimps, gorillas and orangutans.


Citation for Research Blogging:Becquet, C., Przeworski, M. (2007). A new approach to estimate parameters of speciation models with application to apes. Genome Research, 17(10), 1505-1519. DOI: 10.1101/gr.6409707