Archive for May 1st, 2009

Insignificant, but significant, differences

Friday, May 1st, 2009

I have this HMM that lets me estimate speciation times and effective population sizes.  Part of the CoalHMM work we are doing at BiRC, but a completely new model that Julien and I just implemented over Easter.

I am using it to analyse the two orangutan subspecies and there is a slight problem that I am checking to see if it affects the results.

In simulations we have seen that the estimates can be biased, but in general if we have enough states in the HMM the bias shrinks (we can vary the states, that corresponds to coalescence times, so the HMM is really a class of HMMs with different number of states).

For my analysis, I have tried running the HMM on chromosome 22 with different numbers of states to see how it affects the parameter estimates:

The estimates are independent estimates for 1Mbp “chunks” along the genome, and clearly there is a consistent difference in the estimates when varying the number of states.  You can test this with a paired t-test or just look at plots of the differences in the estimates.

So the difference is statistical significant, and except for the recombination rate it seems that increasing the number of states changes the estimates in the same direction.

Compared to the variance in the estimates along the chromosome, though, the differences between using different number of states is tiny.  Completely insignificant, just not statistical insignificant.

Not sure what to make of that, except that the number of states probably does not have much affect on the final analysis results.

121-137=-16

Mixing problems

Friday, May 1st, 2009

No, not mixing a lot of different problems, but problems with mixing.  In an MCMC.

So, I’ve implemented this quantitative trait model in HapCluster this week, and I am now testing how it works.  The first few tests when pretty well, but now I’m running into problems with some other data sets.

HapCluster is a haplotype based fine mapping tool that essentially has a single parameter of interest, x, that is the position of a trait marker (either a risk marker for a disease or in the new model a marker affecting a quantitative trait).  The model has a lot of nuisance parameters as well, used for comparing haplotypes and clustering them locally around x.

Only three of the parameters are continuous real values, x, \delta and \gamma.  These are easy to work with (compared to clustering of haplotypes or inferred ancestral haplotypes and parameters like that), so I use these three to check for convergence and to get an idea about the effective sample size.

A run with HapCluster starts out with two “burn in” runs, used to test convergence and estimate the effective sample size.  I do these to runs, and then test if the distribution of the three parameters is the same in the two runs.  If they are, I conclude that the chain has converged, otherwise I do a third run, compare it with the second run, and so forth.

After each run, I increase the thinning – that is the number of parameter points I skip between each sample.  If the chain is mixing poorly, I need more data points to get an estimate of the true distribution, and I can either get that by making longer runs, or by increasing the thinning to reduce the auto-correlation.  It is the latter I use in the burn-in period.

If, after a number of burn in iterations, I conclude that the chain has converged, I estimate the effective sample size (which is smaller than the actual sample size due to the auto-correlation) and then figure out how many samples I need to make to get the desired effective sample size, and then I run a final chain with that many iterations (using the thinning from the last burn in iteration).

All this works pretty well for case/control data, except that the \delta and \gamma parameters some times are mixing poorly (have a high auto-correlation).  Not too bad, though, and usually the increased thinning in a burn in iteration or three solves the problem.  In any case, they are nuisance parameters and do not seem to affect x much anyway.

With the quantitative traits model, though, it looks like it is x that has problems with mixing, and worse, it gets stuck far from the trait position in some cases.

In the plot here, the dashed red line is the position of the trait marker and the different colours in the parameter estimates corresponds to different burn in iterations.

I’m not sure exactly how to deal with this.  Mixing is always causing me problems when I work with MCMCs…

121-136=-15