Multiple Testing in Genome-Wide Association Studies via Hidden Markov Models
Monday, August 10th, 2009
I just read a new paper out in “Advance Access” in Bioinformatics:
Multiple Testing in Genome-Wide Association Studies via Hidden Markov Models
Wei et al.
Abstract
Motivation: Genome wide association studies (GWAS) interrogate common genetic variation across the entire human genome in an unbiased manner and hold promise in identifying genetic variants with moderate or weak effect sizes. However, conventional testing procedures, which are mostly p-value based, ignore the dependency and therefore suffer from loss of efficiency. The goal of this article is to exploit the dependency information among adjacent SNPs to improve the screening efficiency in GWAS.
Results: We propose to model the linear block dependency in the SNP data using hidden Markov Models. A compound decision-theoretic framework for testing HMM-dependent hypotheses is developed. We propose a powerful data-driven procedure (PLIS) that controls the false discovery rate (FDR) at the nominal level. PLIS is shown to be optimal in the sense that it has the smallest false negative rate (FNR) among all valid FDR procedures. By re-ranking significance for all SNPs with dependency considered, PLIS gains higher power than conventional p-value based methods. Simulation results demonstrate that PLIS dominates conventional FDR procedures in detecting disease associated SNPs. Our method is applied to analysis of the SNP data from a GWAS of type 1 diabetes. Compared to the BH procedure, PLIS yields more accurate results and has better reproducibility of findings.
Conclusion: The genomic rankings based on the our procedure are substantially different from the rankings based on the p-values. By integrating information from adjacent locations, the PLIS rankings benefit from the increased signal to noise ratio, hence our procedure often has higher statistical power and better reproducibility. This provide a promising direction in large-scale GWAS.
Summary
The topic is multiple testing correction in genome wide association studies (GWAS), which is probably one of the most important issues in such studies. With the very large number of tests – typically hundreds of thousands to a million – you need to correct your significance threshold to avoid drowning in false positives.
The false discovery rate (FDR) method is a way of doing this. Essentially, it ranks the p-values and then rejects the k smallest, where k is the largest rank at which the k-th smallest p-value is still no larger than k/m times the desired FDR level (m being the number of tests). Doing this in a GWAS ignores the dependency between tests caused by linkage disequilibrium, however, and this paper improves on this by taking the dependency into account.
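To make the p-value based baseline concrete, here is a minimal sketch of the Benjamini-Hochberg step-up rule (the "BH procedure" the abstract compares against). The function name and the default level are my own choices for illustration, not something taken from the paper.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of the hypotheses rejected by the BH step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices, smallest p-value first
    thresholds = alpha * np.arange(1, m + 1) / m   # k/m * alpha for rank k
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its threshold
        rejected[order[:k + 1]] = True             # reject everything up to that rank
    return rejected
```

Note that the rule treats each p-value in isolation: shuffling the markers along the genome would give exactly the same set of rejections, which is the dependency-blindness the paper sets out to fix.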
They do this by fitting the data to a hidden Markov model where the hidden states are associated/not-associated and the emissions are z-values (the null for not associated and a mixture of normals for associated). From this they can get a posterior probability of association/non-association for each marker, conditional on the test statistics for all markers in the genome:
$P(\theta_i = 1 \mid z) = 1 - P(\theta_i = 0 \mid z)$
where $\theta_i$ is the state at marker $i$ (0 if the marker is not associated with the disease and 1 if it is) and $z$ is the vector of test statistics for all markers.
Now they consider all the $P(\theta_i = 0 \mid z)$, that is all the posterior probabilities of not being associated, order them from smallest to largest, and pick markers as long as the running average of the picked probabilities stays below the FDR threshold.
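The two steps can be sketched roughly as follows. This is not the authors' implementation: I assume a two-state chain with known parameters (the paper estimates them from the data), a single normal instead of a mixture for the associated state, and a selection rule based on the running mean of the sorted posterior null probabilities, which is how I read the paper's step-up rule. All function and parameter names are hypothetical.

```python
import numpy as np

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_null(z, trans, init, mu1, sd1):
    """P(state_i = 0 | all z) for each marker, via scaled forward-backward.

    trans[a, b] is the probability of moving from state a to state b,
    init is the state distribution at the first marker.
    """
    z = np.asarray(z, dtype=float)
    trans = np.asarray(trans, dtype=float)
    init = np.asarray(init, dtype=float)
    n = z.size
    # Column 0: null density N(0, 1); column 1: assumed alternative N(mu1, sd1).
    emis = np.column_stack([normal_pdf(z, 0.0, 1.0), normal_pdf(z, mu1, sd1)])
    fwd = np.zeros((n, 2))
    bwd = np.ones((n, 2))
    fwd[0] = init * emis[0]
    fwd[0] /= fwd[0].sum()
    for i in range(1, n):                      # forward pass, rescaled per step
        fwd[i] = (fwd[i - 1] @ trans) * emis[i]
        fwd[i] /= fwd[i].sum()
    for i in range(n - 2, -1, -1):             # backward pass, rescaled per step
        bwd[i] = trans @ (emis[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)    # normalise to posterior probabilities
    return post[:, 0]

def select_markers(null_posteriors, alpha=0.05):
    """Pick markers while the running mean of sorted null posteriors is <= alpha."""
    lis = np.asarray(null_posteriors, dtype=float)
    order = np.argsort(lis)                    # most convincing markers first
    running_mean = np.cumsum(lis[order]) / np.arange(1, lis.size + 1)
    ok = running_mean <= alpha
    selected = np.zeros(lis.size, dtype=bool)
    if ok.any():
        k = np.max(np.nonzero(ok)[0])
        selected[order[:k + 1]] = True
    return selected
```

The point of the forward-backward pass is that a marker's posterior depends on its neighbours' z-values as well as its own, so a moderately large z-value sitting in a run of large ones is ranked higher than the same value standing alone.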
They say that this approach 1) guarantees that the false discovery rate is below the threshold and 2) is optimal in the sense that, among methods with that guarantee, it has the fewest false negatives. They refer to an appendix for the proof, though, and that appendix is not in the paper, so I cannot really check it. I would have loved to, since I want to know which assumptions about the data underlie this proof, but no matter…
Anyway, on with the summary.
They now validate the method with two simulation setups: one based on data simulated by a hidden Markov model, so matching the inference method, and one with more realistic data. For the first simulation study they show that the FDR guarantees are met and that the new method is more sensitive than those they compare it with. For the second simulation study they essentially only show that it ranks true associations better than plain p-values.
They apply the method to a real data set and show that they are more successful in ranking markers that can be replicated in a replication cohort, again compared to plain p-values.
The good
First of all, I think it is an important problem to attack. Doing so while taking the correlation between markers – and through that the correlation between their test statistics – into account is definitely the way to go.
Using hidden Markov models is also very sensible. They are computationally efficient, usually easy to extend in various ways, and well founded in statistics so results are (relatively) easy to relate to.
The bad
I do have some problems with the method, though. First some minor issues.
If the method really does compute the posterior probability of a marker being associated with the disease, then you would expect to consider all markers $i$ where the posterior probability of association exceeds that of non-association, $P(\theta_i = 1 \mid z) > P(\theta_i = 0 \mid z)$, since those are the markers that are more likely to be associated than not associated! Picking only some of them means that for the rest of them you are essentially betting on a hypothesis less likely than the one you reject.
The issue here is, of course, that the value computed is in fact not the posterior probability of being associated. The prior probability of association versus non-association is probably not taken into account. If it is, I couldn’t find any mention of it, at least.
If you include this prior belief you could just include the prior odds in your test and you would have a different approach to judging significance. In practice it probably doesn’t matter much, so it is more an objection of aesthetics.
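To spell out what including the prior odds would mean, this is just Bayes' rule written in odds form. The notation is my own: $\theta_i$ is the state at marker $i$ (1 for associated, 0 for not) and $z$ is the vector of test statistics.

```latex
\frac{P(\theta_i = 1 \mid z)}{P(\theta_i = 0 \mid z)}
  = \frac{P(z \mid \theta_i = 1)}{P(z \mid \theta_i = 0)}
    \times
    \frac{P(\theta_i = 1)}{P(\theta_i = 0)}
```

As a purely hypothetical illustration, with prior odds of 1 to 9999 for association, the likelihood ratio has to exceed 9999 before a marker is more likely associated than not.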
The ugly
Ok, now we come to the part of the paper I really didn’t like: the simulation studies used to validate the method.
The first simulation study, I just can’t put much trust in. Not that I think there is anything dodgy in the results reported, but the simulations are from hidden Markov models matching the inference method, and there is no way that real data is generated that way. There is nothing wrong with modelling the data as hidden Markov models – even if it is not generated by a process that even remotely resembles it – since it is just an analysis strategy anyway, but the simulation validation based on an unrealistic assumption is not particularly convincing…
The second simulation study is a more convincing setup, since here it is real LD data and a more realistic disease model. However, here the FDR is not reported, only the sensitivity of seeing a marker in LD with a causal marker in the top K of the ranking. There is a better ranking with this approach than there is with just the p-values of the individual tests, but that says nothing about what the false discovery rate is. So based on the results presented, I have no way of knowing what FDR to expect on realistic data (or how often I get a real hit below the FDR threshold for that matter).
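The difference between the two measures is easy to see in a small sketch. This assumes a simulation where you know which markers are true hits, and the function names are mine, not the paper's. Top-K sensitivity only looks at the order of the scores, while the false discovery proportion depends on where the threshold actually cuts.

```python
import numpy as np

def top_k_sensitivity(scores, is_true, k):
    """Fraction of the true markers found among the k best scores (lower is better)."""
    scores = np.asarray(scores, dtype=float)
    is_true = np.asarray(is_true, dtype=bool)
    top = np.argsort(scores)[:k]               # indices of the k smallest scores
    return is_true[top].sum() / is_true.sum()

def false_discovery_proportion(selected, is_true):
    """Fraction of the selected markers that are not true hits (0 if none selected)."""
    selected = np.asarray(selected, dtype=bool)
    is_true = np.asarray(is_true, dtype=bool)
    n_selected = selected.sum()
    if n_selected == 0:
        return 0.0
    return (selected & ~is_true).sum() / n_selected
```

Two methods that order the markers identically get the same top-K sensitivity for every K, yet if one produces systematically smaller scores it will select more markers at the same threshold and can have a very different false discovery proportion. That is why a good ranking result does not validate the FDR claim.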
I think this is a major problem with the validation of the method. It is really only validated on data that is unlikely to resemble real GWAS data. At least, the part of the method that has to do with FDR – the ranking results are okay.
Summary
Ok, I don’t want to end up sounding all negative. It just looks that way since I ordered the criticism good -> bad -> ugly.
I stand by the criticism – I do think there are some problems with the validation – but all in all I like the method and I will definitely keep it in mind for my own future work. I will look at that, as soon as they put up the source on CRAN (the paper just says that it will be made available there, but not when).
The main problem I have with the validation is really only the claims about false discovery rate. Ok, since that is what the method is supposed to handle, that is a major problem, but as a method for ranking markers it looks pretty good. Taking neighbouring markers into account in the analysis is what does this, I think, and is what we have also observed with our methods.
…and, you know, if you have a good ranking, maybe the false discovery rate isn’t all that important! We only trust markers we validate in a replication data set anyway, and we will probably try to validate “top k” rather than all markers below a certain false discovery rate. So in practical terms, the ranking is probably much more important than the multiple test correction.
Not so in the replication, of course; there you have to be strict about significance. But I’m not sure we need to be that strict in the initial discovery data set, as long as we don’t try to replicate thousands and thousands of markers, and we are probably not going to do that anyway.
I started my review by saying that correcting for multiple testing is very important in GWAS, and it is. It is very worthwhile to develop methods for it, but improving the ranking of markers is a problem that is even more important.
–
Wei, Z., Sun, W., Wang, K., & Hakonarson, H. (2009). Multiple Testing in Genome-Wide Association Studies via Hidden Markov Models. Bioinformatics. DOI: 10.1093/bioinformatics/btp476