In our association mapping journal club a few weeks back, we discussed this paper (I just never got around to writing down my thoughts on it until now):
Hoggart, Whittaker, De Iorio and Balding, PLoS Genetics 2008
Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.
I had already heard about the method when I visited Imperial College to give a seminar last year, so I am happy that I can finally talk about it.
It is a pretty neat idea.
Regression analysis in association mapping
If you want to figure out which predictors are important for predicting some property, a good old statistical approach is regression analysis.
For a binary property, such as case or control in an association study, you could use logistic regression, but in general you construct some linear function of your predictors and transform it into the “property space” through a link function.
This setup gives you a “model”, and depending on the link and the setup you have different ways of interpreting this as a statistical model with a corresponding likelihood function.
The coefficients in the linear combination of predictors are the parameters of the model, and you typically maximize the likelihood with respect to them to get your estimates.
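To make the setup concrete, here is a minimal sketch of maximum-likelihood logistic regression on simulated genotypes. Everything in it is an assumption made for illustration: the sample size, the allele frequency, the effect size and the helper name `fit_logistic` are all hypothetical, and Newton-Raphson is just one standard way to maximise the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated genotypes: 2000 individuals, 5 SNPs coded as 0/1/2 minor-allele counts.
n, p = 2000, 5
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)

# Hypothetical truth: only SNP 0 changes the log-odds of disease (additive effect).
true_beta = np.array([0.8, 0.0, 0.0, 0.0, 0.0])
intercept = -1.0
prob = 1.0 / (1.0 + np.exp(-(intercept + X @ true_beta)))
y = rng.binomial(1, prob)  # 1 = case, 0 = control


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(A @ beta)))     # inverse of the logit link
        grad = A.T @ (y - mu)                      # gradient of the log-likelihood
        weights = mu * (1.0 - mu)
        hess = A.T @ (A * weights[:, None])        # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    return beta


beta_hat = fit_logistic(X, y)  # beta_hat[0] is the intercept, beta_hat[1:] the SNPs
print(np.round(beta_hat, 3))
```

The estimate for the causal SNP should land near its simulated value, while the estimates for the other SNPs are small but not exactly zero.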
In some cases you can directly interpret the coefficients, but more often than not you are only interested in knowing whether there is strong evidence in the data that they should be non-zero, i.e. that the predictor in question actually has an effect on the property.
In an association study, you would use your SNPs as your predictors, and you would consider those SNPs with a non-zero coefficient to be associated with the disease.
Of course, it is never as simple as that.
Two things complicate matters. First, your best estimate of a coefficient will never be exactly zero, so you want to test whether it is significantly different from zero. Second, you have many more predictors (SNPs) than you have observations (individuals), so you will overfit from hell.
Strong “zero” priors
What they do in this paper is both simple and very clever.
They consider the problem in a Bayesian setting and put strong priors on the coefficients that will tend to keep them at zero unless the signal in the data is strong enough to pull them away from there.
They then test for association by checking whether the posterior modes of these coefficients have moved away from zero.
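One way to write the idea down is sketched below. The notation is assumed here and not taken from the paper: $y_i$ is the disease status of individual $i$, $x_{ij}$ the genotype at SNP $j$, $\beta_j$ the coefficients, $L$ the likelihood, $p(\cdot)$ the prior density and $\lambda$ a prior scale.

```latex
\begin{align}
\operatorname{logit} P(y_i = 1) &= \beta_0 + \sum_{j} \beta_j x_{ij} \\
\hat{\beta} &= \arg\max_{\beta} \Big[ \log L(\beta) + \sum_{j} \log p(\beta_j) \Big] \\
p(\beta_j) \propto e^{-\lambda |\beta_j|}
  \;&\Rightarrow\;
  \hat{\beta} = \arg\max_{\beta} \Big[ \log L(\beta) - \lambda \sum_{j} |\beta_j| \Big]
\end{align}
```

The last line uses a Laplace (double-exponential) prior purely as the simplest example of a prior with a sharp mode at zero. The abstract singles out the normal-exponential-gamma prior as the one that improves SNP selection.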
A very nice consequence of this is that you can analyse the entire data set at the same time, rather than testing markers individually, which means that if several markers are in LD with a causal marker, you will tend to pick only one of them and recognize that the signal in the others is essentially the same signal.
It also seems quite computationally feasible: a few hours on a desktop computer to analyse a GWA data set.
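The LD point can be illustrated with a toy simulation. This is only a sketch under loud assumptions: it swaps in a Laplace prior and a plain proximal-gradient optimiser, which are not the prior the paper favours nor its search algorithm, and the data, the threshold of 4, the penalty `lam` and the names `ld_copy` and `sparse_posterior_mode` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_null = 2000, 20


def ld_copy(g, flip=0.2):
    """Noisy copy of a genotype vector: a crude stand-in for a SNP in LD."""
    out = g.copy()
    mask = rng.random(len(g)) < flip
    out[mask] = rng.binomial(2, 0.3, size=int(mask.sum()))
    return out


# SNP 0 is causal, SNPs 1 and 2 are in LD with it, the rest are unrelated.
causal = rng.binomial(2, 0.3, size=n)
X = np.column_stack([causal, ld_copy(causal), ld_copy(causal),
                     rng.binomial(2, 0.3, size=(n, n_null))]).astype(float)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 0.9 * causal))))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise each SNP

# Single-SNP analysis: sqrt(n) times the correlation of each SNP with disease status.
z = Xs.T @ (y - y.mean()) / (np.sqrt(n) * y.std())
marginal_hits = np.flatnonzero(np.abs(z) > 4.0)


def sparse_posterior_mode(Xs, y, lam=0.08, n_iter=3000):
    """Posterior mode under a Laplace prior, found by proximal gradient descent."""
    n, p = Xs.shape
    A = np.column_stack([np.ones(n), Xs])
    # Step size 1/L, where L bounds the curvature of the mean negative log-likelihood.
    step = 4.0 * n / np.linalg.eigvalsh(A.T @ A).max()
    theta = np.zeros(p + 1)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(A @ theta)))
        theta = theta - step * (A.T @ (mu - y)) / n  # gradient step on the likelihood
        b = theta[1:]                                # the intercept is not penalised
        theta[1:] = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft-threshold
    return theta[1:]


beta = sparse_posterior_mode(Xs, y)
selected = np.flatnonzero(beta)
print("single-SNP hits:", marginal_hits)
print("non-zero in joint fit:", selected)
```

The single-SNP scan flags the whole LD block, whereas the joint fit sets most coefficients to exactly zero and tends to keep fewer markers from the block. How many it keeps depends on the penalty and on how tight the LD is.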
Clive J. Hoggart, John C. Whittaker, Maria De Iorio, David J. Balding (2008). Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies. PLoS Genetics, 4 (7). DOI: 10.1371/journal.pgen.1000130