Simultaneous analysis of all SNPs in a genome-wide association study

In our association mapping journal club a few weeks back, we discussed this paper (I just never got around to writing down my thoughts on it until now):

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies

Hoggart, Whittaker, De Iorio and Balding, PLoS Genetics 2008

Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.

I had already heard about the method when I visited Imperial College to give a seminar last year, so I am happy that I can finally talk about it.

It is a pretty neat idea.

Regression analysis in association mapping

If you want to figure out which variables are important for predicting some property, a good old statistical approach is regression analysis.

For a binary property, such as case or control in an association study, you could use logistic regression, but in general you construct some linear function of your variables and map it into the “property space” through a link function.

This setup gives you a “model” and depending on the link and the setup you have different ways of interpreting this as a statistical model with a corresponding likelihood function.

The coefficients in the linear combination are the parameters of the model, and you typically maximize the likelihood with respect to them to get your estimates.
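For the case/control setting, the model and its log-likelihood can be written out as follows. The notation is mine rather than the paper's: y_i is 1 for a case and 0 for a control, x_ij is the genotype of individual i at SNP j (for instance the minor-allele count under an additive coding), and there are n individuals and m SNPs.

```latex
\[
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij},
\qquad p_i = \Pr(y_i = 1 \mid x_{i1}, \dots, x_{im})
\]
\[
\ell(\beta) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr]
\]
```

Maximizing the log-likelihood over the coefficients gives the usual maximum likelihood estimates.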

In some cases you can directly interpret the coefficients, but more often than not you are only interested in knowing whether there is strong evidence in the data that they should be non-zero, i.e. that the variable in question actually has an effect on the property.

In an association study, you would use your SNPs as the variables and consider the SNPs with non-zero coefficients to be associated with the disease.

Of course, it is never as simple as that.

Two things complicate matters. First, your best estimate of a coefficient will never be exactly zero, so you need to test whether it is significantly different from zero. Second, you have many more parameters (SNPs) than outcomes (individuals), so you will overfit like hell.

Strong “zero” priors

What they do in this paper is both simple and very clever.

They consider the problem in a Bayesian setting and put strong priors on the coefficients that will tend to keep them at zero unless the signal in the data is strong enough to pull them away from there.

They then test for association by checking whether the posterior modes of these coefficients have moved away from zero.
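In formula form, the posterior mode is a penalised maximum likelihood estimate. This is my notation, not the paper's: ℓ(β) is the log-likelihood of the regression model and π is the prior density put on each SNP coefficient. The second line shows what this becomes for a Laplace (double-exponential) prior, which I use only as the most familiar example of a prior with a sharp mode at zero. I have not reproduced the density of the normal-exponential-gamma prior that the abstract favours.

```latex
\[
\hat{\beta} = \arg\max_{\beta} \Bigl\{ \ell(\beta) + \sum_{j=1}^{m} \log \pi(\beta_j) \Bigr\}
\]
\[
\pi(\beta_j) \propto \exp\bigl(-\lambda \lvert \beta_j \rvert\bigr)
\quad\Longrightarrow\quad
\hat{\beta} = \arg\max_{\beta} \Bigl\{ \ell(\beta) - \lambda \sum_{j=1}^{m} \lvert \beta_j \rvert \Bigr\}
\]
```

The kink of the penalty at zero is what lets the mode sit at exactly zero for a SNP with weak signal, so "non-zero coefficient" becomes a meaningful selection criterion.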

A very nice consequence of this is that you can analyse the entire data set at once, rather than testing markers individually. If several markers are in LD with a causal marker, you will tend to pick only one of them and recognize that the signal in the others is essentially the same signal.

It also seems quite computationally feasible: a few hours on a desktop computer to analyse a GWA data set.
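To get a feel for how a prior with a sharp mode at zero keeps most coefficients at exactly zero, here is a minimal sketch in plain Python. It is not the method from the paper: it swaps in a Laplace prior (an L1 penalty) and a simple proximal gradient ascent instead of the normal-exponential-gamma prior and the authors' search algorithm. The function names, the simulated data, the effect size and the penalty values are all made up for illustration.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero; anything within threshold of zero becomes 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_sparse_logistic(genotypes, status, penalty, step=0.05, iterations=300):
    """Posterior mode under independent Laplace priors, by proximal gradient ascent.

    genotypes: one row per individual, minor-allele counts 0/1/2 (additive coding).
    status: 0 for control, 1 for case.
    Returns (intercept, coefficients); the intercept is not penalised.
    """
    n, m = len(genotypes), len(genotypes[0])
    intercept, beta = 0.0, [0.0] * m
    for _ in range(iterations):
        # Residuals y_i - p_i under the current coefficients.
        resid = []
        for row, y in zip(genotypes, status):
            eta = intercept + sum(b * x for b, x in zip(beta, row))
            resid.append(y - sigmoid(eta))
        # Plain gradient step on the mean log-likelihood for the intercept.
        intercept += step * sum(resid) / n
        for j in range(m):
            grad_j = sum(r * row[j] for r, row in zip(resid, genotypes)) / n
            # Gradient step followed by the shrinkage that the Laplace prior implies.
            beta[j] = soft_threshold(beta[j] + step * grad_j, step * penalty)
    return intercept, beta


def simulate(n=120, m=20, causal=7, effect=1.2, seed=1):
    """Hypothetical toy data: independent SNPs, one of which affects disease risk."""
    rng = random.Random(seed)
    genotypes, status = [], []
    for _ in range(n):
        row = [rng.choice((0, 1, 2)) for _ in range(m)]
        p_case = sigmoid(-effect + effect * row[causal])
        genotypes.append(row)
        status.append(1 if rng.random() < p_case else 0)
    return genotypes, status


if __name__ == "__main__":
    genotypes, status = simulate()
    for penalty in (10.0, 0.1):
        _, beta = fit_sparse_logistic(genotypes, status, penalty)
        selected = [j for j, b in enumerate(beta) if b != 0.0]
        print(f"penalty={penalty}: SNPs with non-zero coefficient: {selected}")
```

With the large penalty every coefficient stays at exactly zero, because the gradient of the mean log-likelihood can never exceed the shrinkage. With the small penalty some coefficients can escape from zero; which ones depends on the simulated data, so I am not quoting any output here. Note that this toy version has independent SNPs, so it says nothing about how the real method behaves under LD.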

Clive J. Hoggart, John C. Whittaker, Maria De Iorio, David J. Balding (2008). Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies. PLoS Genetics, 4 (7). DOI: 10.1371/journal.pgen.1000130

New genetic variant discovered, associated with bladder cancer

I just saw this press release: deCODE and Radboud University Discover Common Variants in the Human Genome Conferring Risk of Bladder Cancer

We, here at BiRC, actually collaborate with both deCODE and Radboud Uni in the EU PolyGene project. The bladder cancer analysis is not part of PolyGene, but through the collaboration we have access to it, and we have just started analysing it.

We weren’t in on the initial analysis, though, so we are not part of this discovery. We only get access to data after they have already mined what they can find themselves. A bit annoying, but perfectly reasonable. Our contribution to the collaboration is methods development, and anything they can find with the methods they already have, they do not really need us for.

Still, it would have been nice to be in on the analysis from the beginning. Whenever we get our hands on the data, we always get excited about hits, only to discover that they have already been submitted for publication.

Anyway, nice to see that they get something out of the data.