
If you want to test for an association between a genetic marker and a disease phenotype, and you want to test this genome-wide, you need to test a lot of markers. To separate random association from true association, you need to correct for the number of tests you are doing, and the more tests you are doing the more extreme your test statistic needs to be before you consider it a signal rather than random noise. For a genome-wide test, you will need about half a million tests, and that is a lot of tests. It is nothing, however, compared to the number of tests you need if you also want to take interaction between markers into account. The number of pairs of markers grows as the number of markers squared, the number of triples as the number of markers cubed, and so on. If you want to test all pairs of 500,000 markers, you need 124,999,750,000 tests, almost 125 billion. You need extremely low p-values to call anything significant with that many tests.
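To put numbers on this, here is a small sketch of how the significance threshold shrinks with the number of tests. It assumes a plain Bonferroni correction and a family-wise error rate of 0.05; both are my choices for illustration, not something prescribed by any particular study.

```python
from math import comb

n_markers = 500_000
alpha = 0.05  # assumed family-wise error rate (illustrative choice)

# Number of single-marker tests, pairs, and triples.
n_pairs = comb(n_markers, 2)
n_triples = comb(n_markers, 3)

# Bonferroni: divide alpha by the number of tests (an assumption,
# other corrections exist and are less conservative).
single_threshold = alpha / n_markers
pair_threshold = alpha / n_pairs

print(f"pairs:   {n_pairs:,}")
print(f"triples: {n_triples:.3e}")
print(f"p-value threshold, single markers: {single_threshold:.1e}")
print(f"p-value threshold, all pairs:      {pair_threshold:.1e}")
```

Going from single markers to pairs moves the threshold from around 1e-7 to around 4e-13, which is the price of the extra tests.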
Wouldn’t it be great if you could get out of the multiple testing problem? Test everything in a single test and, as if by magic, highlight the probable marginal or interacting signals?
A Bayesian approach
Theoretically, that is possible through Bayesian statistics. There you can build a model that, instead of needing a lot of independent tests, directly tells you the probability that each marker, or combination of markers, contributes to the disease risk.
In this paper, they develop such a model:
Bayesian inference of epistatic interactions in case-control studies
Yu Zhang and Jun S Liu
Nature Genetics 39, 1167 – 1173 (2007)
Abstract
Epistatic interactions among multiple genetic variants in the human genome may be important in determining individual susceptibility to common diseases. Although some existing computational methods for identifying genetic interactions have been effective for small-scale studies, we here propose a method, denoted ‘bayesian epistasis association mapping’ (BEAM), for genome-wide case-control studies. BEAM treats the disease-associated markers and their interactions via a bayesian partitioning model and computes, via Markov chain Monte Carlo, the posterior probability that each marker set is associated with the disease. Testing this on an age-related macular degeneration genome-wide association data set, we demonstrate that the method is significantly more powerful than existing approaches and that genome-wide case-control epistasis mapping with many thousands of markers is both computationally and statistically feasible.
The model is very simple and is based on spotting differences in genotype frequencies in cases and controls. It splits the set of all markers into three sets: those with no association with the phenotype, those with marginal (no interaction) association with the disease, and those interacting. For those with no disease association, the model says that the genotype frequencies should be the same for cases and controls. For those with marginal effects, cases and controls have different distributions, but there are no interaction effects. For the last class, the frequencies 1) differ between cases and controls and 2) for, say, pairs of markers, are not just the product of the marginal frequencies.
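The "not just the product of the marginal frequencies" criterion can be made concrete with a toy example. The sketch below takes a 3×3 table of joint genotype counts for two markers and measures how far the joint frequencies are from the product of the marginals. The function name and both tables are made up for illustration; they are not data from the paper.

```python
def independence_gap(joint):
    """Largest absolute difference between the joint genotype
    frequencies and the product of the two marginal frequencies."""
    total = sum(sum(row) for row in joint)
    row_freq = [sum(row) / total for row in joint]
    col_freq = [sum(col) / total for col in zip(*joint)]
    return max(
        abs(joint[i][j] / total - row_freq[i] * col_freq[j])
        for i in range(len(joint))
        for j in range(len(joint[0]))
    )

# Hypothetical counts: rows are genotypes at marker A, columns at marker B.
independent = [[25, 50, 25], [50, 100, 50], [25, 50, 25]]
interacting = [[60, 20, 20], [20, 60, 20], [20, 20, 60]]

gap_independent = independence_gap(independent)
gap_interacting = independence_gap(interacting)
print(gap_independent, gap_interacting)
```

In the first table the joint frequencies factorise, so the gap is essentially zero; in the second the two markers carry information about each other, which is what the interacting class is meant to capture.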
This isn’t different from any old frequentist approach, but there you would maximise the likelihood under different models and compare them. Alternatively, which is the Bayesian approach, you can give the frequencies a prior distribution (in a setting like this you would use the Dirichlet distribution which is the conjugate prior) and then integrate over all possible frequencies. This gives you the likelihood of the model, without the parameters (the frequencies).
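As a rough sketch of what integrating out the frequencies looks like in practice: with a Dirichlet prior on multinomial frequencies, the integral has a closed form in terms of gamma functions. The code below compares "same frequencies in cases and controls" with "different frequencies" for a single marker. The symmetric prior parameter of 0.5 and the genotype counts are my assumptions for illustration, not values taken from the paper.

```python
from math import lgamma

def log_marginal(counts, alpha=0.5):
    """Log probability of the observed genotypes with the frequencies
    integrated out under a symmetric Dirichlet(alpha) prior
    (alpha = 0.5 is an assumed value)."""
    k = len(counts)
    n = sum(counts)
    result = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        result += lgamma(alpha + c) - lgamma(alpha)
    return result

def log_bayes_factor(cases, controls):
    """Log ratio: separate frequencies for cases and controls
    versus one shared set of frequencies."""
    pooled = [a + b for a, b in zip(cases, controls)]
    separate = log_marginal(cases) + log_marginal(controls)
    return separate - log_marginal(pooled)

# Hypothetical genotype counts (three genotypes per marker).
lbf_associated = log_bayes_factor([70, 25, 5], [30, 40, 30])
lbf_null = log_bayes_factor([50, 35, 15], [50, 35, 15])
print(lbf_associated, lbf_null)
```

A positive value favours different frequencies in cases and controls, a negative value favours shared frequencies. No maximisation is involved, and the extra parameters of the more flexible model are penalised automatically.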
Of course, you still need to figure out how to partition the markers into the three classes. The approach they take to this is to construct an MCMC that can explore the space of possible splits in a way that lets you sample classifications from the distribution you want: the classifications are seen in the proportion that matches their probability of being correct.
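To show the general idea, here is a heavily simplified Metropolis sampler over marker labels. It is not the sampler from the paper: I drop the interacting class entirely and keep only "no association" (0) and "marginal association" (1), and the prior probability of association, the Dirichlet parameter, and the counts are all assumptions made up for the example.

```python
import math
import random

def log_marginal(counts, alpha=0.5):
    # Dirichlet-multinomial marginal likelihood (alpha is assumed).
    k, n = len(counts), sum(counts)
    out = math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
    for c in counts:
        out += math.lgamma(alpha + c) - math.lgamma(alpha)
    return out

def log_posterior(labels, cases, controls, prior_assoc=0.01):
    # Unnormalised log posterior of a labelling; prior_assoc is assumed.
    total = 0.0
    for label, ca, co in zip(labels, cases, controls):
        if label == 1:  # separate frequencies for cases and controls
            total += math.log(prior_assoc) + log_marginal(ca) + log_marginal(co)
        else:           # one shared set of frequencies
            pooled = [a + b for a, b in zip(ca, co)]
            total += math.log(1 - prior_assoc) + log_marginal(pooled)
    return total

def sample_labels(cases, controls, n_iter=5_000, burn_in=500, seed=1):
    rng = random.Random(seed)
    m = len(cases)
    labels = [0] * m
    current = log_posterior(labels, cases, controls)
    hits = [0] * m
    for step in range(n_iter):
        i = rng.randrange(m)
        labels[i] = 1 - labels[i]        # propose flipping one label
        proposed = log_posterior(labels, cases, controls)
        delta = proposed - current
        if delta >= 0 or rng.random() < math.exp(delta):
            current = proposed           # accept
        else:
            labels[i] = 1 - labels[i]    # reject, undo the flip
        if step >= burn_in:
            for j, lab in enumerate(labels):
                hits[j] += lab
    # Fraction of sampled labellings in which each marker is "associated".
    return [h / (n_iter - burn_in) for h in hits]

# Hypothetical data: marker 0 differs between cases and controls, the rest do not.
cases = [[70, 25, 5]] + [[50, 35, 15]] * 9
controls = [[30, 40, 30]] + [[50, 35, 15]] * 9
freq = sample_labels(cases, controls)
print(freq)
```

With ten markers and no interactions this is a trivial problem, and the chain finds marker 0 almost immediately. The interesting question is what happens when the interacting class is added back and the number of markers goes up by several orders of magnitude.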
Does it work, then?
The model, constructed this way, is very simple. It is all based on simple frequencies so there is nothing complicated going on there, and because of the conjugate Dirichlet priors, it is computationally efficient to compute the different configuration likelihoods, so you can very efficiently move through the state space.
I am still somewhat sceptical, though.
The state space is very large. I mean very very very large. Moving beyond astronomical numbers. Even economical numbers. Big! State! Space! How many ways can you group n elements in three groups? 3^n. That’s 3 to the power of n. If n is 100, that is about 5×10^47. If n is 500,000 it is as close to infinity as you could possibly want. It’s a pretty large state space to explore, right?
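If you want to check the arithmetic yourself, a few lines of Python will do it. The helper name is mine, and for 500,000 markers I only compute the number of decimal digits rather than the number itself.

```python
from math import log10

def n_labellings(n_markers, n_classes=3):
    """Number of ways to assign each marker to one of n_classes groups."""
    return n_classes ** n_markers

small = n_labellings(100)
print(f"{small:.2e}")  # order of magnitude for 100 markers

# For 500,000 markers, count the decimal digits instead of printing the number.
n_digits = int(500_000 * log10(3)) + 1
print(n_digits)
```

For 500,000 markers the count has well over two hundred thousand digits, so visiting more than a vanishing fraction of the states is out of the question.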
Just because the state space is large doesn’t mean that we cannot explore the important parts of it, of course. We are not going to see most of the points in the state space, so we are going to conclude that a lot of states have zero probability, even if they probably do not, but that is not much of a problem if the probabilities are very small anyway. We only have a problem if we say that a state that should have a high (or moderate) probability actually has zero probability.
The cool thing about MCMCs is that they sometimes can explore the important parts of a state space, even if the state space is extremely large. The only reason they have any kind of success in this paper is exactly because MCMCs are cool this way.
The catch is that the state space needs to have some kind of structure to it for this to work. There needs to be some kind of “locality” (for lack of a better word) such that if you are close to a high likelihood state you are more likely to move into a more likely one than to an arbitrary one. If the states have probabilities essentially independent of their neighbours, then the MCMC is essentially just guessing at random, and if you guess at random you are not going to find, say, the two interacting markers out of the 125 billion pairs. You are not better off with an MCMC than with a deck of Tarot cards.
I think the approach described in the paper is very cool, but in all honesty I don’t think it will work. I do not believe that there is sufficient structure in the data that it is possible to search the state space in this way. I’ll be very happy to be proven wrong, though, ’cause it is methods like this that could get us out of “multiple testing hell”.
Zhang, Y., Liu, J.S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39(9), 1167-1173. DOI: 10.1038/ng2110