Bayesian Graphical Models for Genomewide Association Studies
Thursday, May 15th, 2008![]()
Lately I have been interested in Bayesian approaches for genome wide association mapping. These have the benefit of often being able to consider all data in a single model and from that score markers with their probability of being related to the disease without the usual multiple testing problems.
So for our journal club today, I picked the following paper for us:
Bayesian Graphical Models for Genomewide Association Studies
Verzilli, Stallard and Whittaker
American Journal of Human Genetics 79(1) 100-112Abstract
As the extent of human genetic variation becomes more fully characterized, the research community is faced with the challenging task of using this information to dissect the heritable components of complex traits. Genomewide association studies offer great promise in this respect, but their analysis poses formidable difficulties. In this article, we describe a computationally efficient approach to mining genotype-phenotype associations that scales to the size of the data sets currently being collected in such studies. We use discrete graphical models as a data-mining tool, searching for single- or multilocus patterns of association around a causative site. The approach is fully Bayesian, allowing us to incorporate prior knowledge on the spatial dependencies around each marker due to linkage disequilibrium, which reduces considerably the number of possible graphical structures. A Markov chain–Monte Carlo scheme is developed that yields samples from the posterior distribution of graphs conditional on the data from which probabilistic statements about the strength of any genotype-phenotype association can be made. Using data simulated under scenarios that vary in marker density, genotype relative risk of a causative allele, and mode of inheritance, we show that the proposed approach has better localization properties and leads to lower false-positive rates than do single-locus analyses. Finally, we present an application of our method to a quasi-synthetic data set in which data from the CYP2D6 region are embedded within simulated data on 100K single-nucleotide polymorphisms. Analysis is quick (<5 min), and we are able to localize the causative site to a very short interval.
The idea in the paper is to capture the joint distribution of all genotypes and phenotypes in a graph model — that explicitly capture which model variables are independent and which are not — and from this read off the probability that each genotype is independent from the phenotype or not.
By putting a few restrictions on the topology of the kind of graphs they consider, it is possible to calculate the likelihood of any given graph by, essentially, running through the graph and calculate conditional probabilities of all cliques in the graph, where each clique is modeled as a multinomial distribution over genotypes, possibly together with phenotypes. With a small re-write, this becomes a product of independent probabilities divided by another product of independent probabilities.
p(C1) x p(C2) x … x p(CL) / p(S1) x p(S2) x … x p(SR)
Using Dirichlet priors, the parameters of the multinomial distributions can be integrated out, and the resulting expression is very fast to compute, making it feasible to sample over graphs in an MCMC.
This integral, though, I don’t think is correct. It comes down to those products of independent probabilities. These guys depend on the multinomial parameters, and while the terms in the nominator depend on disjoint parameters and the same for the denominator, there is an overlap in the parameters used in the nominator and denominator that introduces dependencies.
p(C1|θ) x p(C2|θ) x … x p(CL|θ) / p(S1|θ) x p(S2|θ) x … x p(SR|θ)
= p(C1|θ1) x p(C2|θ2) x … x p(CL|θL) / p(S1|θ‘1) x p(S2|θ‘2) x … x p(SR|θ‘R)
When integrating over the set of parameters, θ, if these could be split into disjoint parameters for each of the independent probabilities, the integral could be solved for each term individually — which is easy with a conjugate prior. When the parameters cannot be split in disjoint sets — and here they cannot — then that trick doesn’t work.
In the paper they do it anyway, though.
This just means that their model introduces more independence than it was really supposed to, and that it the model they describe is not really the model they run the MCMC on, but all that really matters is how well the method identifies genotype-phenotype association, and that it seems to be pretty good at.
VERZILLI, C. (2006). Bayesian Graphical Models for Genomewide Association Studies. The American Journal of Human Genetics, 79(1), 100-112. DOI: 10.1086/505313