Archive for May 15th, 2008

Bayesian Graphical Models for Genomewide Association Studies

Thursday, May 15th, 2008

ResearchBlogging.org
Lately I have been interested in Bayesian approaches for genome wide association mapping. These have the benefit of often being able to consider all data in a single model and from that score markers with their probability of being related to the disease without the usual multiple testing problems.

So for our journal club today, I picked the following paper for us:

Bayesian Graphical Models for Genomewide Association Studies
Verzilli, Stallard and Whittaker
American Journal of Human Genetics 79(1) 100-112

Abstract

As the extent of human genetic variation becomes more fully characterized, the research community is faced with the challenging task of using this information to dissect the heritable components of complex traits. Genomewide association studies offer great promise in this respect, but their analysis poses formidable difficulties. In this article, we describe a computationally efficient approach to mining genotype-phenotype associations that scales to the size of the data sets currently being collected in such studies. We use discrete graphical models as a data-mining tool, searching for single- or multilocus patterns of association around a causative site. The approach is fully Bayesian, allowing us to incorporate prior knowledge on the spatial dependencies around each marker due to linkage disequilibrium, which reduces considerably the number of possible graphical structures. A Markov chain–Monte Carlo scheme is developed that yields samples from the posterior distribution of graphs conditional on the data from which probabilistic statements about the strength of any genotype-phenotype association can be made. Using data simulated under scenarios that vary in marker density, genotype relative risk of a causative allele, and mode of inheritance, we show that the proposed approach has better localization properties and leads to lower false-positive rates than do single-locus analyses. Finally, we present an application of our method to a quasi-synthetic data set in which data from the CYP2D6 region are embedded within simulated data on 100K single-nucleotide polymorphisms. Analysis is quick (<5 min), and we are able to localize the causative site to a very short interval.

The idea in the paper is to capture the joint distribution of all genotypes and phenotypes in a graph model — that explicitly capture which model variables are independent and which are not — and from this read off the probability that each genotype is independent from the phenotype or not.

By putting a few restrictions on the topology of the kind of graphs they consider, it is possible to calculate the likelihood of any given graph by, essentially, running through the graph and calculate conditional probabilities of all cliques in the graph, where each clique is modeled as a multinomial distribution over genotypes, possibly together with phenotypes. With a small re-write, this becomes a product of independent probabilities divided by another product of independent probabilities.

p(C1) x p(C2) x … x p(CL) / p(S1) x p(S2) x … x p(SR)

Using Dirichlet priors, the parameters of the multinomial distributions can be integrated out, and the resulting expression is very fast to compute, making it feasible to sample over graphs in an MCMC.

This integral, though, I don’t think is correct. It comes down to those products of independent probabilities. These guys depend on the multinomial parameters, and while the terms in the nominator depend on disjoint parameters and the same for the denominator, there is an overlap in the parameters used in the nominator and denominator that introduces dependencies.

p(C1|θ) x p(C2|θ) x … x p(CL|θ) / p(S1|θ) x p(S2|θ) x … x p(SR|θ)

= p(C1|θ1) x p(C2|θ2) x … x p(CL|θL) / p(S1|θ1) x p(S2|θ2) x … x p(SR|θR)

When integrating over the set of parameters, θ, if these could be split into disjoint parameters for each of the independent probabilities, the integral could be solved for each term individually — which is easy with a conjugate prior. When the parameters cannot be split in disjoint sets — and here they cannot — then that trick doesn’t work.

In the paper they do it anyway, though.

This just means that their model introduces more independence than it was really supposed to, and that it the model they describe is not really the model they run the MCMC on, but all that really matters is how well the method identifies genotype-phenotype association, and that it seems to be pretty good at.


VERZILLI, C. (2006). Bayesian Graphical Models for Genomewide Association Studies. The American Journal of Human Genetics, 79(1), 100-112. DOI: 10.1086/505313

Correcting machine learning hand-ins

Thursday, May 15th, 2008

I’ve been correcting the hand-ins for the first project in my machine learning class. It is a very simple exercise where the students are given five data sets with predictor variables and target values, and from them they need to train a model (just using linear regression) and then predict targets for new predictor values.

They can transform the predictor variables in any way they want to come up with a good set of basis function to then use in the linear model, and this is really the only tricky part in the exercise. After that, it is a simple programming task to fit the model.

Some of the data sets are easy enough to work with, like the first one that is a simple line with gaussian error around it. This is the typical linear regression setup. Other data sets are harder to figure out, but the take-home message is that it doesn’t really matter so much if you can work out the true model specification, as long as you can make predictions better than mere guessing (although, in one of the data sets the predictors and targets are independent, just to show that that is also always a possibility).

The only measure that matters is the prediction accuracy on the new data, and that is what I have been looking at for the hand-ins.  I want to reduce this to a single score so I can pick a “winner” and give him a little prize.

For each dataset I’ve reduced the prediction to a single score, by taking the square-root if the sum if errors squared and then divided it with the mean of the true target values.  This scales the errors to “standard errors” so I can compare the individual models.

Still, some models are much harder to make predictions  about, so to take that into account, I’ve taken the mean of the errors in the hand-ins for each model and divided the individual errors with that.  That way, the difficult models count for less than the easy ones.  The waited sum of errors is then the final score.

Lost a finger nail and taking a break

Thursday, May 15th, 2008

You might be wondering why I’ve just been posting video clips the last couple of days rather than my usual posts. Well, in the weekend I fell on a chair and tore off a finger nail, so typing is a bit slow for me these days. I’m essentially using only my left hand, although I’m slowly getting used to using only three fingers on my right hand (the nail on the ring finger is missing and my finger is wrapped up in bandages so using that finger and the pinky is a problem).

It’s like learning to type all over again, and rather slow, so I’ve been taking a break for a few days. I haven’t gotten much work done either, but I have managed to prepare a talk I am giving tomorrow. I haven’t any problems with using a mouse, so preparing slides was not a problem.

Anyway, expect shorter posts the coming week…