Posts Tagged ‘Paper reviews’

Bayesian Graphical Models for Genomewide Association Studies

Thursday, May 15th, 2008

ResearchBlogging.org
Lately I have been interested in Bayesian approaches for genome wide association mapping. These have the benefit of often being able to consider all data in a single model and from that score markers with their probability of being related to the disease without the usual multiple testing problems.

So for our journal club today, I picked the following paper for us:

Bayesian Graphical Models for Genomewide Association Studies
Verzilli, Stallard and Whittaker
American Journal of Human Genetics 79(1) 100-112

Abstract

As the extent of human genetic variation becomes more fully characterized, the research community is faced with the challenging task of using this information to dissect the heritable components of complex traits. Genomewide association studies offer great promise in this respect, but their analysis poses formidable difficulties. In this article, we describe a computationally efficient approach to mining genotype-phenotype associations that scales to the size of the data sets currently being collected in such studies. We use discrete graphical models as a data-mining tool, searching for single- or multilocus patterns of association around a causative site. The approach is fully Bayesian, allowing us to incorporate prior knowledge on the spatial dependencies around each marker due to linkage disequilibrium, which reduces considerably the number of possible graphical structures. A Markov chain–Monte Carlo scheme is developed that yields samples from the posterior distribution of graphs conditional on the data from which probabilistic statements about the strength of any genotype-phenotype association can be made. Using data simulated under scenarios that vary in marker density, genotype relative risk of a causative allele, and mode of inheritance, we show that the proposed approach has better localization properties and leads to lower false-positive rates than do single-locus analyses. Finally, we present an application of our method to a quasi-synthetic data set in which data from the CYP2D6 region are embedded within simulated data on 100K single-nucleotide polymorphisms. Analysis is quick (<5 min), and we are able to localize the causative site to a very short interval.

The idea in the paper is to capture the joint distribution of all genotypes and phenotypes in a graph model — that explicitly capture which model variables are independent and which are not — and from this read off the probability that each genotype is independent from the phenotype or not.

By putting a few restrictions on the topology of the kind of graphs they consider, it is possible to calculate the likelihood of any given graph by, essentially, running through the graph and calculate conditional probabilities of all cliques in the graph, where each clique is modeled as a multinomial distribution over genotypes, possibly together with phenotypes. With a small re-write, this becomes a product of independent probabilities divided by another product of independent probabilities.

p(C1) x p(C2) x … x p(CL) / p(S1) x p(S2) x … x p(SR)

Using Dirichlet priors, the parameters of the multinomial distributions can be integrated out, and the resulting expression is very fast to compute, making it feasible to sample over graphs in an MCMC.

This integral, though, I don’t think is correct. It comes down to those products of independent probabilities. These guys depend on the multinomial parameters, and while the terms in the nominator depend on disjoint parameters and the same for the denominator, there is an overlap in the parameters used in the nominator and denominator that introduces dependencies.

p(C1|θ) x p(C2|θ) x … x p(CL|θ) / p(S1|θ) x p(S2|θ) x … x p(SR|θ)

= p(C1|θ1) x p(C2|θ2) x … x p(CL|θL) / p(S1|θ1) x p(S2|θ2) x … x p(SR|θR)

When integrating over the set of parameters, θ, if these could be split into disjoint parameters for each of the independent probabilities, the integral could be solved for each term individually — which is easy with a conjugate prior. When the parameters cannot be split in disjoint sets — and here they cannot — then that trick doesn’t work.

In the paper they do it anyway, though.

This just means that their model introduces more independence than it was really supposed to, and that it the model they describe is not really the model they run the MCMC on, but all that really matters is how well the method identifies genotype-phenotype association, and that it seems to be pretty good at.


VERZILLI, C. (2006). Bayesian Graphical Models for Genomewide Association Studies. The American Journal of Human Genetics, 79(1), 100-112. DOI: 10.1086/505313

An MCMC-EM algorithm for statistical analysis of DNA evolution with neighbour-dependent substitution rates

Friday, May 2nd, 2008

ResearchBlogging.org

When estimating evolutionary parameters from DNA sequences, we usually assume that each site evolves independent of all the others. This is a choice of convenience — the mathematics is a lot simpler this way — and doesn’t actually match the data that well. In fact, we know several cases where the neighbouring sites greatly influence the evolutionary process. Just think of codons or CpG di-nucleotides.

The reason we like to consider the sites independent is that it gives us a very manageable model for the sequence evolution. Each site can be modelled as a continuous time finite state Markov model with four states and we can easily derive substitution probabilities from this. When the sites are not independent, then this idea breaks down. We cannot split the sequence into separate parts, so although we can still model the sequence evolution as a finite state continuous Markov model, the model now needs to consider the entire sequence as a whole. Instead of a 4-state Markov model we end up with a 4n-state Markov model, where n is the number of nucleotides. This is no longer all that manageable.

One of my colleagues at BiRC, Asger Hobolth, just published a paper addressing this problem:

A Markov Chain Monte Carlo Expectation Maximization Algorithm for Statistical Analysis of DNA Sequence Evolution with Neighbor-Dependent Substitution Rates

Asger Hobolth

Journal of Computational & Graphical Statistics, Volume 17, Number 1, 2008 , pp. 138-162(25)

Abstract

The evolution of DNA sequences can be described by discrete state continuous time Markov processes on a phylogenetic tree. We consider neighbor-dependent evolutionary models where the instantaneous rate of substitution at a site depends on the states of the neighboring sites. Neighbor-dependent substitution models are analytically intractable and must be analyzed using either approximate or simulation-based methods. We describe statistical inference of neighbor-dependent models using a Markov chain Monte Carlo expectation maximization (MCMC-EM) algorithm. In the MCMC-EM algorithm, the high-dimensional integrals required in the EM algorithm are estimated using MCMC sampling. The MCMC sampler requires simulation of sample paths from a continuous time Markov process, conditional on the beginning and ending states and the paths of the neighboring sites. An exact path sampling algorithm is developed for this purpose.

He did the work while at North Carolina State University, though, and not at BiRC, so I have no inside information and I just read the paper today and I’m in Oxford so I haven’t been able to discuss it with him yet. Anyway, I’ll try to briefly describe the results here, and maybe go back and correct any misunderstandings later, after I’ve discussed with Asger.

The problem with neighbour dependencies

The problem is actually quite easy to understand. If the substitution rate of one nucleotide depends on its neighbours — say CpG changes to TpG or CpA faster than C changes to T or G to A in general — then, when modelling the substitution rate of one of the nucleotides, I need to know what the second is. If the second nucleotide didn’t change over time, then there wouldn’t be a problem, and I could model the substitution rate of the first as a Markov chain and derive a 4×4 substitution probability matrix. But there is no reason to think that one nucleotide can evolve but the other will just stay put, so I need to model the evolution of both. At any point in time, I need to capture their joint state. So I would have to work with a 16×16 rate and probability matrix.

Of course, if there are three nucleotides, then the first and second might be dependent while the second and third are dependent, and that indirectly makes all three dependent of each other. I always need to model the full state, so I have 4n-states and need to work with a 4nx4n rate matrix.

Theoretically, I just need to exponentiate the rate matrix, but in practise you cannot exponentiate a 4nx4n matrix for any realistic sequence length n.

Sampling your way out of the problem

Asger solves this problem by calculating substitution probabilities by sampling instead of matrix exponentiation. Sampling doesn’t get rid of the problem of having to model the state of the entire string over time, of course, so he needs two tricks.

First he derives a way of sampling the evolutionary history of one nucleotide, conditional on the history of its neighbours. Some of the details here are not clear to me yet, so I’ll refrain from saying more — I might add details later after I’ve talked to Asger. Anyway, he derives a way of sampling the evolution of a single nucleotide, conditional on all other histories.

Secondly, and this now is the easy part, he constructs an MCMC to then sample from all histories. This is easy because if you can sample from conditionals like he can, then you can just use the Gibbs sampling machinery and Bob’s your uncle.

Ok, so now he can sample paths between two DNA sequences which means he can calculate expectations over the set of all paths. From these expectations he can then estimate maximum likelihood parameters using the expectation maximisation (EM) machinery.

There’s a few more tricks needed to deal with multiple sequences, but essentially, that is the idea.

No more from me now, I’m off to the pub with the Oxford guys…


Hobolth, A. (2008). A Markov Chain Monte Carlo Expectation Maximization Algorithm for Statistical Analysis of DNA Sequence Evolution with Neighbor-Dependent Substitution Rates. Journal of Computational & Graphical Statistics, 17(1), 138-162. DOI: 10.1198/106186008X289010

What can we learn from genome-wide association studies conducted so far?

Monday, April 7th, 2008

ResearchBlogging.orgGenome-wide association studies rely on the “common disease / common variant” hypothesis: that the major genetic effect of a common disease (major effect at the population level, not necessarily for the individual) is caused by a few genetic variants that are common in the population. It is of course a parsimonious explanation, explaining the high disease frequency with a few common variants rather than many rare variants, but there are also population genetics arguments in favour of this hypothesis. Is it true, then? Can we conclude that, from the association mapping studies published the last year and a half? If we had found absolutely nothing we would probably reject it, but we have found something, just not enough to explain all the genetic contribution to the diseases we have studied, so we haven’t really answered the question yet.This paper makes an attempt at answering the question:

What Can Genome-Wide Association Studies Tell Us about the Genetics of Common DiseaseMark M. Iles. PLoS Genet 4(2): e33. doi:10.1371/journal.pgen.0040033

Abstract

The success of genome-wide association studies relies on much of the risk of common diseases being due to common genetic variants; but evidence for this is inconclusive. The results of published genome-wide association studies are examined to see what can be learnt about the distribution of disease-associated variants and how this might influence future study design. Although replicated disease-associated variants tend to be very common and frequency is inversely correlated with estimated effect size, our simulations suggest that such observations are the result of power. We find that for studies conducted to date, the frequency and effect size of significantly associated alleles are likely to be similar to those of the underlying disease alleles that they represent. Little of the genetic variation of disease has been explained so far, but current studies are only adequately powered to detect very common alleles unless they greatly increase disease risk. Thus, although the truth of the common disease / common variant hypothesis remains undecided, recent successes suggest that there are many more common genetic disease-associated variants, requiring larger studies to be identified.

First, the author notices that there is a negative correlation between the strength of the genetic and the allele frequency of the increased-risk variants in the published studies. One could argue that selection is the cause of this: if there is selection against the disease, then the disease variant will be kept to low frequencies, but there is also a negative correlation between the minor allele frequency of the disease marker and the genetic effect even when the increased-risk allele is the major allele, which could instead suggest that the correlation is caused by the statistical power to detect low-frequency disease markers: only for high genetic effects have we observed any.I’m not completely comfortable with this argument myself. At the very least, it should be argued that the effect is not simply a consequence of the at risk allele being the minor allele more often than by chance, but anyway the argument is not essential for the rest of the paper, where the power question is addressed.The big question is: would we see different distributions of discovered disease allele frequencies if the main genetic component is a few common variants (common disease/common variant) or if the genetic component was caused by several low frequency variants?The question is addressed through a simulation study, where data is simulated with i) mainly low-frequent disease variants, ii) some low-frequent and some high-frequent disease variants, and iii) mainly high-frequent variants. An association test is performed, and the frequencies of the significant markers is examined.If there is a difference, the distribution that looks the most like the observed distribution from existing studies would be the most likely explanation for the real genetic architecture underlying common diseases.As it turns out, the frequencies of the detected markers are not different under the three setups unless either the genetic effect is strong (genetic relative risk GRR >= 2) or the sample size is large (n=3000).  While we have several studies with high enough sample sizes, the effect sizes we have seen so far are rather small (GRR from 1.1 to 1.5 or so), so we are in the ranger where we might be able to see a difference, but not where we are guaranteed to see it.In other words: with the sample size we have used so far, we do not really have the power to detect the rare variants that would tell us if the common disease/common variant hypothesis is true or not.  Regardless of whether it is true, or whether more low-frequency alleles contribute the major part to the genetic component of a disease, we would see the same distribution of frequencies as we have observed so far.I will conclude with the final paragraph from the paper:

For now, it is unlikely that much can be inferred about the CDCV hypothesis from the results of GWA studies. The successes in finding common variants associated with common diseases are encouraging, but, as our findings show, we cannot yet be sure whether the common disease-associated variants found so far represent the tip of the iceberg or the bottom of the barrel.

which is essentially where the post started…


Iles, M.M. (2008). What Can Genome-Wide Association Studies Tell Us about the Genetics of Common Disease. PLoS Genetics, 4(2), e33. DOI: 10.1371/journal.pgen.0040033 

Bayesian interaction mapping

Thursday, April 3rd, 2008

ResearchBlogging.org
If you want to test for an association between a genetic marker and a disease phenotype, and you want to test this genome wide, you need to test a lot of markers. To separate random association from true association, you need to correct for the number of tests you are doing, and the more tests you are doing the more extreme your test statistic needs to be before you consider it a signal rather than random noise. For a genome wide test, you will need about half a million tests, and that is a lot of tests. It is nothing, however, compared to the number of tests you need if you also want to take interaction between markers into account. The number of pairs of markers grows as the number of markers squared, the number of triples as the number of markers cubed, and so on. If you want to test all pairs of 500,000 markers, you need 124,999,750,000 — almost 125 billion — tests. You need extremely low p-values to call anything significant with that many tests.

Wouldn’t it be great if you could get out of the multiple testing problem? Test everything in a single test and, as if by magic, highlight the probable marginal or interacting signals?

A Bayesian approach

Theoretically, that is possible through Bayesian statistics. There you can build a model that, instead of needing a lot of independent tests, directly tells you the probability that each marker, or combination of markers, contributes to the disease risk.

In this paper, they develop such a model:

Bayesian inference of epistatic interactions in case-control studies

Yu Zhang and Jun S Liu

Nature Genetics 39, 1167 – 1173 (2007)

 

Abstract

Epistatic interactions among multiple genetic variants in the human genome may be important in determining individual susceptibility to common diseases. Although some existing computational methods for identifying genetic interactions have been effective for small-scale studies, we here propose a method, denoted ‘bayesian epistasis association mapping’ (BEAM), for genome-wide case-control studies. BEAM treats the disease-associated markers and their interactions via a bayesian partitioning model and computes, via Markov chain Monte Carlo, the posterior probability that each marker set is associated with the disease. Testing this on an age-related macular degeneration genome-wide association data set, we demonstrate that the method is significantly more powerful than existing approaches and that genome-wide case-control epistasis mapping with many thousands of markers is both computationally and statistically feasible.

The model is very simple and is based on spotting differences in genotype frequencies in cases and controls. It splits the set of all markers into three sets: those with no association with the phenotype, those with marginal (no interaction) association with the disease, and those interacting. For those with no disease association, the model says that the genotype frequencies should be the same for cases and controls. For those with marginal effects, cases and controls have different distributions, but there is no interaction effects, and the last class has frequencies that are 1) different between cases and controls and 2) frequencies for, say, pairs of markers are not just the product of the marginal frequencies.

This isn’t different from any old frequentist approach, but there you would maximise the likelihood under different models and compare them. Alternatively, which is the Bayesian approach, you can give the frequencies a prior distribution (in a setting like this you would use the Dirichlet distribution which is the conjugate prior) and then integrate over all possible frequencies. This gives you the likelihood of the model, without the parameters (the frequencies).

Of course, you still need to figure out how to partition the markers into the three classes. The approach they take to this is to construct an MCMC that can explore the space of possible splits in a way that lets you sample classifications from the distribution you want: the classifications are seen in the proportion that matches their probability of being correct.

Does it work, then?

The model, constructed this way, is very simple. It is all based on simple frequencies so there is nothing complicated going on there, and because of the conjugate Dirichlet priors, it is computationally efficient to compute the different configuration likelihoods, so you can very efficiently move through the state space.

I am still somewhat sceptical, though.

The state space is very large. I mean very very very large. Moving beyond astronomical numbers. Even economical numbers. Big! State! Space! How many ways can you group n elements in three groups? 3n. That’s 3 to the power of n. If n is 100, that is about 5×1047. If n is 500,000 it is as close to infinity as you could possibly want. It’s a pretty large state space to explore, right?

Just because the state space is large doesn’t mean that we cannot explore the important parts of it, of course. We are no going to see most of the points in the state space, so we are going to conclude that a lot of states have zero probability, even if they probably do not, but that is not much of a problem if the probabilities are very small anyway. We only have a problem if we say that at state that should have a high (or moderate) probability actually has zero probability.

The cool thing about MCMCs is that the sometimes can explore the important parts of a state space, even if the state space is extremely large. The only reason they have any kind of success in this paper is exactly because MCMCs are cool this way.

The catch is that the state space needs to have some kind of structure to it for this to work.  There needs to be some kind of “locality” (in lack of a better word) such that if you are close to a high likelihood state you are more likely to move into a more likely one that to an arbitrary one.  If the states have probabilities essentially independent of their neighbours, then the MCMC is essentially just guessing at random, and if you guess at random you are not going to find, say, the two interacting markers out of the 125 billion pairs.  You are not better off with an MCMC than with a deck of Tarot cards.

I think the approach described in the paper is very cool, but in all honesty I don’t think it will work.  I do not believe that there is sufficient structure in the data that it is possible to search the state space in this way.  I’ll be very happy to be proven wrong, though, ’cause it is methods like this that could get us out of “multiple testing hell”.


Zhang, Y., Liu, J.S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39(9), 1167-1173. DOI: 10.1038/ng2110

Entropy and epistasis

Wednesday, March 26th, 2008

ResearchBlogging.orgFor our journal club tomorrow we are reading yet another paper on gene-gene interaction in association mapping. This time, a rather short and easy paper:

Exploration of gene-gene interaction effects using entropy-based methods
Dong et al.
European Journal of Human Genetics (2008) 16, 229–235; doi:10.1038/sj.ejhg.5201921

Abstract

Gene–gene interaction may play important roles in complex disease studies, in which interaction effects coupled with single-gene effects are active. Many interaction models have been proposed since the beginning of the last century. However, the existing approaches including statistical and data mining methods rarely consider genetic interaction models, which make the interaction results lack biological or genetic meaning. In this study, we developed an entropy-based method integrating two-locus genetic models to explore such interaction effects. We performed our method to simulated and real data for evaluation. Simulation results show that this method is effective to detect gene–gene interaction and, furthermore, it is able to identify the best-fit model from various interaction models. Moreover, our method, when applied to malaria data, successfully revealed negative epistatic effect between sickle cell anemia and α+-thalassemia against malaria.

In this paper they use (information theoretic) entropy measures to detect pairwise gene-gene interaction in disease association. In information theory, entropy is a measure of uncertainty. The less certain you are of the outcome of an experiment, the higher the entropy of the experiment. If you are flipping an unbiased coin, the chances of head or tail are 50/50 and you have maximal entropy, but if the coin is biased, you expect, say, heads to come up more often than tail, and you have less entropy. In the extreme case where you are guaranteed, say, head, the outcome is certain and the entropy is minimal. Zero, in fact.

Mathematically, if the probability of head is p, then the entropy of the coin flipping is H = p log p + (1-p)log(1-p).

If you sample an individual from the population and test if he has a certain disease, he might have that with a probability p. This isn’t that different from flipping a coin (although it would probably be a pretty biased coin for most diseases). So again you can talk about the entropy, and the formula is, of course, the same as above.

Now comes the interesting part. If you know the genotype of the individual, does that then influence the entropy of the event? Do you gain any information about disease status from knowing the genotype?

If we take the entropy for the disease fraction, but for each genotype in isolation (so we get a risk pAA for genotype AA, a risk pAa for genotype Aa, and a risk paa for genotype aa and can calculate the entropy for each of these using the formula above) and we then take the weighted average of entropies, weighted with the genotype frequencies, then we get the entropy conditional on the genotype. Comparing this entropy with the entropy when we do not know the genotype will tell us if we gain anything from knowing the genotype. If we do, then we have a genotype/phenotype association.

If we have two genotypes, we can compare the entropy when we know the combined genotype against the entropy when we know either one alone. If the combined genotype has more information than the most informative marginal genotype (i.e. less entropy than the marginal with less entropy), then there must be some interaction.

It is as simple as that. I am a bit surprised someone hasn’t done this ages ago.

Of course, there are some serious limitations with the method that might explain this.

First of all, there is the problem with distinguishing between random signals and true signals.  Even with no interaction, there is not exactly zero information gain in knowing the pairwise genotypes.  By chance, there will be different values, and you need to know when the information gain is significant.  To figure this out, they use a permutation test.  They resample from their data and that way figure out the distribution of information gain when there is no real association, and from that they can figure out  how significant the information gain is.

The problem with this is that it can be very slow.  The more significant an event needs to be before you trust it, the more you have to sample.  If you need an event to happen less than 1 in 100, you need to sample at least 100 times without seeing it to conclude that. If we need the event to occur less than once in 10,000 we must sample 10,000 times.

Still, the method is fine for detecting interaction for two given markers, but the typical situation is, of course, that you have a lot of markers and you want to figure out which are interacting. To figure that out, we need to test them all, at least unless we can rule some pairs out somehow.  With N markers, there are N(N-1)/2 pairs.  If you are looking genome wide, N would be around 500,000 which would give you 124,999,750,000 pairs.  That is a lot of pairs.

It is probably a problem for all methods to test that many pairs — at least I cannot think of any method that wouldn’t choke on it — so to be fair let us assume that we have reduced it to just one million pairs.  Then you are performing one million tests.  Now we run into the multiple testing problem. If an event happens once in ten thousand by chance, it will happen about a hundred times in a million tests.

Now we see the problem with determining significance using a permutation test.  We need to correct for multiple tests, and a  lot of multiple tests, so we need the events we consider significant to be very rare indeed.  This means that we need to sample very many times to determine that an event is significant.  This can be a very serious limitation with this method.

It might be possible to determine significance some other way, in which case the interaction test in this paper could be useful for finding gene-gene interaction, but I am sceptical as long as it relies on a permutation test…


Dong, C., Chu, X., Wang, Y., Wang, Y., Jin, L., Shi, T., Huang, W., Li, Y. (2008). Exploration of gene-gene interaction effects using entropy-based methods. European Journal of Human Genetics, 16(2), 229-235. DOI: 10.1038/sj.ejhg.5201921