Archive for April, 2008

Preview of CLC Genomics Workbench

Thursday, April 10th, 2008

I’m not planning on turning my blog into a commercial for CLC Bio (with whom I have no affiliation at all, trust me), but I’d like to share this video, which previews their new software, to be released shortly:

It’s not something I think I will be using myself (I do mainly theoretical work and on the rare occasions when I have to analyse “real data” I usually need custom software anyway), but the presenter, Mikkel Nygaard Ravn, is an old friend from my computer science days. I haven’t seen him for ages, so it was fun watching him on YouTube and I just had to link to it.

Two posts you should read

Tuesday, April 8th, 2008

Here are two posts on Genetic Future that I think you should read:

Both posts tell stories of how genome wide association searches are harder than we hoped.  A post I’ve linked to before — Why do genome-wide scans fail? — tries to explain why.

I might return to the height stuff later — in our group we have analysed the DeCODE dataset using our multi-locus methods — but I’ll have to leave that for later, ’cause I am late for work now…

What can we learn from genome-wide association studies conducted so far?

Monday, April 7th, 2008

Genome-wide association studies rely on the “common disease / common variant” hypothesis: that the major genetic effect of a common disease (major effect at the population level, not necessarily for the individual) is caused by a few genetic variants that are common in the population. It is of course a parsimonious explanation, explaining the high disease frequency with a few common variants rather than many rare variants, but there are also population genetics arguments in favour of this hypothesis. Is it true, then? Can we conclude that from the association mapping studies published over the last year and a half? If we had found absolutely nothing we would probably reject it, but we have found something, just not enough to explain all the genetic contribution to the diseases we have studied, so we haven’t really answered the question yet.

This paper makes an attempt at answering the question:

What Can Genome-Wide Association Studies Tell Us about the Genetics of Common Disease

Mark M. Iles

PLoS Genet 4(2): e33. doi:10.1371/journal.pgen.0040033

Abstract

The success of genome-wide association studies relies on much of the risk of common diseases being due to common genetic variants; but evidence for this is inconclusive. The results of published genome-wide association studies are examined to see what can be learnt about the distribution of disease-associated variants and how this might influence future study design. Although replicated disease-associated variants tend to be very common and frequency is inversely correlated with estimated effect size, our simulations suggest that such observations are the result of power. We find that for studies conducted to date, the frequency and effect size of significantly associated alleles are likely to be similar to those of the underlying disease alleles that they represent. Little of the genetic variation of disease has been explained so far, but current studies are only adequately powered to detect very common alleles unless they greatly increase disease risk. Thus, although the truth of the common disease / common variant hypothesis remains undecided, recent successes suggest that there are many more common genetic disease-associated variants, requiring larger studies to be identified.

First, the author notices that there is a negative correlation between the strength of the genetic effect and the allele frequency of the increased-risk variants in the published studies. One could argue that selection is the cause of this: if there is selection against the disease, then the disease variant will be kept at low frequencies. But there is also a negative correlation between the minor allele frequency of the disease marker and the genetic effect even when the increased-risk allele is the major allele, which could instead suggest that the correlation is caused by the statistical power to detect low-frequency disease markers: only for high genetic effects have we observed any.

I’m not completely comfortable with this argument myself. At the very least, it should be argued that the effect is not simply a consequence of the at-risk allele being the minor allele more often than expected by chance. Anyway, the argument is not essential for the rest of the paper, where the power question is addressed.

The big question is: would we see different distributions of discovered disease allele frequencies if the main genetic component is a few common variants (common disease/common variant) or if the genetic component is caused by several low-frequency variants?

The question is addressed through a simulation study, where data is simulated with i) mainly low-frequency disease variants, ii) some low-frequency and some high-frequency disease variants, and iii) mainly high-frequency variants. An association test is performed, and the frequencies of the significant markers are examined. If there is a difference, the distribution that looks the most like the observed distribution from existing studies would be the most likely explanation for the real genetic architecture underlying common diseases.

As it turns out, the frequencies of the detected markers are not different under the three setups unless either the genetic effect is strong (genetic relative risk GRR >= 2) or the sample size is large (n=3000). While we have several studies with high enough sample sizes, the effect sizes we have seen so far are rather small (GRR from 1.1 to 1.5 or so), so we are in the range where we might be able to see a difference, but not where we are guaranteed to see it.

In other words: with the sample sizes we have used so far, we do not really have the power to detect the rare variants that would tell us if the common disease/common variant hypothesis is true or not. Regardless of whether it is true, or whether more low-frequency alleles contribute the major part of the genetic component of a disease, we would see the same distribution of frequencies as we have observed so far.

I will conclude with the final paragraph from the paper:

For now, it is unlikely that much can be inferred about the CDCV hypothesis from the results of GWA studies. The successes in finding common variants associated with common diseases are encouraging, but, as our findings show, we cannot yet be sure whether the common disease-associated variants found so far represent the tip of the iceberg or the bottom of the barrel.

which is essentially where the post started…
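To make the power argument a bit more concrete, here is a rough back-of-the-envelope sketch. It is not the simulation from the paper. It assumes a multiplicative allelic model, controls whose allele frequency equals the population frequency, 3000 cases and 3000 controls, and a significance threshold of 5e-7; all of those numbers, and the function names, are my own choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(p, grr):
    """Risk-allele frequency among cases, assuming a multiplicative allelic model."""
    return p * grr / (p * grr + (1.0 - p))


def approx_power(p, grr, n_cases, n_controls, alpha=5e-7):
    """Normal-approximation power of a two-sided allelic test.

    Assumes controls carry the risk allele at the population frequency p.
    The threshold alpha is an illustrative choice, not taken from the paper.
    """
    q = case_allele_freq(p, grr)
    # Each individual contributes two alleles to the comparison.
    se = sqrt(q * (1 - q) / (2 * n_cases) + p * (1 - p) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(q - p) / se - z_crit)


# Common allele with a small effect, rare allele with a small effect,
# and rare allele with a strong effect.
for p, grr in [(0.30, 1.2), (0.02, 1.2), (0.02, 2.0)]:
    power = approx_power(p, grr, n_cases=3000, n_controls=3000)
    print(f"freq={p:.2f} GRR={grr:.1f} power={power:.3f}")
```

Under these assumptions a rare allele with a small effect is essentially invisible, while the same allele with GRR = 2 is detected most of the time, which is the pattern the post describes.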


Iles, M.M. (2008). What Can Genome-Wide Association Studies Tell Us about the Genetics of Common Disease. PLoS Genetics, 4(2), e33. DOI: 10.1371/journal.pgen.0040033 

Best comedian ever

Saturday, April 5th, 2008

Bayesian interaction mapping

Thursday, April 3rd, 2008

If you want to test for an association between a genetic marker and a disease phenotype, and you want to test this genome wide, you need to test a lot of markers. To separate random association from true association, you need to correct for the number of tests you are doing, and the more tests you are doing the more extreme your test statistic needs to be before you consider it a signal rather than random noise. For a genome wide test, you will need about half a million tests, and that is a lot of tests. It is nothing, however, compared to the number of tests you need if you also want to take interaction between markers into account. The number of pairs of markers grows as the number of markers squared, the number of triples as the number of markers cubed, and so on. If you want to test all pairs of 500,000 markers, you need 124,999,750,000 — almost 125 billion — tests. You need extremely low p-values to call anything significant with that many tests.
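The arithmetic is easy to check. The snippet below counts the pairs and shows what a simple Bonferroni correction would demand of a p-value; the family-wise level of 0.05 is my assumption, chosen only for illustration.

```python
from math import comb

n_markers = 500_000
n_pairs = comb(n_markers, 2)  # number of unordered marker pairs

family_wise_alpha = 0.05  # assumed overall error rate, for illustration only

# Bonferroni: divide the overall error rate by the number of tests.
single_marker_threshold = family_wise_alpha / n_markers
pair_threshold = family_wise_alpha / n_pairs

print(f"pairs of markers:        {n_pairs:,}")
print(f"single-marker threshold: {single_marker_threshold:.1e}")
print(f"pairwise threshold:      {pair_threshold:.1e}")
```

With these numbers the pairwise threshold ends up on the order of 1e-13, against 1e-7 for the single-marker scan.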

Wouldn’t it be great if you could get out of the multiple testing problem? Test everything in a single test and, as if by magic, highlight the probable marginal or interacting signals?

A Bayesian approach

Theoretically, that is possible through Bayesian statistics. There you can build a model that, instead of needing a lot of independent tests, directly tells you the probability that each marker, or combination of markers, contributes to the disease risk.

In this paper, they develop such a model:

Bayesian inference of epistatic interactions in case-control studies

Yu Zhang and Jun S Liu

Nature Genetics 39, 1167 – 1173 (2007)

 

Abstract

Epistatic interactions among multiple genetic variants in the human genome may be important in determining individual susceptibility to common diseases. Although some existing computational methods for identifying genetic interactions have been effective for small-scale studies, we here propose a method, denoted ‘bayesian epistasis association mapping’ (BEAM), for genome-wide case-control studies. BEAM treats the disease-associated markers and their interactions via a bayesian partitioning model and computes, via Markov chain Monte Carlo, the posterior probability that each marker set is associated with the disease. Testing this on an age-related macular degeneration genome-wide association data set, we demonstrate that the method is significantly more powerful than existing approaches and that genome-wide case-control epistasis mapping with many thousands of markers is both computationally and statistically feasible.

The model is very simple and is based on spotting differences in genotype frequencies in cases and controls. It splits the set of all markers into three sets: those with no association with the phenotype, those with marginal (no interaction) association with the disease, and those interacting. For those with no disease association, the model says that the genotype frequencies should be the same for cases and controls. For those with marginal effects, cases and controls have different distributions, but there are no interaction effects. The last class has frequencies that 1) differ between cases and controls and 2) for, say, pairs of markers, are not just the product of the marginal frequencies.

This isn’t different from any old frequentist approach, but there you would maximise the likelihood under different models and compare them. Alternatively, which is the Bayesian approach, you can give the frequencies a prior distribution (in a setting like this you would use the Dirichlet distribution which is the conjugate prior) and then integrate over all possible frequencies. This gives you the likelihood of the model, without the parameters (the frequencies).
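Integrating out the frequencies has a closed form when the prior is Dirichlet. As a minimal sketch (not the BEAM code), the function below computes the log marginal likelihood of a vector of genotype counts under a symmetric Dirichlet prior, and compares "cases and controls share frequencies" with "cases and controls have separate frequencies" for a single marker. The prior parameter 0.5, the function names and the example counts are my assumptions.

```python
from math import lgamma


def log_marginal(counts, alpha=0.5):
    """Log probability of the observed genotypes with frequencies integrated out.

    Uses a symmetric Dirichlet(alpha, ..., alpha) prior; alpha=0.5 is an
    assumed value, not necessarily the one used in the paper.
    """
    total_alpha = alpha * len(counts)
    result = lgamma(total_alpha) - lgamma(total_alpha + sum(counts))
    for c in counts:
        result += lgamma(alpha + c) - lgamma(alpha)
    return result


def log_bayes_factor(cases, controls, alpha=0.5):
    """Log Bayes factor: separate frequencies versus shared frequencies."""
    pooled = [a + b for a, b in zip(cases, controls)]
    separate = log_marginal(cases, alpha) + log_marginal(controls, alpha)
    return separate - log_marginal(pooled, alpha)


# Made-up genotype counts (AA, Aa, aa) for two markers.
print(log_bayes_factor([80, 15, 5], [30, 50, 20]))   # clearly different frequencies
print(log_bayes_factor([50, 40, 10], [50, 40, 10]))  # identical frequencies
```

A positive value favours different frequencies in cases and controls; a negative value favours the simpler shared model, which is what you get when the counts are identical.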

Of course, you still need to figure out how to partition the markers into the three classes. The approach they take to this is to construct an MCMC that can explore the space of possible splits in a way that lets you sample classifications from the distribution you want: the classifications are seen in the proportion that matches their probability of being correct.
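To show what "sampling classifications in proportion to their probability" means, here is a toy Metropolis sampler. It is a heavily simplified sketch and not the sampler from the paper: it only has two classes (no association and marginal association), it treats markers independently, and the prior probability of association, the Dirichlet parameter, the counts and all names are assumptions of mine.

```python
import math
import random


def log_marginal(counts, alpha=0.5):
    """Dirichlet-multinomial log marginal likelihood (symmetric prior, assumed alpha)."""
    total_alpha = alpha * len(counts)
    result = math.lgamma(total_alpha) - math.lgamma(total_alpha + sum(counts))
    for c in counts:
        result += math.lgamma(alpha + c) - math.lgamma(alpha)
    return result


def log_bayes_factor(cases, controls):
    """Separate frequencies for cases and controls versus shared frequencies."""
    pooled = [a + b for a, b in zip(cases, controls)]
    return log_marginal(cases) + log_marginal(controls) - log_marginal(pooled)


def sample_labels(markers, steps=6000, burn_in=1000, prior_assoc=0.5, seed=1):
    """Return, per marker, the fraction of kept samples labelled 'associated'."""
    rng = random.Random(seed)
    prior_log_odds = math.log(prior_assoc / (1.0 - prior_assoc))
    # Log posterior odds of label 1 (associated) versus label 0 (not associated).
    log_odds = [log_bayes_factor(ca, co) + prior_log_odds for ca, co in markers]
    labels = [0] * len(markers)
    hits = [0] * len(markers)
    kept = 0
    for step in range(steps):
        i = rng.randrange(len(markers))  # propose flipping one marker's label
        delta = log_odds[i] if labels[i] == 0 else -log_odds[i]
        if delta >= 0 or rng.random() < math.exp(delta):
            labels[i] = 1 - labels[i]  # Metropolis acceptance
        if step >= burn_in:
            kept += 1
            for j, label in enumerate(labels):
                hits[j] += label
    return [h / kept for h in hits]


# Made-up genotype counts (cases, controls) for two markers.
markers = [([80, 15, 5], [30, 50, 20]), ([50, 40, 10], [50, 40, 10])]
print(sample_labels(markers))
```

The fraction of samples in which a marker is labelled "associated" estimates its posterior probability of association. In this toy the markers do not interact, so the chain has an easy job; the hard part in the real model is the interaction class.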

Does it work, then?

The model, constructed this way, is very simple. It is all based on simple frequencies so there is nothing complicated going on there, and because of the conjugate Dirichlet priors, it is computationally efficient to compute the different configuration likelihoods, so you can very efficiently move through the state space.

I am still somewhat sceptical, though.

The state space is very large. I mean very very very large. Moving beyond astronomical numbers. Even economical numbers. Big! State! Space! How many ways can you group n elements in three groups? 3^n. That’s 3 to the power of n. If n is 100, that is about 5×10^47. If n is 500,000 it is as close to infinity as you could possibly want. It’s a pretty large state space to explore, right?
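If you want to see the numbers, here is a quick check. It counts labelled assignments of n markers to three classes; the helper name is mine.

```python
from math import floor, log10


def digits_in_state_space(n_markers, n_classes=3):
    """Number of decimal digits in n_classes ** n_markers."""
    return floor(n_markers * log10(n_classes)) + 1


print(3 ** 100)                        # exact count for 100 markers
print(digits_in_state_space(100))      # how many digits that number has
print(digits_in_state_space(500_000))  # digits for a genome-wide marker set
```

For 500,000 markers the number of classifications has a few hundred thousand digits, so printing the digit count is more practical than printing the number itself.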

Just because the state space is large doesn’t mean that we cannot explore the important parts of it, of course. We are not going to see most of the points in the state space, so we are going to conclude that a lot of states have zero probability, even if they probably do not, but that is not much of a problem if the probabilities are very small anyway. We only have a problem if we say that a state that should have a high (or moderate) probability actually has zero probability.

The cool thing about MCMCs is that they sometimes can explore the important parts of a state space, even if the state space is extremely large. The only reason they have any kind of success in this paper is exactly because MCMCs are cool this way.

The catch is that the state space needs to have some kind of structure to it for this to work. There needs to be some kind of “locality” (for lack of a better word) such that if you are close to a high likelihood state you are more likely to move into a more likely one than to an arbitrary one. If the states have probabilities essentially independent of their neighbours, then the MCMC is essentially just guessing at random, and if you guess at random you are not going to find, say, the two interacting markers out of the 125 billion pairs. You are not better off with an MCMC than with a deck of Tarot cards.

I think the approach described in the paper is very cool, but in all honesty I don’t think it will work.  I do not believe that there is sufficient structure in the data that it is possible to search the state space in this way.  I’ll be very happy to be proven wrong, though, ’cause it is methods like this that could get us out of “multiple testing hell”.


Zhang, Y., Liu, J.S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39(9), 1167-1173. DOI: 10.1038/ng2110