Posts Tagged ‘epistasis’

Entropy and epistasis

Wednesday, March 26th, 2008

For our journal club tomorrow we are reading yet another paper on gene-gene interaction in association mapping. This time, a rather short and easy paper:

Exploration of gene-gene interaction effects using entropy-based methods
Dong et al.
European Journal of Human Genetics (2008) 16, 229–235; doi:10.1038/sj.ejhg.5201921

Abstract

Gene–gene interaction may play important roles in complex disease studies, in which interaction effects coupled with single-gene effects are active. Many interaction models have been proposed since the beginning of the last century. However, the existing approaches including statistical and data mining methods rarely consider genetic interaction models, which make the interaction results lack biological or genetic meaning. In this study, we developed an entropy-based method integrating two-locus genetic models to explore such interaction effects. We performed our method to simulated and real data for evaluation. Simulation results show that this method is effective to detect gene–gene interaction and, furthermore, it is able to identify the best-fit model from various interaction models. Moreover, our method, when applied to malaria data, successfully revealed negative epistatic effect between sickle cell anemia and α+-thalassemia against malaria.

In this paper they use (information theoretic) entropy measures to detect pairwise gene-gene interaction in disease association. In information theory, entropy is a measure of uncertainty. The less certain you are of the outcome of an experiment, the higher the entropy of the experiment. If you are flipping an unbiased coin, the chances of heads or tails are 50/50 and you have maximal entropy, but if the coin is biased, you expect, say, heads to come up more often than tails, and you have less entropy. In the extreme case where you are guaranteed, say, heads, the outcome is certain and the entropy is minimal. Zero, in fact.

Mathematically, if the probability of heads is p, then the entropy of the coin flip is H = -p log p - (1-p) log(1-p).
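As a small illustration of the coin example, here is a sketch of the binary entropy H = -p log p - (1-p) log(1-p). The base-2 logarithm (entropy in bits) and the function name `binary_entropy` are my choices for the example, not something fixed by the paper.

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of an event that happens with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Fair coin, biased coin, and a coin that always lands heads.
for p in (0.5, 0.9, 1.0):
    print(p, binary_entropy(p))
```

The fair coin gives the maximum of one bit, and the entropy shrinks towards zero as the coin becomes more biased.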

If you sample an individual from the population and test if he has a certain disease, he might have that with a probability p. This isn't that different from flipping a coin (although it would probably be a pretty biased coin for most diseases). So again you can talk about the entropy, and the formula is, of course, the same as above.

Now comes the interesting part. If you know the genotype of the individual, does that then influence the entropy of the event? Do you gain any information about disease status from knowing the genotype?

If we take the entropy for the disease fraction, but for each genotype in isolation (so we get a risk pAA for genotype AA, a risk pAa for genotype Aa, and a risk paa for genotype aa and can calculate the entropy for each of these using the formula above) and we then take the weighted average of entropies, weighted with the genotype frequencies, then we get the entropy conditional on the genotype. Comparing this entropy with the entropy when we do not know the genotype will tell us if we gain anything from knowing the genotype. If we do, then we have a genotype/phenotype association.
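The weighted average described above can be sketched as follows. The input format (a dictionary from genotype to case and control counts), the function names, and the counts themselves are made up for illustration, and the entropy is measured in bits by assumption.

```python
import math

def entropy(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(counts):
    """counts maps genotype -> (cases, controls).

    Returns H(disease) - H(disease | genotype).
    """
    total = sum(cases + controls for cases, controls in counts.values())
    all_cases = sum(cases for cases, _ in counts.values())
    h_disease = entropy(all_cases / total)
    # Entropy within each genotype, weighted by the genotype frequency.
    h_conditional = sum(
        (cases + controls) / total * entropy(cases / (cases + controls))
        for cases, controls in counts.values()
        if cases + controls > 0
    )
    return h_disease - h_conditional

# Hypothetical counts: the risk increases with the number of A alleles.
counts = {"AA": (30, 10), "Aa": (40, 40), "aa": (30, 50)}
print(information_gain(counts))
```

If every genotype has the same fraction of cases, the gain is zero: knowing the genotype tells you nothing about disease status.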

If we have genotypes at two markers, we can compare the entropy when we know the combined genotype against the entropy when we know either one alone. If the combined genotype has more information than the most informative marginal genotype (i.e. less entropy than the marginal with less entropy), then there must be some interaction.
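Here is a sketch of that comparison as I have described it, which is not necessarily the exact statistic used in the paper. The names `marginal` and `interaction_gain` and the counts are hypothetical; the example table is constructed so that neither marker is informative alone.

```python
import math
from collections import defaultdict

def entropy(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(counts):
    """counts maps a genotype key -> (cases, controls)."""
    total = sum(c + k for c, k in counts.values())
    h_disease = entropy(sum(c for c, _ in counts.values()) / total)
    h_conditional = sum((c + k) / total * entropy(c / (c + k))
                        for c, k in counts.values() if c + k > 0)
    return h_disease - h_conditional

def marginal(counts, locus):
    """Collapse two-locus counts onto one locus (0 for A, 1 for B)."""
    collapsed = defaultdict(lambda: [0, 0])
    for genotypes, (cases, controls) in counts.items():
        collapsed[genotypes[locus]][0] += cases
        collapsed[genotypes[locus]][1] += controls
    return {g: tuple(v) for g, v in collapsed.items()}

def interaction_gain(counts):
    """Joint information gain minus the best single-marker gain."""
    best_single = max(information_gain(marginal(counts, 0)),
                      information_gain(marginal(counts, 1)))
    return information_gain(counts) - best_single

# Made-up counts: cases pile up when the two genotypes "match".
counts = {("AA", "BB"): (40, 10), ("AA", "bb"): (10, 40),
          ("aa", "BB"): (10, 40), ("aa", "bb"): (40, 10)}
print(interaction_gain(counts))
```

In this table each marker on its own has a 50/50 split of cases and controls, so all of the information sits in the combination.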

It is as simple as that. I am a bit surprised someone hasn't done this ages ago.

Of course, there are some serious limitations with the method that might explain this.

First of all, there is the problem of distinguishing between random signals and true signals. Even with no interaction, the information gain from knowing the pairwise genotypes is not exactly zero. It fluctuates by chance, and you need to know when the information gain is significant. To figure this out, they use a permutation test. They resample from their data and that way figure out the distribution of information gain when there is no real association, and from that they can figure out how significant the observed information gain is.
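A minimal sketch of such a permutation test, assuming the disease labels are shuffled while the genotypes stay fixed. The details (the names, the "+1" correction in the p-value estimate, and the toy data) are my assumptions and not taken from the paper.

```python
import math
import random
from collections import Counter

def entropy(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(genotypes, status):
    """genotypes: one label per individual; status: 1 for case, 0 for control."""
    n = len(status)
    sizes = Counter(genotypes)
    cases = Counter(g for g, s in zip(genotypes, status) if s)
    h_conditional = sum(size / n * entropy(cases[g] / size)
                        for g, size in sizes.items())
    return entropy(sum(status) / n) - h_conditional

def permutation_p_value(genotypes, status, n_perm=1000, seed=0):
    """Estimate how often shuffled labels give at least the observed gain."""
    rng = random.Random(seed)
    observed = information_gain(genotypes, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype/disease association
        if information_gain(genotypes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 100 individuals with a strong genotype/disease association.
genotypes = ["AA"] * 50 + ["aa"] * 50
status = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
print(permutation_p_value(genotypes, status, n_perm=200))
```

Note that the smallest p-value this can report is 1 / (n_perm + 1), which is exactly why the number of permutations matters so much in what follows.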

The problem with this is that it can be very slow.  The more significant an event needs to be before you trust it, the more you have to sample.  If you need an event to happen less than 1 in 100, you need to sample at least 100 times without seeing it to conclude that. If we need the event to occur less than once in 10,000 we must sample 10,000 times.

Still, the method is fine for detecting interaction for two given markers, but the typical situation is, of course, that you have a lot of markers and you want to figure out which are interacting. To figure that out, we need to test them all, at least unless we can rule some pairs out somehow.  With N markers, there are N(N-1)/2 pairs.  If you are looking genome wide, N would be around 500,000 which would give you 124,999,750,000 pairs.  That is a lot of pairs.
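The pair count is easy to check. This just evaluates N(N-1)/2 for the genome-wide marker count mentioned above.

```python
from math import comb

n_markers = 500_000
n_pairs = comb(n_markers, 2)  # N choose 2 = N(N-1)/2
print(f"{n_pairs:,}")  # 124,999,750,000
```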

It is probably a problem for all methods to test that many pairs -- at least I cannot think of any method that wouldn't choke on it -- so to be fair let us assume that we have reduced it to just one million pairs.  Then you are performing one million tests.  Now we run into the multiple testing problem. If an event happens once in ten thousand by chance, it will happen about a hundred times in a million tests.

Now we see the problem with determining significance using a permutation test.  We need to correct for multiple tests, and a  lot of multiple tests, so we need the events we consider significant to be very rare indeed.  This means that we need to sample very many times to determine that an event is significant.  This can be a very serious limitation with this method.
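To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a Bonferroni correction with a family-wise level of 0.05, and that you need about 1/threshold permutations to resolve a p-value of that size; both are my assumptions for illustration, not choices made in the paper.

```python
import math
from fractions import Fraction

def expected_false_positives(n_tests, per_test_rate):
    """Expected number of chance hits if every test is a true null."""
    return n_tests * per_test_rate

def permutations_needed(n_tests, alpha=Fraction(1, 20)):
    """Permutations per test needed to reach a Bonferroni-corrected level."""
    threshold = alpha / n_tests  # Bonferroni: divide alpha by the number of tests
    return math.ceil(1 / threshold)

n_tests = 1_000_000
print(expected_false_positives(n_tests, Fraction(1, 10_000)))  # 100
print(permutations_needed(n_tests))                            # 20000000
```

That is twenty million permutations for each of the million pairs, unless you can stop early for pairs that are clearly not significant.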

It might be possible to determine significance some other way, in which case the interaction test in this paper could be useful for finding gene-gene interaction, but I am sceptical as long as it relies on a permutation test...


Dong, C., Chu, X., Wang, Y., Wang, Y., Jin, L., Shi, T., Huang, W., Li, Y. (2008). Exploration of gene-gene interaction effects using entropy-based methods. European Journal of Human Genetics, 16(2), 229-235. DOI: 10.1038/sj.ejhg.5201921

Statistical power and interacting genes

Sunday, March 23rd, 2008

Earlier this week we discussed the paper below in our association mapping journal club. Lately we have been interested in epistasis (gene-gene interaction) in the context of association mapping -- we have just submitted a paper on the subject and have a few projects in the pipeline working on this problem -- and one problem that concerns us is the power of detecting gene-gene interaction in association mapping. This paper turned out not to really be about that, but it was interesting nonetheless.

Anyway, back to the paper:

Power of genome-wide association studies in the presence of interacting loci
Joseph Pickrell, Françoise Clerget-Darpoux, Catherine Bourgain
Genetic Epidemiology 31(7) 748 - 762

Abstract

Though multiple interacting loci are likely involved in the etiology of complex diseases, early genome-wide association studies (GWAS) have depended on the detection of the marginal effects of each locus. Here, we evaluate the power of GWAS in the presence of two linked and potentially associated causal loci for several models of interaction between them and find that interacting loci may give rise to marginal relative risks that are not generally considered in a one-locus model. To derive power under realistic situations, we use empirical data generated by the HapMap ENCODE project for both allele frequencies and linkage disequilibrium (LD) structure. The power is also evaluated in situations where the causal single nucleotide polymorphisms (SNPs) may not be genotyped, but rather detected by proxy using a SNP in LD. A common simplification for such power computations assumes that the sample size necessary to detect the effect at the tSNP is the sample size necessary to detect the causal locus directly divided by the LD measure r2 between the two. This assumption, which we call the proportionality assumption, is a simplification of the many factors that contribute to the strength of association at a marker, and has recently been criticized as unreasonable (Terwilliger and Hiekkalinna [2006] Eur J Hum Genet 14(4):426-437), in particular in the presence of interacting and associated loci. We find that this assumption does not introduce much error in single locus models of disease, but may do so in certain two-locus models.

The problem considered in the paper is the following: If we are searching for gene-disease association and the disease risk depends on an interaction between two variants, will we be able to detect it? I'm simplifying a bit here, but that is the essential question.

Testing single markers

The typical approach for finding genes that affect the disease risk, when analysing the entire genome in any case, is to go through each typed variant and test if the cases and controls have different distributions of genotype frequencies. I've described this in a bit more detail in an earlier post, so I won't say much more on that here.
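One common way to do such a test is a chi-square test on the 2x3 table of genotype counts, sketched below. The choice of test, the function name, and the counts are assumptions for illustration. With two degrees of freedom the p-value has the closed form exp(-x/2), so no statistics library is needed. The sketch assumes every genotype is observed at least once.

```python
import math

def genotype_chi_square(cases, controls):
    """cases, controls: counts for the genotypes (aa, Aa, AA).

    Returns the chi-square statistic and its p-value (2 degrees of freedom).
    """
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for case_count, control_count in zip(cases, controls):
        genotype_total = case_count + control_count
        for observed, group_total in ((case_count, n_cases),
                                      (control_count, n_controls)):
            # Expected count if genotype and disease status were independent.
            expected = group_total * genotype_total / n
            stat += (observed - expected) ** 2 / expected
    p_value = math.exp(-stat / 2)  # chi-square survival function for 2 df
    return stat, p_value

# Hypothetical counts for one marker.
print(genotype_chi_square(cases=(20, 50, 30), controls=(40, 45, 15)))
```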

The power to detect an association when it is there depends on several parameters, such as the allele frequencies, the sample size, and of course the strength of the effect the genotype has on the disease risk, typically measured by the genetic relative risk GRR. For a biallelic marker (what we typically consider), we can consider the risk of genotype aa the "basic" risk (GRRaa=1) and talk about the relative risk of Aa and AA, GRRAa and GRRAA. Different "disease models" put constraints on these, e.g. a model where the risk allele A is recessive would have GRRaa=GRRAa=1 != GRRAA, but in general there are two risks that can vary in relation to the basic risk.

Gene-gene interaction

Now, if the disease risk depends on several markers, you can have various kinds of interaction. For two markers, you now have nine genotypes, {aa,Aa,AA}x{bb,Bb,BB}, with eight GRRs that can vary in relation to GRRaabb. Again, various "classical" disease models can put constraints on the GRRs.

The problem they consider in the paper is such a pair-wise interaction setup (with four different disease models), and how the power of detecting an association depends on the GRRs, disease model, allele frequencies, etc.

Detecting an association, here, means detecting an association at A or B (or both), but not detecting the right disease model, or detecting that there is really an interaction going on; it is still considered a "hit" if only one of the two markers is found to be associated with the disease. I'll get back to that below.

The way they go about this is to calculate the marginal GRRs, i.e. the relative risks of AA and Aa when ignoring the B marker, and the GRRs of BB and Bb when ignoring the A marker. These marginal GRRs are, of course, affected by the (interaction) disease model, GRRs of the interacting pair, frequencies, etc, but once the marginal GRRs have been calculated, the power of detection can be computed as if no interaction was going on.
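A simplified sketch of the marginal GRR calculation. It assumes the two loci are independent and in Hardy-Weinberg equilibrium, which is a stronger simplification than the paper makes (they consider linked loci in LD). The GRR table and allele frequency are made up.

```python
def hwe_frequencies(q):
    """Genotype frequencies for 0, 1 and 2 copies of an allele with frequency q."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]

def marginal_grr_at_a(grr, freq_b):
    """grr[i][j]: risk relative to aabb with i copies of A and j copies of B.

    Averages over the B genotypes, then rescales so that aa has GRR 1.
    """
    risk = [sum(f * r for f, r in zip(freq_b, row)) for row in grr]
    return [r / risk[0] for r in risk]

# Hypothetical interaction model: the risk is tripled only when
# there is at least one A allele and at least one B allele.
grr = [[1, 1, 1],
       [1, 3, 3],
       [1, 3, 3]]
print(marginal_grr_at_a(grr, hwe_frequencies(0.5)))
```

The marginal GRRs at A come out as a dominant-looking single-locus model, with an effect size that depends on the allele frequency at B.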

Indirect testing

Typically, we do not have all the variation typed, but rely on tagSNPs to indirectly test for association. The way this works is that the SNPs are correlated (this correlation is called linkage disequilibrium, LD) so the relative risk of one SNP "leaks into" a relative risk of another SNP. The GRRs of a tagSNP depend on the LD with the causal SNP(s) and on the allele frequencies, and the relationship is not straightforward, but as a rule of thumb: if a sample size of N is needed to detect association at the causal marker, then a sample size of N / r2 is needed at the tagSNP, where r2 is a measure of LD.

Although mathematically justified, it is only a rule of thumb, and it is violated especially in the presence of interaction (where there is potentially LD between the tagSNP and both causal SNPs, to confuse the matter).
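For the single-locus case, the rule of thumb can be sketched like this, using the standard definition of r2 from haplotype and allele frequencies. The frequencies and the sample size of 1000 are made-up numbers, and the function names are mine.

```python
def r_squared(p_ab, p_a, p_b):
    """LD measure r2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A and allele B.
    p_a, p_b: frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b  # the LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def tag_sample_size(n_causal, r2):
    """Rule-of-thumb sample size needed at the tagSNP."""
    return n_causal / r2

r2 = r_squared(p_ab=0.4, p_a=0.5, p_b=0.5)
print(r2, tag_sample_size(1000, r2))
```

With an r2 of about 0.36, the rule says you need almost three times as many samples at the tagSNP as at the causal SNP.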

A large part of the paper is concerned with this rule of thumb, and in my opinion this is the most interesting part of the paper. We know very little about how we perform in tagging for interaction, since essentially all tagging algorithms are based on the r2 rule of thumb.

Not really about interaction

Since they define "detection of association" to be detection of a marginal association, we are not really considering the power of detecting interaction. For the direct testing (when we are not considering tagSNPs), the interaction doesn't really come into play at all! The interaction model determines the marginal GRR, and as such it is interesting enough, but once we have the marginal GRR, there is nothing new in how we determine the power. The greater the GRR, the greater the power, but that is completely independent of interaction or not.

For the tagging consideration it is a different matter.  There the interaction has an effect, as I mentioned above, because both causal SNPs can be in LD with the tag, and that affects the r2 rule of thumb.

Still, the paper is about the power of detecting marginal association, not interaction, and it is possible (and not even that hard) to construct models where there is a strong interaction association but very little marginal effect.  For such a setup, a marginal test will never be powerful, and a full interaction model must be used.
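To show that such models really are easy to construct, here is one, under the simplifying assumptions that both risk alleles have frequency 0.5 and that the loci are independent and in Hardy-Weinberg equilibrium. The GRR values are invented for the example.

```python
def marginal_grr(grr, freq_other):
    """Average each row of grr over the other locus, relative to the first row."""
    risk = [sum(f * r for f, r in zip(freq_other, row)) for row in grr]
    return [r / risk[0] for r in risk]

# Genotype frequencies for 0, 1, 2 copies of an allele with frequency 0.5.
freq = [0.25, 0.5, 0.25]

# Rows: copies of A; columns: copies of B. The risk triples in a
# checkerboard pattern, so it depends strongly on the combination.
grr = [[1, 3, 1],
       [3, 1, 3],
       [1, 3, 1]]

transposed = [list(column) for column in zip(*grr)]
print(marginal_grr(grr, freq))         # marginal GRRs at A
print(marginal_grr(transposed, freq))  # marginal GRRs at B
```

Both sets of marginal GRRs are exactly 1, so a single-marker test has nothing to find, even though the two-locus genotype changes the risk threefold.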

It is the latter problem we are currently working on in my group. How do we find pairs that interact but have little marginal effect (we have just submitted a paper on that)? What is the power to detect such interaction? And how well do we tag such interaction?


Pickrell, J., Clerget-Darpoux, F., Bourgain, C. (2007). Power of genome-wide association studies in the presence of interacting loci. Genetic Epidemiology, 31(7), 748-762.