Entropy and epistasis

For our journal club tomorrow we are reading yet another paper on gene-gene interaction in association mapping. This time, a rather short and easy paper:

Exploration of gene-gene interaction effects using entropy-based methods
Dong et al.
European Journal of Human Genetics (2008) 16, 229–235; doi:10.1038/sj.ejhg.5201921

Abstract

Gene–gene interaction may play important roles in complex disease studies, in which interaction effects coupled with single-gene effects are active. Many interaction models have been proposed since the beginning of the last century. However, the existing approaches including statistical and data mining methods rarely consider genetic interaction models, which make the interaction results lack biological or genetic meaning. In this study, we developed an entropy-based method integrating two-locus genetic models to explore such interaction effects. We performed our method to simulated and real data for evaluation. Simulation results show that this method is effective to detect gene–gene interaction and, furthermore, it is able to identify the best-fit model from various interaction models. Moreover, our method, when applied to malaria data, successfully revealed negative epistatic effect between sickle cell anemia and α+-thalassemia against malaria.

In this paper they use (information theoretic) entropy measures to detect pairwise gene-gene interaction in disease association. In information theory, entropy is a measure of uncertainty. The less certain you are of the outcome of an experiment, the higher the entropy of the experiment. If you are flipping an unbiased coin, the chances of heads or tails are 50/50 and you have maximal entropy, but if the coin is biased, so that you expect, say, heads to come up more often than tails, you have less entropy. In the extreme case where you are guaranteed, say, heads, the outcome is certain and the entropy is minimal. Zero, in fact.

Mathematically, if the probability of heads is p, then the entropy of the coin flip is H = -(p log p + (1-p) log(1-p)), with the convention that 0 log 0 = 0.
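To make the coin example concrete, here is a small Python sketch of the entropy of a two-outcome event. It uses the usual sign convention (a minus in front, so the entropy is never negative) and base-2 logarithms, which is my choice and gives the answer in bits; the function name binary_entropy is just for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of an event with two outcomes, one having probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    h = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # convention: 0 * log(0) counts as 0
            h -= q * math.log2(q)
    return h

print(binary_entropy(0.5))  # fair coin: 1.0, the maximum
print(binary_entropy(0.9))  # biased coin: less than 1
print(binary_entropy(1.0))  # certain outcome: 0.0
```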

If you sample an individual from the population and test whether he has a certain disease, he will have it with some probability p. This isn't that different from flipping a coin (although it would probably be a pretty biased coin for most diseases). So again you can talk about the entropy, and the formula is, of course, the same as above.

Now comes the interesting part. If you know the genotype of the individual, does that then influence the entropy of the event? Do you gain any information about disease status from knowing the genotype?

Suppose we take the entropy for the disease fraction, but for each genotype in isolation. We get a risk p_AA for genotype AA, a risk p_Aa for genotype Aa, and a risk p_aa for genotype aa, and can calculate the entropy for each of these using the formula above. If we then take the average of these entropies, weighted by the genotype frequencies, we get the entropy conditional on the genotype. Comparing this entropy with the entropy when we do not know the genotype will tell us if we gain anything from knowing the genotype. If we do, then we have a genotype/phenotype association.
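The calculation just described can be sketched in a few lines of Python. This is my own minimal version, not the paper's code: it assumes the data is given as case and control counts per genotype, and the counts in the example are made up.

```python
import math

def entropy(p):
    """Entropy in bits of a yes/no event with probability p (0 * log 0 taken as 0)."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def information_gain(counts):
    """counts maps genotype -> (cases, controls).

    Returns H(disease) - H(disease | genotype), in bits.
    """
    total = sum(cases + controls for cases, controls in counts.values())
    total_cases = sum(cases for cases, _ in counts.values())
    h_disease = entropy(total_cases / total)

    # Per-genotype entropies, averaged with the genotype frequencies as weights.
    h_conditional = 0.0
    for cases, controls in counts.values():
        n = cases + controls
        if n > 0:
            h_conditional += (n / total) * entropy(cases / n)
    return h_disease - h_conditional

# Hypothetical counts where the risk rises with the number of 'a' alleles.
associated = {"AA": (30, 70), "Aa": (50, 50), "aa": (70, 30)}
# Hypothetical counts where every genotype has the same risk (25%).
unassociated = {"AA": (10, 30), "Aa": (20, 60), "aa": (5, 15)}

print(information_gain(associated))    # clearly positive
print(information_gain(unassociated))  # zero up to rounding
```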

If we have genotypes at two loci, we can compare the entropy when we know the combined genotype against the entropy when we know either one alone. If the combined genotype has more information than the most informative marginal genotype (i.e. less entropy than the marginal with less entropy), then there must be some interaction.
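Here is a sketch of that two-locus comparison, as I have described it here rather than as the exact statistic in the paper. The data layout (counts keyed by a pair of genotypes), the function names and the example counts are all my own assumptions. The example is an extreme, XOR-like pattern where neither locus tells you anything alone but the pair tells you a lot.

```python
import math
from collections import defaultdict

def entropy(p):
    """Entropy in bits of a yes/no event with probability p."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def information_gain(counts):
    """counts maps a key -> (cases, controls); returns H(D) - H(D | key)."""
    total = sum(a + b for a, b in counts.values())
    gain = entropy(sum(a for a, _ in counts.values()) / total)
    for a, b in counts.values():
        if a + b > 0:
            gain -= ((a + b) / total) * entropy(a / (a + b))
    return gain

def marginal(counts, locus):
    """Collapse two-locus counts onto one locus (0 or 1)."""
    out = defaultdict(lambda: [0, 0])
    for genotypes, (cases, controls) in counts.items():
        out[genotypes[locus]][0] += cases
        out[genotypes[locus]][1] += controls
    return out

def interaction_gain(counts):
    """Joint information gain minus the best single-locus information gain."""
    best_single = max(information_gain(marginal(counts, 0)),
                      information_gain(marginal(counts, 1)))
    return information_gain(counts) - best_single

# Made-up counts: high risk only when exactly one locus carries the variant.
xor_like = {
    ("AA", "BB"): (10, 90),
    ("AA", "bb"): (90, 10),
    ("aa", "BB"): (90, 10),
    ("aa", "bb"): (10, 90),
}
print(information_gain(marginal(xor_like, 0)))  # 0.0: locus 1 alone is uninformative
print(interaction_gain(xor_like))               # about half a bit
```

Note that on real data the joint gain is never smaller than either marginal gain, so this difference is always at least zero; the question is whether it is larger than you would expect by chance.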

It is as simple as that. I am a bit surprised someone hasn't done this ages ago.

Of course, there are some serious limitations with the method that might explain this.

First of all, there is the problem of distinguishing between random signals and true signals. Even with no interaction, the information gain from knowing the pairwise genotypes is not exactly zero. By chance, there will be different values, and you need to know when the information gain is significant. To figure this out, they use a permutation test. They resample from their data to obtain the distribution of information gain when there is no real association, and from that they can figure out how significant the observed information gain is.
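To show what such a test looks like, here is a minimal sketch for a single marker. It shuffles the disease labels among individuals, which is the standard way of doing a permutation test but not necessarily the exact resampling scheme in the paper. The (hits + 1) / (permutations + 1) p-value, the function names and the toy data are my own assumptions.

```python
import math
import random
from collections import Counter

def entropy(p):
    """Entropy in bits of a yes/no event with probability p."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def information_gain(genotypes, status):
    """H(status) - H(status | genotype) for per-individual data (status is 0 or 1)."""
    n = len(status)
    gain = entropy(sum(status) / n)
    group_size = Counter(genotypes)
    group_cases = Counter(g for g, s in zip(genotypes, status) if s)
    for g, size in group_size.items():
        gain -= (size / n) * entropy(group_cases[g] / size)
    return gain

def permutation_p_value(genotypes, status, n_permutations=1000, seed=0):
    """Fraction of label shufflings with a gain at least as large as the observed one."""
    rng = random.Random(seed)
    observed = information_gain(genotypes, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any genotype/disease association
        if information_gain(genotypes, shuffled) >= observed:
            hits += 1
    # Counting the observed data as one of the permutations keeps p above zero.
    return (hits + 1) / (n_permutations + 1)

# Toy data: 50 AA individuals with 40 cases, 50 aa individuals with 10 cases.
genotypes = ["AA"] * 50 + ["aa"] * 50
status = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
p = permutation_p_value(genotypes, status)
print(p)  # small, but never below 1 / (n_permutations + 1)
```

The same recipe works for the pairwise statistic: only the function being recomputed on the shuffled labels changes.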

The problem with this is that it can be very slow.  The more significant an event needs to be before you trust it, the more you have to sample.  If you need an event to happen less than 1 in 100, you need to sample at least 100 times without seeing it to conclude that. If we need the event to occur less than once in 10,000 we must sample 10,000 times.

Still, the method is fine for detecting interaction for two given markers, but the typical situation is, of course, that you have a lot of markers and you want to figure out which are interacting. To figure that out, we need to test them all, at least unless we can rule some pairs out somehow.  With N markers, there are N(N-1)/2 pairs.  If you are looking genome wide, N would be around 500,000 which would give you 124,999,750,000 pairs.  That is a lot of pairs.
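The pair count is easy to check. A tiny sketch, where n_pairs is just a name I made up:

```python
import math

def n_pairs(n_markers):
    """Number of unordered marker pairs, N(N-1)/2."""
    return math.comb(n_markers, 2)

print(n_pairs(500_000))  # 124999750000
```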

It is probably a problem for all methods to test that many pairs -- at least I cannot think of any method that wouldn't choke on it -- so to be fair let us assume that we have reduced it to just one million pairs.  Then you are performing one million tests.  Now we run into the multiple testing problem. If an event happens once in ten thousand by chance, it will happen about a hundred times in a million tests.

Now we see the problem with determining significance using a permutation test. We need to correct for multiple tests, and a lot of multiple tests, so we need the events we consider significant to be very rare indeed. This means that we need to sample very many times to determine that an event is significant. This can be a very serious limitation of this method.
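To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a Bonferroni correction with a family-wise level of 0.05, and that the smallest p-value you can get from B permutations is 1/(B+1). Both are my assumptions for illustration, not something taken from the paper.

```python
import math

def expected_false_positives(n_tests, per_test_alpha):
    """Expected number of chance hits when every test is truly null."""
    return n_tests * per_test_alpha

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test significance level that keeps the family-wise level at family_alpha."""
    return family_alpha / n_tests

def min_permutations(p_threshold):
    """Smallest B such that 1 / (B + 1) <= p_threshold (up to floating-point rounding)."""
    return math.ceil(1.0 / p_threshold) - 1

n_tests = 1_000_000
print(expected_false_positives(n_tests, 1e-4))  # about a hundred chance hits
threshold = bonferroni_threshold(n_tests)
print(threshold)                                # 0.05 / 1,000,000
print(min_permutations(threshold))              # roughly twenty million per pair
```

With about twenty million permutations for each of a million pairs, that is on the order of 2 x 10^13 evaluations of the test statistic.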

It might be possible to determine significance some other way, in which case the interaction test in this paper could be useful for finding gene-gene interaction, but I am sceptical as long as it relies on a permutation test...


Dong, C., Chu, X., Wang, Y., Wang, Y., Jin, L., Shi, T., Huang, W., Li, Y. (2008). Exploration of gene-gene interaction effects using entropy-based methods. European Journal of Human Genetics, 16(2), 229-235. DOI: 10.1038/sj.ejhg.5201921


2 Responses to “Entropy and epistasis”

  1. Bob O'H Says:

    "It is as simple as that. I am a bit surprised someone hasn’t done this ages ago."

    Oh, we did. We just use likelihood rather than entropy. :-) This also means permutation tests aren't needed, because there are good approximations to the distribution of a likelihood ratio.

    I think this is one area where the Bayesian approach really pays dividends. You can do all the tests together, and look at models with >1 effect in it. And get confidence intervals etc. at the same time.

  2. Thomas Mailund Says:

    We are using likelihood ratio tests in our group as well, when looking at pair-wise interaction. Our only problem there is that the small counts for some genotype combinations tend to mess things up. Using a chi-square approximation to the distribution is a problem then, but a Fisher exact test is just not feasible on the size of data needed to have any kind of power to detect interaction.

    We are also looking at Bayesian approaches, but the implicit penalty for models with several parameters can cause some problems for us here, where it limits the power -- sort of the over-fitting problem but in reverse. Averaging over models just makes this worse.

    Still, my feeling is that Bayesian approaches are the way to go for more complex association mapping approaches where you are not likely to know null distributions.

    Looking for Bayes factors > 1 is not good enough, though. You will see lots and lots of such cases by chance, so you still need to decide on what a significant Bayes factor is. This is something we haven't found a good answer for yet, and our collaborators always want p-values, so really we want to be able to translate Bayes factors into p-values which really defeats the purpose...
