Exploration of gene-gene interaction effects using entropy-based methods
Dong et al.
European Journal of Human Genetics (2008) 16, 229–235; doi:10.1038/sj.ejhg.5201921
Gene–gene interactions may play important roles in complex disease studies, in which interaction effects coupled with single-gene effects are active. Many interaction models have been proposed since the beginning of the last century. However, the existing approaches, including statistical and data-mining methods, rarely consider genetic interaction models, which makes the interaction results lack biological or genetic meaning. In this study, we developed an entropy-based method integrating two-locus genetic models to explore such interaction effects. We applied our method to simulated and real data for evaluation. Simulation results show that this method is effective in detecting gene–gene interaction and, furthermore, that it is able to identify the best-fit model among various interaction models. Moreover, our method, when applied to malaria data, successfully revealed a negative epistatic effect between sickle cell anemia and α+-thalassemia against malaria.
In this paper they use (information-theoretic) entropy measures to detect pairwise gene-gene interaction in disease association. In information theory, entropy is a measure of uncertainty: the less certain you are of the outcome of an experiment, the higher the entropy of the experiment. If you are flipping an unbiased coin, the chances of heads or tails are 50/50 and you have maximal entropy, but if the coin is biased, you expect, say, heads to come up more often than tails, and you have less entropy. In the extreme case where you are guaranteed, say, heads, the outcome is certain and the entropy is minimal. Zero, in fact.
Mathematically, if the probability of heads is p, then the entropy of the coin flip is H = -(p log p + (1-p) log(1-p)).
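The formula is easy to check in code. A minimal sketch (the function name is my own):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # the outcome is certain, so there is no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.5))  # fair coin: maximal entropy, 1.0
print(binary_entropy(0.9))  # biased coin: less entropy, about 0.47
print(binary_entropy(1.0))  # guaranteed heads: zero entropy, 0.0
```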
If you sample an individual from the population and test if he has a certain disease, he might have that with a probability p. This isn't that different from flipping a coin (although it would probably be a pretty biased coin for most diseases). So again you can talk about the entropy, and the formula is, of course, the same as above.
Now comes the interesting part. If you know the genotype of the individual, does that then influence the entropy of the event? Do you gain any information about disease status from knowing the genotype?
If we take the entropy of the disease risk for each genotype in isolation (so we have a risk pAA for genotype AA, a risk pAa for genotype Aa, and a risk paa for genotype aa, and can calculate the entropy for each of these using the formula above) and then take the average of these entropies, weighted by the genotype frequencies, we get the entropy of the disease status conditional on the genotype. Comparing this conditional entropy with the entropy when we do not know the genotype tells us whether we gain anything from knowing the genotype. If we do, then we have a genotype/phenotype association.
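As a sketch of that calculation (the genotype frequencies and per-genotype risks below are made-up numbers, not from the paper):

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Made-up genotype frequencies and per-genotype disease risks for AA, Aa, aa.
freqs = [0.49, 0.42, 0.09]
risks = [0.05, 0.10, 0.40]

# Entropy of disease status when the genotype is unknown...
p_disease = sum(f * r for f, r in zip(freqs, risks))
h_marginal = binary_entropy(p_disease)

# ...versus the frequency-weighted average of the per-genotype entropies.
h_conditional = sum(f * binary_entropy(r) for f, r in zip(freqs, risks))

# A positive information gain means knowing the genotype tells us something.
info_gain = h_marginal - h_conditional
print(info_gain)
```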
If we have genotypes at two markers, we can compare the entropy when we know the combined two-locus genotype against the entropy when we know either genotype alone. If the combined genotype carries more information than the most informative marginal genotype (i.e. it leaves less conditional entropy than the better of the two marginals), then there must be some interaction.
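A sketch of that comparison, using a deliberately extreme made-up example where neither marker is informative alone but the pair determines disease status completely (all names here are mine, not the paper's):

```python
import math
from collections import Counter

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def conditional_entropy(observations):
    """H(disease | genotype) from (genotype, affected) observations."""
    n = len(observations)
    totals = Counter(g for g, _ in observations)
    cases = Counter(g for g, d in observations if d)
    return sum((totals[g] / n) * binary_entropy(cases[g] / totals[g])
               for g in totals)

def interaction_gain(data):
    """data: (g1, g2, affected) triples. How much less entropy does the
    combined genotype leave than the better of the two single genotypes?"""
    h1 = conditional_entropy([(g1, d) for g1, g2, d in data])
    h2 = conditional_entropy([(g2, d) for g1, g2, d in data])
    h12 = conditional_entropy([((g1, g2), d) for g1, g2, d in data])
    return min(h1, h2) - h12

# XOR-style toy data: disease iff exactly one of the two risk alleles is present,
# so each marker alone looks like a fair coin, but the pair is fully informative.
data = [(g1, g2, g1 ^ g2) for g1 in (0, 1) for g2 in (0, 1)] * 10
print(interaction_gain(data))  # 1.0
```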
It is as simple as that. I am a bit surprised someone hasn't done this ages ago.
Of course, there are some serious limitations with the method that might explain this.
First of all, there is the problem with distinguishing between random signals and true signals. Even with no interaction, there is not exactly zero information gain in knowing the pairwise genotypes. By chance, there will be different values, and you need to know when the information gain is significant. To figure this out, they use a permutation test. They resample from their data and that way figure out the distribution of information gain when there is no real association, and from that they can figure out how significant the information gain is.
The problem with this is that it can be very slow. The more significant an event needs to be before you trust it, the more you have to sample. If you need an event to happen less than 1 in 100 times by chance, you need to sample at least 100 times without seeing it to conclude that; if you need it to occur less than once in 10,000 times, you must sample at least 10,000 times.
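A minimal sketch of such a permutation test (the statistic here is a toy disease-rate difference for illustration, not the paper's information gain, and all names are my own):

```python
import random

def permutation_p_value(genotypes, labels, statistic, n_perm=999, seed=1):
    """Shuffle the disease labels to break any real genotype/phenotype link,
    and count how often the shuffled data matches or beats the observed statistic."""
    rng = random.Random(seed)
    observed = statistic(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    # The +1s keep the estimate from being exactly zero; the smallest
    # reportable p value is 1/(n_perm + 1), which is why very rare events
    # need very many permutations.
    return (hits + 1) / (n_perm + 1)

# Toy statistic: difference in disease rate between carriers and non-carriers.
def rate_diff(gs, ds):
    carriers = [d for g, d in zip(gs, ds) if g == 1]
    others = [d for g, d in zip(gs, ds) if g == 0]
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))

genotypes = [1] * 50 + [0] * 50
labels = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40  # a strong association
print(permutation_p_value(genotypes, labels, rate_diff))
```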
Still, the method is fine for detecting interaction for two given markers, but the typical situation is, of course, that you have a lot of markers and you want to figure out which are interacting. To figure that out, we need to test them all, at least unless we can rule some pairs out somehow. With N markers, there are N(N-1)/2 pairs. If you are looking genome wide, N would be around 500,000 which would give you 124,999,750,000 pairs. That is a lot of pairs.
It is probably a problem for all methods to test that many pairs -- at least I cannot think of any method that wouldn't choke on it -- so to be fair let us assume that we have reduced it to just one million pairs. Then you are performing one million tests. Now we run into the multiple testing problem. If an event happens once in ten thousand by chance, it will happen about a hundred times in a million tests.
Now we see the problem with determining significance using a permutation test. We need to correct for multiple tests, and a lot of multiple tests, so we need the events we consider significant to be very rare indeed. This means that we need to sample very many times to determine that an event is significant. This can be a very serious limitation with this method.
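Putting rough numbers on this (my own back-of-the-envelope arithmetic with a Bonferroni-style correction, not figures from the paper):

```python
# A million tests at a per-test chance rate of 1 in 10,000 gives about a
# hundred false hits, so the per-test threshold must be far stricter.
n_tests = 1_000_000
chance_rate = 1 / 10_000
print(n_tests * chance_rate)        # expected chance hits: 100.0

# A Bonferroni-style correction for a 5% family-wise error rate:
alpha = 0.05
per_test_alpha = alpha / n_tests    # roughly 5e-08 per test
print(per_test_alpha)

# A permutation test cannot report p values below 1/(n_perm + 1), so reaching
# that threshold needs on the order of twenty million permutations -- per pair.
n_perm_needed = round(1 / per_test_alpha) - 1
print(n_perm_needed)
```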
It might be possible to determine significance some other way, in which case the interaction test in this paper could be useful for finding gene-gene interaction, but I am sceptical as long as it relies on a permutation test...
Dong, C., Chu, X., Wang, Y., Wang, Y., Jin, L., Shi, T., Huang, W., Li, Y. (2008). Exploration of gene-gene interaction effects using entropy-based methods. European Journal of Human Genetics, 16(2), 229-235. DOI: 10.1038/sj.ejhg.5201921