For our association-mapping journal-club tomorrow we are reading
Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions. Lee et al. Bioinformatics 2007 23(19): 2589-2595
It is "yet another extension" of the multifactor dimensionality reduction (MDR) method, a method for association mapping that is quite popular (judging from the number of publications of extensions of it).
I think I'll write something more about MDR later, in another post, but for now I'll just describe it very briefly.
The idea is simply to reduce genotypes (and any co-variates) to a simple high- or low-risk classification, and then use cross-validation to find the best way of classifying cases and controls. So rather than dealing with high-dimensional genotypes, you initially only care about whether a given genotype in general has a higher or lower risk of being a case (in your data set), in the sense that the ratio of cases to controls, among the individuals with the given genotype, is higher or lower than some given threshold.
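To make the reduction step concrete, here is a minimal sketch in Python. The function name `risk_labels`, the default threshold of 1.0 and the toy data are my own choices for illustration, not taken from the paper.

```python
from collections import Counter

def risk_labels(genotypes, is_case, threshold=1.0):
    """Label each observed genotype as 'high' or 'low' risk.

    genotypes: one hashable genotype per individual, e.g. ("AA", "Bb")
    is_case:   one bool per individual (True = case, False = control)
    threshold: case/control ratio above which a genotype counts as high risk
               (1.0 is an assumed default)
    """
    cases = Counter(g for g, c in zip(genotypes, is_case) if c)
    controls = Counter(g for g, c in zip(genotypes, is_case) if not c)
    labels = {}
    for g in set(cases) | set(controls):
        # cases/controls > threshold, written without a division so that
        # genotypes with zero controls do not raise an error
        labels[g] = "high" if cases[g] > threshold * controls[g] else "low"
    return labels

# Toy data: two two-marker genotypes, four individuals each
genotypes = [("AA", "BB")] * 4 + [("Aa", "BB")] * 4
is_case = [True, True, True, False, True, False, False, False]
labels = risk_labels(genotypes, is_case)
```

In the toy data the first genotype has three cases and one control and ends up high risk, while the second has one case and three controls and ends up low risk.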
Anyway, based on this reduction of dimensions you then consider all subsets of explanatory variables (for example all subsets of markers), build a model from each subset, and see how well it explains the phenotypes. The best model is the one that performs the best in the cross-validation.
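The subset search can be sketched along the same lines. This is a simplified illustration, not the published procedure: it scores each marker subset only by cross-validated prediction accuracy, uses a fixed interleaved split into folds, and predicts "control" for genotypes not seen in training. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def fit_labels(rows, status, markers, threshold=1.0):
    """Map each genotype at `markers` to True (high risk) or False (low risk)."""
    cases, controls = Counter(), Counter()
    for row, case in zip(rows, status):
        key = tuple(row[m] for m in markers)
        (cases if case else controls)[key] += 1
    return {k: cases[k] > threshold * controls[k]
            for k in set(cases) | set(controls)}

def cv_accuracy(rows, status, markers, folds=5):
    """Fraction of held-out individuals whose status is predicted correctly."""
    n = len(rows)
    correct = 0
    for f in range(folds):
        test_idx = set(range(f, n, folds))  # every folds-th individual
        train_idx = [i for i in range(n) if i not in test_idx]
        high = fit_labels([rows[i] for i in train_idx],
                          [status[i] for i in train_idx], markers)
        for i in test_idx:
            key = tuple(rows[i][m] for m in markers)
            predicted_case = high.get(key, False)  # unseen genotype: control
            correct += predicted_case == status[i]
    return correct / n

def best_subset(rows, status, max_size=2, folds=5):
    """Return the marker subset with the highest cross-validated accuracy."""
    n_markers = len(rows[0])
    candidates = [s for k in range(1, max_size + 1)
                  for s in combinations(range(n_markers), k)]
    return max(candidates, key=lambda s: cv_accuracy(rows, status, s, folds))

# Toy data: three biallelic markers coded 0/1; an individual is a case exactly
# when markers 0 and 1 differ, and marker 2 is irrelevant.
rows, status = [], []
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            for _ in range(5):
                rows.append((a, b, c))
                status.append(a != b)

best = best_subset(rows, status)
```

On this toy data no single marker predicts better than chance, but the pair of markers 0 and 1 classifies every held-out individual correctly, so that pair is selected.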
The log-linear extension
The extension in this particular paper is in how the model for predicting the phenotype is constructed. Rather than simply using the ratio of cases to controls to predict phenotypes (going with the majority of the individuals with the given genotype), the new method uses a log-linear model to predict the phenotype. This has the benefit that the model has fewer parameters, so 1) it is less likely to over-fit and 2) small counts in the contingency table are less of a problem, because each parameter is estimated from several cells of the table.
One thing that confuses me with the paper, though, is the model they use. They predict the (log of the) contingency table cell counts with a model that looks like
$$\log m_{ijk} = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^D_k + \lambda^{XD}_{ik} + \lambda^{YD}_{jk} + \lambda^{XY}_{ij}$$
where X and Y are markers (with three genotypes each), D is the disease status, and m_{ijk} is the expected count of individuals with genotype i at X, genotype j at Y and disease status k. In this model, you have count contributions for the markers independent of the disease (the lambdas with superscript X and Y), you have a contribution from the disease status (the lambda with superscript D) that captures the general count of the affected vs unaffected, you have marginal effects (the lambdas with XD and YD) and you have an interaction between the markers (the lambda with XY) that essentially captures linkage disequilibrium (LD).
As far as I can see you do not capture the effect on phenotype that the interaction of X and Y has! If you want to capture the interaction effect the two markers have on the disease status, you want to capture D | XY, which is being ignored. The model captures how the two markers interact in general in the sample, which is essentially just LD, but not how the interaction affects the disease risk.
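To spell the argument out, write m_{ijk} for the expected count of individuals with genotype i at X, genotype j at Y and disease status k (k = 1 for cases, k = 0 for controls; the indices are my own labels). In a model with only the terms described above, everything that does not involve D cancels when you take the log of the case/control ratio for a genotype:

```latex
\log\frac{m_{ij1}}{m_{ij0}}
  = \left(\lambda^{D}_{1} - \lambda^{D}_{0}\right)
  + \left(\lambda^{XD}_{i1} - \lambda^{XD}_{i0}\right)
  + \left(\lambda^{YD}_{j1} - \lambda^{YD}_{j0}\right)
```

The XY term is gone, and what is left is one contribution that depends only on the genotype at X plus one that depends only on the genotype at Y. On the log-odds scale the two markers can therefore only act additively. For the risk to depend on the combination of genotypes you would need a three-way term \lambda^{XYD}_{ijk}, which adds \lambda^{XYD}_{ij1} - \lambda^{XYD}_{ij0} to the right-hand side and makes the model saturated.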
In the discussion, the authors write (emphasis mine):
In summary, when there is high-order epistasis in the absence of marginal effects, the LM MDR procedure that includes finding a parsimonious model provides a similar result to that of the original MDR approach. Otherwise, LM MDR provides a better result than the original MDR approach, because one of the strengths of the unsaturated log-linear model is to estimate the cell frequencies in the sparse or empty cells when the unsaturated model gives a good fit to the data.
My guess as to why this method does not perform better than the old method in the absence of marginal effects -- even when there is significant gene-gene interaction (epistasis) -- is exactly that the model does not capture gene-gene interaction. It is exactly what is left out!
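A toy check of this guess, as a sketch under my own assumptions: two biallelic markers instead of three genotypes each, made-up counts with pure epistasis and no marginal effects, and iterative proportional fitting as a standard way to fit a log-linear model with XY, XD and YD terms (the paper may well fit its models differently). The function name is hypothetical.

```python
def fit_no_three_way(counts, sweeps=50):
    """Fit the log-linear model with XY, XD and YD terms (no XYD term)
    by iterative proportional fitting; counts[x][y][d] are observed counts."""
    nx, ny, nd = len(counts), len(counts[0]), len(counts[0][0])
    fitted = [[[1.0] * nd for _ in range(ny)] for _ in range(nx)]

    def rescale(cells):
        # Scale the fitted cells so their sum matches the observed margin
        observed = sum(counts[i][j][k] for i, j, k in cells)
        current = sum(fitted[i][j][k] for i, j, k in cells)
        if current > 0:
            for i, j, k in cells:
                fitted[i][j][k] *= observed / current

    for _ in range(sweeps):
        for i in range(nx):          # match the XY margin
            for j in range(ny):
                rescale([(i, j, k) for k in range(nd)])
        for i in range(nx):          # match the XD margin
            for k in range(nd):
                rescale([(i, j, k) for j in range(ny)])
        for j in range(ny):          # match the YD margin
            for k in range(nd):
                rescale([(i, j, k) for i in range(nx)])
    return fitted

# counts[x][y] = [controls, cases]; cases dominate exactly when x != y,
# and every one- and two-way margin is balanced (made-up numbers).
counts = [[[30, 10], [10, 30]],
          [[10, 30], [30, 10]]]

fitted = fit_no_three_way(counts)
observed_ratio = [[cell[1] / cell[0] for cell in row] for row in counts]
fitted_ratio = [[cell[1] / cell[0] for cell in row] for row in fitted]
```

The observed case/control ratios are 1/3 on the diagonal and 3 off it, so the original MDR rule separates the genotypes cleanly. The fitted ratios come out as 1 for every genotype, so a classifier built on this particular model has nothing to go on.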
I am all for using models with fewer parameters, but I think they are leaving out the essential parameter here...
Citation for Research Blogging: Lee, S.Y., Chung, Y., Elston, R.C., Kim, Y., Park, T. (2007). Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions. Bioinformatics, 23(19), 2589-2595. DOI: 10.1093/bioinformatics/btm396