Rocket science is for kids, bioinformatics is for scientists

Yesterday I visited CLC Bio for a lunch with Roald Forsberg, but afterwards I had a discussion with Bjarne Knudsen about HMM implementations using SIMD instructions. I have a student working on it for his thesis, and CLC Bio is using it in their software and (I think) their CLC Cube.

Anyway, Bjarne gave me a sticker with a slogan I just loved: “Rocket science is for kids, bioinformatics is for scientists”. So I asked for a png to put on my blog, and here it is! You’ll also find it on the sidebar to the right.

Update: Since Lasse is giving me the OK (see the comment), I’ll put up links for the three sizes of the sticker I have:

  1. CLC Bio sticker size 120×90
  2. CLC Bio sticker size 180×150
  3. CLC Bio sticker size 195×120

Log-linear MDR

For our association-mapping journal-club tomorrow we are reading

Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions. Lee et al. Bioinformatics 2007 23(19): 2589-2595

It is “yet another extension” of the multifactor dimensionality reduction (MDR) method, a method for association mapping that is quite popular (judging from the number of publications of extensions of it).

I think I’ll write something more about MDR later, in another post, but for now I’ll just describe it very briefly.


The idea is simply to reduce genotypes (and any co-variates) to a simple high- or low-risk classification, and then use cross-validation to find the best way of classifying cases and controls. So rather than dealing with high-dimensional genotypes, you initially only care about whether a given genotype in general has a higher or lower risk of being a case (in your data set), in the sense that the ratio of cases to controls, among the individuals with the given genotype, is higher or lower than some given threshold.

Anyway, based on this reduction of dimensions you then consider all sub-sets of explanatory variables (for example all subsets of markers) and try to build models from this and then see how well they explain the phenotypes. The best model is the one that performs the best in the cross-validation.
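To make the two steps concrete, here is a minimal sketch in Python. It is my own toy version, not the reference MDR implementation: the function names, the fold assignment and the default threshold of 1.0 are all assumptions, and it picks the best subset by mean test error only.

```python
from collections import Counter
from itertools import combinations, product


def cell(genotype, markers):
    """Project a full genotype onto the chosen markers."""
    return tuple(genotype[m] for m in markers)


def high_risk_cells(genotypes, status, markers, threshold=1.0):
    """Label a multi-locus genotype high-risk if cases/controls > threshold."""
    cases, controls = Counter(), Counter()
    for g, d in zip(genotypes, status):
        (cases if d == 1 else controls)[cell(g, markers)] += 1
    # Comparing cases > threshold * controls avoids dividing by zero.
    return {c for c in cases if cases[c] > threshold * controls[c]}


def error_rate(genotypes, status, markers, high):
    """Fraction of individuals whose status disagrees with the risk label."""
    wrong = sum((cell(g, markers) in high) != (d == 1)
                for g, d in zip(genotypes, status))
    return wrong / len(status)


def mdr(genotypes, status, max_order=2, folds=5, threshold=1.0):
    """Return (mean cross-validation error, markers) for the best subset."""
    n, n_markers = len(status), len(genotypes[0])
    best = None
    for order in range(1, max_order + 1):
        for markers in combinations(range(n_markers), order):
            total = 0.0
            for fold in range(folds):
                # Hypothetical fold assignment: individual i goes to fold i % folds.
                train = [i for i in range(n) if i % folds != fold]
                test = [i for i in range(n) if i % folds == fold]
                high = high_risk_cells([genotypes[i] for i in train],
                                       [status[i] for i in train],
                                       markers, threshold)
                total += error_rate([genotypes[i] for i in test],
                                    [status[i] for i in test], markers, high)
            if best is None or total / folds < best[0]:
                best = (total / folds, markers)
    return best


# Toy data: four markers, disease status set by markers 0 and 2 only.
genotypes = list(product(range(3), repeat=4)) * 5
status = [(g[0] + g[2]) % 2 for g in genotypes]
print(mdr(genotypes, status))
```

In the toy data neither marker 0 nor marker 2 predicts the status perfectly on its own, so the search has to go to pairs of markers to find the model.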

The log-linear extension

The extension in this particular paper is in how the model for predicting the phenotype is constructed. Rather than simply using the ratios of cases to controls to predict phenotypes (going with the majority of the individuals with the given genotype), the new method uses a log-linear model to predict the phenotype. This has the benefit that the model has fewer parameters, so 1) it is less likely to over-fit and 2) you have fewer problems with small counts in the contingency table, because the parameters are estimated from several cells in the table.

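One thing that confuses me with the paper, though, is the model they use. They predict the (log of the) contingency table cell counts, m_{ijk} for genotype i at X, genotype j at Y and disease status k, with a model that looks like

$$\log m_{ijk} = \lambda + \lambda^{X}_{i} + \lambda^{Y}_{j} + \lambda^{D}_{k} + \lambda^{XD}_{ik} + \lambda^{YD}_{jk} + \lambda^{XY}_{ij}$$
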
where X and Y are markers (with three genotypes each) and D is the disease status. In this model, you have count contributions for the markers independent of the disease (the lambdas with superscript X and Y), a contribution from the disease status (the lambda with superscript D) that captures the general count of affected vs unaffected, marginal effects (the lambdas with XD and YD), and an interaction between the markers (the lambda with XY) that essentially captures linkage disequilibrium (LD).

As far as I can see you do not capture the effect on phenotype that the interaction of X and Y has! If you want to capture the interaction effect the two markers have on the disease status, you want to capture D | XY, which is being ignored. The model captures how the two markers interact in general in the sample, which is essentially just LD, but not how the interaction affects the disease risk.
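A small numerical check of this argument, as a sketch under my own assumptions rather than anything from the paper: I fit the model with all two-way terms but no XYD term by iterative proportional fitting (the standard way to fit such a log-linear model), on a made-up table with two genotypes per marker instead of three. The table has pure interaction and no marginal effects; the function name and the counts are hypothetical.

```python
def fit_no_three_way(n, iterations=50):
    """Fit the log-linear model with XY, XD and YD terms (no XYD term).

    n[i][j][k] is the observed count for genotype i at X, genotype j at Y
    and disease status k. Assumes every two-way margin is positive.
    """
    I, J, K = len(n), len(n[0]), len(n[0][0])
    m = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(iterations):
        for i in range(I):          # match the XY margin
            for j in range(J):
                scale = sum(n[i][j]) / sum(m[i][j])
                for k in range(K):
                    m[i][j][k] *= scale
        for i in range(I):          # match the XD margin
            for k in range(K):
                scale = (sum(n[i][j][k] for j in range(J))
                         / sum(m[i][j][k] for j in range(J)))
                for j in range(J):
                    m[i][j][k] *= scale
        for j in range(J):          # match the YD margin
            for k in range(K):
                scale = (sum(n[i][j][k] for i in range(I))
                         / sum(m[i][j][k] for i in range(I)))
                for i in range(I):
                    m[i][j][k] *= scale
    return m


# Made-up counts [controls, cases]: risk is high only when the two
# genotypes differ, so neither marker has a marginal effect.
observed = [[[30, 10], [10, 30]],
            [[10, 30], [30, 10]]]
fitted = fit_no_three_way(observed)
for i in range(2):
    for j in range(2):
        observed_odds = observed[i][j][1] / observed[i][j][0]
        fitted_odds = fitted[i][j][1] / fitted[i][j][0]
        print(i, j, round(observed_odds, 2), round(fitted_odds, 2))
```

All three two-way margins of this table are flat, so the fitted counts come out the same in every cell and the fitted case-to-control odds are 1 everywhere, even though the observed odds are 1/3 or 3 depending on the genotype combination.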

In the discussion, the authors write (emphasis mine):

In summary, when there is high-order epistasis in the absence of marginal effects, the LM MDR procedure that includes finding a parsimonious model provides a similar result to that of the original MDR approach. Otherwise, LM MDR provides a better result than the original MDR approach, because one of the strengths of the unsaturated log-linear model is to estimate the cell frequencies in the sparse or empty cells when the unsaturated model gives a good fit to the data.

My guess as to why this method does not perform better than the old method in the absence of marginal effects, even when there is significant gene-gene interaction (epistasis), is exactly that the model does not capture gene-gene interaction. It is exactly what is left out!

I am all for using models with fewer parameters, but I think they are leaving out the essential parameter here…

Citation for Research Blogging: Yeoun Lee, S., Chung, Y., Elston, R.C., Kim, Y., Park, T. (2007). Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions. Bioinformatics, 23(19), 2589-2595. DOI: 10.1093/bioinformatics/btm396

Estimating parameters of speciation models

Another paper that addresses the speciation process in apes is:

A new approach to estimate parameters of speciation models with application to apes

Becquet and Przeworski

Genome Research 17:1505-1519


How populations diverge and give rise to distinct species remains a fundamental question in evolutionary biology, with important implications for a wide range of fields, from conservation genetics to human evolution. A promising approach is to estimate parameters of simple speciation models using polymorphism data from multiple loci. Existing methods, however, make a number of assumptions that severely limit their applicability, notably, no gene flow after the populations split and no intralocus recombination. To overcome these limitations, we developed a new Markov chain Monte Carlo method to estimate parameters of an isolation-migration model. The approach uses summaries of polymorphism data at multiple loci surveyed in a pair of diverging populations or closely related species and, importantly, allows for intralocus recombination. To illustrate its potential, we applied it to extensive polymorphism data from populations and species of apes, whose demographic histories are largely unknown. The isolation-migration model appears to provide a reasonable fit to the data. It suggests that the two chimpanzee species became reproductively isolated in allopatry ~850 Kya, while Western and Central chimpanzee populations split ~440 Kya but continued to exchange migrants. Similarly, Eastern and Western gorillas and Sumatran and Bornean orangutans appear to have experienced gene flow since their splits ~90 and over 250 Kya, respectively.

In this paper they develop a method to infer the coalescence parameters in a model that is essentially a population split with migration (see the figure for details).

The effective population sizes, the Ns, tell us something about the diversity of the species (where NA tells us about the ancestral species). The split time, T, gives us the speciation time, and the migration parameter, m, tells us something about the way the speciation occurred (an allopatric vs parapatric model).

As usual for coalescence models, the full likelihood of the parameters is computationally demanding to compute, so the authors use summary statistics instead. This is somewhat like an Approximate Bayesian Computation (ABC) method, if you can call it that when you want to match the summaries exactly. They then develop a Markov Chain Monte Carlo (MCMC) method to sample from the likelihood function over the summary statistics.
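To give a feel for the general idea, here is a toy sketch in Python. It is not the authors' method or model: I swap in a much simpler setting (a single population under the standard neutral coalescent, one parameter theta, and the number of segregating sites S as the only summary) and a likelihood-free MCMC that accepts a proposal only when the simulated summary matches the observed one exactly. The function names, the flat prior on (0, 20) and the proposal width are all my assumptions.

```python
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) count by counting unit-rate arrivals before lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(theta, n, rng):
    """Simulate S for n sequences under the standard neutral coalescent."""
    # While k lineages remain, the waiting time is Exp(k(k-1)/2) and it
    # contributes k * time to the total tree length.
    length = sum(k * rng.expovariate(k * (k - 1) / 2)
                 for k in range(2, n + 1))
    return poisson(theta / 2 * length, rng)


def summary_mcmc(s_obs, n, steps, theta0, upper=20.0, step_sd=1.0, seed=1):
    """Sample theta given S == s_obs without evaluating a likelihood."""
    rng = random.Random(seed)
    theta, chain = theta0, []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, step_sd)
        # Flat prior on (0, upper) and a symmetric proposal: accept exactly
        # when the simulated summary equals the observed one.
        if (0.0 < proposal < upper
                and simulate_segregating_sites(proposal, n, rng) == s_obs):
            theta = proposal
        chain.append(theta)
    return chain


chain = summary_mcmc(s_obs=10, n=10, steps=2000, theta0=3.0)
print(sum(chain) / len(chain))
```

Exact matching works here because S is a single small integer. With several summaries over many loci, an exact match by simulation becomes very unlikely, which is one reason real methods need something cleverer than this sketch.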

Based on this model, they then estimate speciation times for sub-species of chimps, gorillas and orangutans.

Citation for Research Blogging: Becquet, C., Przeworski, M. (2007). A new approach to estimate parameters of speciation models with application to apes. Genome Research, 17(10), 1505-1519. DOI: 10.1101/gr.6409707

Playing with themes

I’ve been playing with the theme for my blog today. The one I used I had hacked up to the point where it was getting downright ugly, so I went looking for a new one. There are lots of themes for WordPress, but I don’t really like any of them. Unfortunately I am pretty bad at designing webpages myself, so I have to pick one of the existing ones.

This one isn’t so bad, I think. It is pretty simple, it lets me design the side-bars using widgets, and the blockquotes look reasonable for paper citations, something I use them for a lot. I don’t much like the font, but when I start messing with the font myself I really mess up the page.

I’ll stick with this theme for a while now, but probably go looking for a new one next time I have an afternoon I don’t know what to do with.