Archive for March, 2008

You know, people do use neighbour joining!

Thursday, March 27th, 2008

Over the last couple of years, I have done a little work on phylogeny inference, including a few papers on neighbour joining.  One thing that consistently happens when you submit a paper on this (and I bring it up because I have just gotten back reviewer reports on such a paper) is that at least one reviewer will tell you that neighbour joining is not interesting and that one should focus on maximum likelihood / Bayesian trees instead.

Sorry to say it, but people do use neighbour joining. I am willing to bet that there are ten times as many people using neighbour joining to infer trees as there are people using the statistical approaches, so algorithmic improvements here do matter!

The statistical approaches are usually more accurate, and they are better at capturing the uncertainty in the inference and such, but they are slow! Not slow as in, “I’ll go get a cup of coffee while the program finishes”, but slow as in “I’ll look at the tree when I am back from my vacation”.

Sure, they are fast enough for tens of leaves, but some people infer trees with thousands of leaves.  I recently got an email from a guy who tried with tens of thousands of leaves and ran out of memory using one of my tools: it needed more than 4 GB, so it choked on the problem (but a student in our lab has now come up with a new algorithm that uses less memory, so that should solve that problem).

For large trees, forget about ML or Bayesian approaches.  They do not scale (yet).

People do use neighbour joining, so shut up and review the paper for what it is, not what you want it to be. Grrr!

Entropy and epistasis

Wednesday, March 26th, 2008

For our journal club tomorrow we are reading yet another paper on gene-gene interaction in association mapping. This time, a rather short and easy paper:

Exploration of gene-gene interaction effects using entropy-based methods
Dong et al.
European Journal of Human Genetics (2008) 16, 229–235; doi:10.1038/sj.ejhg.5201921

Abstract

Gene–gene interaction may play important roles in complex disease studies, in which interaction effects coupled with single-gene effects are active. Many interaction models have been proposed since the beginning of the last century. However, the existing approaches including statistical and data mining methods rarely consider genetic interaction models, which make the interaction results lack biological or genetic meaning. In this study, we developed an entropy-based method integrating two-locus genetic models to explore such interaction effects. We performed our method to simulated and real data for evaluation. Simulation results show that this method is effective to detect gene–gene interaction and, furthermore, it is able to identify the best-fit model from various interaction models. Moreover, our method, when applied to malaria data, successfully revealed negative epistatic effect between sickle cell anemia and α+-thalassemia against malaria.

In this paper they use (information theoretic) entropy measures to detect pairwise gene-gene interaction in disease association. In information theory, entropy is a measure of uncertainty. The less certain you are of the outcome of an experiment, the higher the entropy of the experiment. If you are flipping an unbiased coin, the chances of heads or tails are 50/50 and you have maximal entropy, but if the coin is biased, you expect, say, heads to come up more often than tails, and you have less entropy. In the extreme case where you are guaranteed, say, heads, the outcome is certain and the entropy is minimal. Zero, in fact.

Mathematically, if the probability of heads is p, then the entropy of the coin flip is H = -p log p - (1-p) log(1-p), with the convention that 0 log 0 = 0.
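To make the coin example concrete, here is a minimal Python sketch of the binary entropy, using the usual Shannon definition with the leading minus sign. The function name `binary_entropy` and the choice of base-2 logarithms (entropy in bits) are my own choices for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a coin that shows heads with probability p."""
    if p <= 0.0 or p >= 1.0:
        # A certain outcome carries no uncertainty (0 log 0 is taken as 0).
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


fair = binary_entropy(0.5)     # unbiased coin: maximal entropy
biased = binary_entropy(0.9)   # biased coin: less entropy
certain = binary_entropy(1.0)  # guaranteed heads: zero entropy
```

The fair coin gives one bit, the biased coin something strictly between zero and one, and the certain coin exactly zero.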

If you sample an individual from the population and test if he has a certain disease, he might have that with a probability p. This isn’t that different from flipping a coin (although it would probably be a pretty biased coin for most diseases). So again you can talk about the entropy, and the formula is, of course, the same as above.

Now comes the interesting part. If you know the genotype of the individual, does that then influence the entropy of the event? Do you gain any information about disease status from knowing the genotype?

If we take the entropy for the disease fraction, but for each genotype in isolation (so we get a risk p_AA for genotype AA, a risk p_Aa for genotype Aa, and a risk p_aa for genotype aa and can calculate the entropy for each of these using the formula above) and we then take the weighted average of entropies, weighted with the genotype frequencies, then we get the entropy conditional on the genotype. Comparing this entropy with the entropy when we do not know the genotype will tell us if we gain anything from knowing the genotype. If we do, then we have a genotype/phenotype association.
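As a rough sketch of this calculation in Python: the genotype frequencies and risks below are numbers I made up for illustration, and the function names are hypothetical, not taken from the paper.

```python
import math


def entropy(p):
    """Binary entropy in bits for a disease risk p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def conditional_entropy(freqs, risks):
    """Average of per-genotype entropies, weighted by genotype frequency."""
    return sum(f * entropy(r) for f, r in zip(freqs, risks))


def information_gain(freqs, risks):
    """Entropy without the genotype minus entropy given the genotype."""
    overall_risk = sum(f * r for f, r in zip(freqs, risks))
    return entropy(overall_risk) - conditional_entropy(freqs, risks)


# Made-up example: frequencies and disease risks for genotypes AA, Aa, aa.
freqs = [0.49, 0.42, 0.09]
risks = [0.05, 0.10, 0.40]

gain = information_gain(freqs, risks)               # positive: association
no_gain = information_gain(freqs, [0.1, 0.1, 0.1])  # same risk everywhere
```

When the risk is the same for every genotype, the gain is zero up to rounding; when the risks differ, knowing the genotype lowers the entropy.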

If we have genotypes at two markers, we can compare the entropy when we know the combined genotype against the entropy when we know either one alone. If the combined genotype has more information than the most informative marginal genotype (i.e. less entropy than the marginal with less entropy), then there must be some interaction.
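Here is a small Python sketch of that comparison, estimated directly from data. It follows the description in this post, not necessarily the exact statistic in the paper, and the toy data set (a pure XOR-style interaction between two binary markers) is invented for illustration.

```python
import math
from collections import defaultdict


def entropy(p):
    """Binary entropy in bits for a disease fraction p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def conditional_entropy(keys, status):
    """Entropy of disease status within each key group, weighted by group size."""
    groups = defaultdict(list)
    for key, s in zip(keys, status):
        groups[key].append(s)
    n = len(status)
    return sum(len(g) / n * entropy(sum(g) / len(g)) for g in groups.values())


def interaction_gain(g1, g2, status):
    """Entropy of the best single marker minus entropy of the combined genotype."""
    joint = conditional_entropy(list(zip(g1, g2)), status)
    best_marginal = min(conditional_entropy(g1, status),
                        conditional_entropy(g2, status))
    return best_marginal - joint


# Invented toy data: 100 individuals, two binary markers.
g1 = [0, 0, 1, 1] * 25
g2 = [0, 1, 0, 1] * 25
xor_status = [0, 1, 1, 0] * 25     # disease only when the markers differ
single_status = [0, 0, 1, 1] * 25  # disease determined by marker 1 alone

xor_gain = interaction_gain(g1, g2, xor_status)        # 1 bit
single_gain = interaction_gain(g1, g2, single_status)  # 0 bits
```

In the XOR case neither marker says anything on its own, but the pair determines the disease status completely, so the gain is a full bit. When one marker explains everything, adding the second gains nothing.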

It is as simple as that. I am a bit surprised someone hasn’t done this ages ago.

Of course, there are some serious limitations with the method that might explain this.

First of all, there is the problem with distinguishing between random signals and true signals.  Even with no interaction, there is not exactly zero information gain in knowing the pairwise genotypes.  By chance, there will be different values, and you need to know when the information gain is significant.  To figure this out, they use a permutation test.  They resample from their data and that way figure out the distribution of information gain when there is no real association, and from that they can figure out how significant the information gain is.
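A minimal Python sketch of such a permutation test follows. I am assuming the resampling is done by shuffling the disease labels among individuals, which is the usual way to do it but is my assumption about the details; the statistic is the entropy comparison described above, and all names and the toy data are invented.

```python
import math
import random
from collections import defaultdict


def entropy(p):
    """Binary entropy in bits for a disease fraction p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def conditional_entropy(keys, status):
    """Entropy of disease status within each key group, weighted by group size."""
    groups = defaultdict(list)
    for key, s in zip(keys, status):
        groups[key].append(s)
    n = len(status)
    return sum(len(g) / n * entropy(sum(g) / len(g)) for g in groups.values())


def interaction_gain(g1, g2, status):
    """Entropy of the best single marker minus entropy of the combined genotype."""
    joint = conditional_entropy(list(zip(g1, g2)), status)
    return min(conditional_entropy(g1, status),
               conditional_entropy(g2, status)) - joint


def permutation_p_value(g1, g2, status, n_perm=200, seed=0):
    """Fraction of label shufflings with a gain at least as large as observed."""
    rng = random.Random(seed)
    observed = interaction_gain(g1, g2, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype/disease link
        if interaction_gain(g1, g2, shuffled) >= observed:
            hits += 1
    # Add-one correction: the observed data counts as one permutation.
    return (hits + 1) / (n_perm + 1)


# Invented toy data with a pure XOR-style interaction.
g1 = [0, 0, 1, 1] * 25
g2 = [0, 1, 0, 1] * 25
status = [0, 1, 1, 0] * 25
p_value = permutation_p_value(g1, g2, status)
```

Note that the smallest p-value this can ever report is 1 / (n_perm + 1), however strong the signal is.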

The problem with this is that it can be very slow.  The more significant an event needs to be before you trust it, the more you have to sample.  If you need an event to happen less than 1 in 100, you need to sample at least 100 times without seeing it to conclude that. If we need the event to occur less than once in 10,000 we must sample 10,000 times.

Still, the method is fine for detecting interaction for two given markers, but the typical situation is, of course, that you have a lot of markers and you want to figure out which are interacting. To figure that out, we need to test them all, at least unless we can rule some pairs out somehow.  With N markers, there are N(N-1)/2 pairs.  If you are looking genome wide, N would be around 500,000 which would give you 124,999,750,000 pairs.  That is a lot of pairs.

It is probably a problem for all methods to test that many pairs — at least I cannot think of any method that wouldn’t choke on it — so to be fair let us assume that we have reduced it to just one million pairs.  Then you are performing one million tests.  Now we run into the multiple testing problem. If an event happens once in ten thousand by chance, it will happen about a hundred times in a million tests.

Now we see the problem with determining significance using a permutation test.  We need to correct for multiple tests, and a lot of multiple tests, so we need the events we consider significant to be very rare indeed.  This means that we need to sample very many times to determine that an event is significant.  This can be a very serious limitation with this method.
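To put rough numbers on this, here is a back-of-the-envelope calculation in Python. The Bonferroni correction and the family-wise level of 0.05 are my assumptions for illustration; nothing above fixes a particular correction. Exact fractions are used to avoid floating point rounding.

```python
import math
from fractions import Fraction


def n_pairs(n_markers):
    """Number of unordered marker pairs, N(N-1)/2."""
    return n_markers * (n_markers - 1) // 2


def expected_chance_hits(n_tests, per_test_rate):
    """Expected number of tests that look significant by chance alone."""
    return n_tests * per_test_rate


def min_permutations(n_tests, family_alpha=Fraction(1, 20)):
    """Permutations per test needed to resolve a Bonferroni-corrected threshold."""
    per_test_alpha = Fraction(family_alpha) / n_tests
    return math.ceil(1 / per_test_alpha)


genome_wide_pairs = n_pairs(500_000)                           # 124,999,750,000
chance_hits = expected_chance_hits(10**6, Fraction(1, 10**4))  # 100
per_test = min_permutations(10**6)                             # 20,000,000
total = per_test * 10**6                                       # 2 * 10**13
```

Under these assumptions, even the reduced problem of one million pairs needs at least twenty million permutations per pair, or twenty trillion evaluations of the test statistic in total.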

It might be possible to determine significance some other way, in which case the interaction test in this paper could be useful for finding gene-gene interaction, but I am sceptical as long as it relies on a permutation test…


Dong, C., Chu, X., Wang, Y., Wang, Y., Jin, L., Shi, T., Huang, W., Li, Y. (2008). Exploration of gene-gene interaction effects using entropy-based methods. European Journal of Human Genetics, 16(2), 229-235. DOI: 10.1038/sj.ejhg.5201921

Well that’s a relief

Monday, March 24th, 2008

I was worried about my beer consumption’s effect on my scientific productivity, especially today when I am slightly hung over from sampling various Easter Ales yesterday evening.

To my relief, I then see this re-analysis of the data:

But as I began to think further on the subject (and enjoy a fine Pale Ale to settle me down), I realized I was making two cardinal mistakes in my approach to this startling scientific development: 1) I trusted my limited anecdotal evidence over a statistically valid scientific study, and 2) I based my understanding of the science on a journalist’s description of a technical paper. Recognizing my initial flaws, I moved on to a smooth and especially bitter IPA and got on the internet.

 [snip]

First, there was the common mistake of confusing correlation with causation. The author implied that increased beer drinking caused reduced scientific output. An equally likely explanation is that poor performance in one’s chosen career (in this case ornithology) led to increased beer drinking (and after all, the subjects live in a country with the world’s highest per capita beer consumption). Alternatively, a third, unmeasured factor could be leading to both poor job performance and higher beer consumption (a nagging spouse, for example).

[snip]

But it was while I was switching to a magnificent Pacific Northwest microbrew porter that I saw the real problem. Looking at the graph of the 34 data points, it was clear that the entire correlation was caused by the five lowest-output scientists. Without those five data points, the remaining 29 – showing a wide range of scientific output and beer consumption habits – exhibited absolutely no correlation. Thus, the entire study came down to only one conclusion: the five worst ornithologists in the Czech Republic drank a lot of beer.

Now I’m feeling a lot better! Since the hangover is also almost gone, drowned in strong coffee, I think I’ll get started with today’s work.  I’m planning on reading Julian Faraway’s Extending the Linear Model with R, having completed Linear Models with R only a few weeks ago. (Incidentally, linear models are behind both this study and the refutation!)

I’ll complete this post with the final quote from the post above:

In the end, though, I was pleased to see that careful reading and analysis of the original published work led to an easy debunking of the silly notion reported in the press that somehow beer drinking was bad for scientific performance. With the reputation of beer-loving scientists restored to its rightful glory, I sat back and sipped my double-chocolate stout. Ah, the life of a Gentleman Scientist.

How to interpret a genome wide association study

Sunday, March 23rd, 2008

I just spotted this nice review of genome wide association studies:

How to Interpret a Genome-wide Association Study
Thomas A. Pearson and Teri A. Manolio
Journal of the American Medical Association 2008;299(11):1335-1344.

It is worth a read.

Heads or tails and reliable alignments

Sunday, March 23rd, 2008

I have on several occasions written about the uncertainty inherent in inferred alignments and how this is a potential problem. I hadn’t really thought it would be quite as serious as the results in the paper I just read suggest:

Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments
Giddy Landan and Dan Graur
Molecular Biology and Evolution 2007 24(6):1380-1383; doi:10.1093/molbev/msm060

Abstract

The question of multiple sequence alignment quality has received much attention from developers of alignment methods. Less forthcoming, however, are practical measures for addressing alignment quality issues in real life settings. Here, we present a simple methodology to help identify and quantify the uncertainties in multiple sequence alignments and their effects on subsequent analyses. The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such “ideal” alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol. We demonstrate the utility of HoT for phylogenetic reconstruction for the case of 130 sequences belonging to the chemoreceptor superfamily in Drosophila melanogaster, and by analysis of the BaliBASE alignment database. Surprisingly, Neighbor-Joining methods of phylogenetic reconstruction turned out to be less affected by alignment errors than maximum likelihood and Bayesian methods.

In this paper they analyse the quality of multiple sequence alignments in an extremely simple manner: They first align the sequences left to right, then reverse them to essentially align them right to left. Unless the alignment algorithm has a preferred order of symbols, you’d expect to get the same alignment going left to right as right to left.

Not always, of course: if the algorithm is based on oligonucleotides or such, then the order matters, but in many cases it doesn’t.

Comparing head and tail alignments

When the order shouldn’t matter, the left-to-right and right-to-left alignments (head and tail alignments in the paper) should be similar, so comparing them should give an indication of how much faith you can have in the inferred alignment.
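The check itself is easy to sketch in Python. In the sketch below, `align` is any function that takes a list of sequences and returns equal-length gapped rows; the `pad_right` aligner is a deliberately silly stand-in I made up so the example runs, not a real alignment tool, and the column comparison is my reading of "identical columns", not necessarily the paper's exact definition.

```python
def columns(alignment):
    """Describe each column by the residue index it holds in every row (None = gap)."""
    counters = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            if row[j] == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def column_agreement(heads, tails):
    """Fraction of columns in the heads alignment that also occur in the tails one."""
    heads_cols = columns(heads)
    tails_cols = set(columns(tails))
    return sum(c in tails_cols for c in heads_cols) / len(heads_cols)


def heads_or_tails(seqs, align):
    """Align the sequences as given and reversed, then compare the two results."""
    heads = align(seqs)
    tails_reversed = align([s[::-1] for s in seqs])
    # Reverse the rows back so residue positions match the original sequences.
    tails = [row[::-1] for row in tails_reversed]
    return column_agreement(heads, tails)


def pad_right(seqs):
    """Toy stand-in aligner (hypothetical): pad shorter sequences with gaps at the end."""
    width = max(len(s) for s in seqs)
    return [s + "-" * (width - len(s)) for s in seqs]


same_length = heads_or_tails(["ACGT", "ACGT"], pad_right)  # full agreement
one_shorter = heads_or_tails(["ACGT", "ACT"], pad_right)   # no shared column
```

With a real aligner plugged in as `align`, the returned fraction plays the role of the Columns score in the paper's comparison.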

They try this out on a family of 130 amino acid sequences of length around 400 using three different alignment tools. This is the result:

                ClustalW   MUSCLE   ProbCons
Columns         18.0%      8.7%     6.7%
Residue pairs   52.1%      53.7%    60.8%
Shared splits   64.6%      65.4%    59.1%

Here Columns denotes the fraction of identical columns in the two alignments, Residue pairs denotes the fraction of aligned residue pairs (in a “sum of pairs” kind of way) that are identical, and Shared splits denotes the fraction of identical splits (edges) in the BioNJ trees inferred from the two alignments.

Very few alignment columns are shared between the two alignments, but that is not that much of a problem. With 130 sequences you wouldn’t expect to match many columns exactly. I’m more surprised that the resulting pairwise alignments (the pairwise alignments you get by extracting two rows from the alignment, whose identity is given in Residue pairs) were so different.

It is also a bit shocking that the trees inferred from the two alignments were so different.

What is causing this?

There is uncertainty in inferring alignments, but why would the same algorithm give different results when running left-to-right compared to right-to-left?

As far as I can see, there are two different things going on here: one has to do with there being more than one optimal alignment (also discussed in the paper), and one has to do with heuristics in searching for optimal alignments.

When there is more than one optimal alignment (which is often the case), even algorithms guaranteed to find an optimal alignment will give you an arbitrary one (though usually a deterministic arbitrary choice). The arbitrary choice can easily differ between running left-to-right and right-to-left.
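A tiny pairwise example shows the effect. The sketch below is a plain Needleman-Wunsch global alignment in Python with a scoring scheme and a traceback tie-break that I picked for illustration (match +1, mismatch -1, gap -1; prefer a match/mismatch step, then a gap in the second sequence, then a gap in the first). It is not the algorithm used by any of the tools in the paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (row_a, row_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback with a fixed, deterministic tie-break.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


# Heads: align the sequences as given.
heads_a, heads_b, heads_score = needleman_wunsch("AAG", "AG")

# Tails: align the reversed sequences, then reverse the rows back.
rev_a, rev_b, tails_score = needleman_wunsch("AAG"[::-1], "AG"[::-1])
tails_a, tails_b = rev_a[::-1], rev_b[::-1]

# heads: AAG / -AG    tails: AAG / A-G    (both have score 1)
```

Both alignments are optimal with the same score, but the gap ends up opposite a different A depending on the direction, purely because of how ties are broken in the traceback.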

For multiple sequence alignments, it is computationally infeasible to guarantee an optimal alignment, and heuristics are used to search for (near or locally) optimal alignments. This is often some variation on a greedy strategy, and each choice there will potentially lead to a different alignment. It is easy to see how left-to-right and right-to-left alignments can be different with such a strategy.

In any case, the take home message is, once again: don’t trust alignments!


Landan, G., Graur, D. (2007). Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments. Molecular Biology and Evolution, 24(6), 1380-1383. DOI: 10.1093/molbev/msm060