Heads or tails and reliable alignments

I have written on several occasions about the uncertainty inherent in inferred alignments and how this is a potential problem. I hadn’t thought it would be quite as serious as the results in the paper I just read suggest:

Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments
Giddy Landan and Dan Graur
Molecular Biology and Evolution 2007 24(6):1380-1383; doi:10.1093/molbev/msm060


The question of multiple sequence alignment quality has received much attention from developers of alignment methods. Less forthcoming, however, are practical measures for addressing alignment quality issues in real life settings. Here, we present a simple methodology to help identify and quantify the uncertainties in multiple sequence alignments and their effects on subsequent analyses. The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such “ideal” alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol. We demonstrate the utility of HoT for phylogenetic reconstruction for the case of 130 sequences belonging to the chemoreceptor superfamily in Drosophila melanogaster, and by analysis of the BaliBASE alignment database. Surprisingly, Neighbor-Joining methods of phylogenetic reconstruction turned out to be less affected by alignment errors than maximum likelihood and Bayesian methods.

In this paper they analyse the quality of multiple sequence alignments in an extremely simple manner: they first align the sequences left to right, then reverse the sequences and align them right to left. Unless the alignment algorithm has a preferred direction, you’d expect to get the same alignment going left to right as right to left.

Not always, of course: if the algorithm is based on oligonucleotides or such, then the order matters, but in many cases it doesn’t.
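The heads-or-tails check itself is trivial to script around any aligner. Here is a minimal sketch of how I understand the procedure; the `align` callback and the toy padding "aligner" below are my own illustrations, not code from the paper:

```python
def hot_alignments(seqs, align):
    """Compute the 'heads' and 'tails' alignments of the HoT check.

    `align` is any function taking a list of (ungapped) sequences and
    returning the aligned rows in the same order.
    """
    heads = align(seqs)
    # Align the reversed sequences, then reverse the resulting rows back,
    # so the two alignments are directly comparable, column for column.
    tails = [row[::-1] for row in align([s[::-1] for s in seqs])]
    return heads, tails


# A deliberately direction-dependent toy "aligner": pad with trailing gaps.
pad = lambda seqs: [s + "-" * (max(map(len, seqs)) - len(s)) for s in seqs]

heads, tails = hot_alignments(["ACGT", "ACG"], pad)
print(heads)  # ['ACGT', 'ACG-']
print(tails)  # ['ACGT', '-ACG']
```

For a truly orientation-independent aligner, heads and tails would be identical; the paper’s point is that in practice they rarely are.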

Comparing head and tail alignments

When the order shouldn’t matter, the left-to-right and right-to-left alignments (head and tail alignments in the paper) should be similar, so comparing them should give an indication of how much faith you can have in the inferred alignment.

They try this out on a family of 130 amino acid sequences of length around 400 using three different alignment tools. This is the result:

               ClustalW  MUSCLE  ProbCons
Columns           18.0%    8.7%     6.7%
Residue pairs     52.1%   53.7%    60.8%
Shared splits     64.6%   65.4%    59.1%

Here Columns denotes the fraction of identical columns in the two alignments, Residue pairs denotes the fraction of aligned residue pairs (in a “sum of pairs” kind of way) that are identical, and Shared splits denotes the fraction of identical splits (edges) in BioNJ trees inferred from the two alignments.

Very few alignment columns are shared between the two alignments, but that is not much of a problem: with 130 sequences you wouldn’t expect many columns to match exactly. I’m more surprised that the induced pairwise alignments (the alignments you get by extracting two rows from the multiple alignment, measured by Residue pairs) were so different. It is also a bit shocking that the trees inferred from the two alignments differed so much.
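The Residue pairs measure is easy to compute yourself. The following is my own rough sketch of that comparison (not the authors’ code); I score agreement as intersection over union of the two aligned-pair sets, which may differ slightly from the paper’s exact normalisation:

```python
from itertools import combinations

def residue_pairs(msa):
    """The set of aligned residue pairs (row_i, pos_i, row_j, pos_j)
    implied by an MSA, given as a list of equal-length gapped rows."""
    counts = [-1] * len(msa)  # ungapped position counter per row
    pairs = set()
    for col in zip(*msa):
        placed = []
        for row, ch in enumerate(col):
            if ch != "-":
                counts[row] += 1
                placed.append((row, counts[row]))
        # every pair of residues sharing a column is an aligned pair
        for (r1, p1), (r2, p2) in combinations(placed, 2):
            pairs.add((r1, p1, r2, p2))
    return pairs

def hot_pair_agreement(heads, tails):
    """Fraction of aligned residue pairs shared by the two alignments."""
    hp, tp = residue_pairs(heads), residue_pairs(tails)
    return len(hp & tp) / len(hp | tp)

print(hot_pair_agreement(["AAA", "-AA"], ["AAA", "AA-"]))  # 0.0
```

Even this tiny example shows how moving a single gap can destroy all pairwise agreement between two alignments of the same sequences.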

What is causing this?

There is uncertainty in inferring alignments, but why would the same algorithm give different results when running left-to-right compared to right-to-left?

As far as I can see, there are two different things going on here: one has to do with there being more than one optimal alignment (also discussed in the paper), and the other with the heuristics used when searching for optimal alignments.

When there is more than one optimal alignment (which is often the case), even an algorithm guaranteed to find an optimal alignment will give you an arbitrary one (though usually a deterministic arbitrary choice). That arbitrary choice can easily differ between running left-to-right and right-to-left.
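A toy example of this effect: the Needleman-Wunsch sketch below (my own code, with assumed scores of +1 for a match, −1 for a mismatch or gap, and ties broken by preferring diagonal, then up, then left in the traceback) is guaranteed to return an optimal pairwise alignment, yet reversing the input moves the gap:

```python
def nw(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with deterministic tie-breaking."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback; ties broken diagonal > up > left.
    i, j, ra, rb = n, m, [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return "".join(reversed(ra)), "".join(reversed(rb))

heads = nw("AAA", "AA")
tails = tuple(s[::-1] for s in nw("AAA"[::-1], "AA"[::-1]))
print(heads)  # ('AAA', '-AA')
print(tails)  # ('AAA', 'AA-')
```

Both alignments have the same optimal score; the deterministic tie-breaking simply resolves differently once the sequences are reversed.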

For multiple sequence alignments, it is computationally infeasible to guarantee finding an optimal alignment, so heuristics are used to search for (near or locally) optimal alignments. These are often some variation on a greedy strategy, and each greedy choice can potentially lead to a different alignment. It is easy to see how left-to-right and right-to-left alignments can differ with such a strategy.

In any case, the take home message is, once again: don’t trust alignments!

Landan, G., Graur, D. (2007). Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments. Molecular Biology and Evolution, 24(6), 1380-1383. DOI: 10.1093/molbev/msm060

Author: Thomas Mailund

My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.

3 thoughts on “Heads or tails and reliable alignments”

  1. In my alignment program, Ngila, I made choices about how to break ties to give a “pretty” alignment. IIRC, I chose to shift gaps as far right as they can go, which means reversing the sequences will give you a different alignment.

    However, I’m not a big fan of relying too much on “optimal” alignments. I’d rather see more approaches treat alignments as hidden data and try to optimize around their set.

  2. My feeling is that we should ignore the alignment altogether (in most cases) and focus on the analysis we really want to do. In most cases, we use the alignment as a step towards some other analysis, and I would be much happier to consider the alignment missing data and (to the extent possible) just integrate it out of the equation.

Leave a Reply