Here's yet another paper addressing the uncertainty in inferred alignments that is typically ignored when doing comparative genomics. For two others, see my reviews: Alignment bias in genomics and Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes.
Lunter et al.
Genome Res. 18:298-309, 2008
Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.
The paper itself actually has a funny story behind it that I witnessed when I worked in Oxford, but I'll keep that story out of here. Those who know it will nod -- or shake their head, as the case might be -- and those who do not should probably hear it from the authors rather than me ;-)
The problem with alignments
Most comparative genomic analyses rely on having an alignment between the genomes being compared. The problem is that we never have such an alignment; we have to infer it. For highly similar sequences, this is not much of a problem, but even for relatively closely related species -- such as men and mice -- only the conserved bits of the genome are really closely related. We are relatively good at aligning genes, but to really analyse genomes, we cannot rely on the genes alone. This is especially true when we want to infer, for example, divergence times, where we want to look at neutrally evolving sites; genes give us a biased sample, since there is usually selection against changes there.
If we align genomes anyway, we need to take into account the uncertainty there is in the alignment, but we typically don't! Once we have inferred an alignment, we treat it as absolute truth. For any other parameter we infer, we are expected to report the uncertainty of the estimate together with the estimate itself, but for alignments we do not.
Probably because this is a lot more difficult to do, but still, completely ignoring the problem just because it is difficult is probably not the way to go.
It might not be such a big problem if the errors in alignment were unbiased, and we based our further inference on large alignments (and thus a large number of alignment columns), but it seems like there is a certain bias in most alignment algorithms.
The source of this bias is the approach underlying most (if not all) alignment algorithms: optimising some alignment score (or minimising some alignment penalty). Searching for an "optimal" alignment typically means finding an alignment with as few changes as possible -- with varying definitions of "few changes" -- and this strategy will tend to infer alignments with fewer indels than the true alignment contains.
Lunter et al. consider the case of pairwise alignment, and identify the typical alignment biases (essentially the same biases identified in Lunter 2007). These are shown on the left, where the left-hand side shows the true alignment and the right-hand side the alignment that will typically be inferred. In the two top-most cases, the inferred alignment places the indels incorrectly because (A) moving the indel aligns columns with higher conservation, or (B) two independent indels can be replaced by a single longer indel. In the two remaining cases, the indels are misplaced because the resulting alignment introduces fewer gaps.
There are two expected consequences of these biases: alignment accuracy decreases close to indels, and indels tend to be merged when they are near each other. At the same time, the proportion of identities (columns with no substitutions) increases near indels. In a simulation study, Lunter et al. demonstrate that this is indeed the case. The figure on the left shows the accuracy and proportional sequence identity as a function of the distance to the nearest gap (A) and the distribution of inter-gap lengths (B).
Fixing the problem
Using statistical alignment methods (an application of hidden Markov models), it is possible to capture not only the optimal alignment -- the maximum likelihood alignment, in this case -- but also the uncertainty in the inferred alignment. Using a technique called "posterior decoding", it is possible to assign to each alignment column the probability that it is correct. This way, problematic regions of an alignment can be identified.
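To make this concrete, here is a minimal sketch of posterior decoding in a three-state pair HMM (a match state plus two gap states), using the forward-backward algorithm to compute, for every pair (i, j), the posterior probability that x[i] is aligned to y[j]. This is my own illustration, not the paper's code, and the parameters are made-up toy values rather than the human-mouse model from the paper:

```python
import numpy as np

# Toy pair-HMM parameters (illustrative values, not the paper's human-mouse model).
DELTA, EPSILON = 0.05, 0.4        # gap-open and gap-extend probabilities
P_MATCH, P_MISMATCH = 0.22, 0.01  # joint emission probabilities in the match state
Q = 0.25                          # background emission probability in the gap states

T_MM, T_MX = 1 - 2 * DELTA, DELTA  # transitions out of M (M->X and M->Y symmetric)
T_XX, T_XM = EPSILON, 1 - EPSILON  # transitions out of X (Y symmetric, no X<->Y)

def emit(a, b):
    return P_MATCH if a == b else P_MISMATCH

def posterior_match_probs(x, y):
    """Forward-backward over states M (match), X (gap in y), Y (gap in x).
    Returns pm with pm[i, j] = P(x[i] is aligned to y[j] | x, y)."""
    n, m = len(x), len(y)
    fM, fX, fY = (np.zeros((n + 1, m + 1)) for _ in range(3))
    fM[0, 0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                fM[i, j] = emit(x[i-1], y[j-1]) * (
                    T_MM * fM[i-1, j-1] + T_XM * (fX[i-1, j-1] + fY[i-1, j-1]))
            if i:
                fX[i, j] = Q * (T_MX * fM[i-1, j] + T_XX * fX[i-1, j])
            if j:
                fY[i, j] = Q * (T_MX * fM[i, j-1] + T_XX * fY[i, j-1])
    bM, bX, bY = (np.zeros((n + 1, m + 1)) for _ in range(3))
    bM[n, m] = bX[n, m] = bY[n, m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            to_M = emit(x[i], y[j]) * bM[i+1, j+1] if i < n and j < m else 0.0
            to_X = Q * bX[i+1, j] if i < n else 0.0
            to_Y = Q * bY[i, j+1] if j < m else 0.0
            bM[i, j] = T_MM * to_M + T_MX * (to_X + to_Y)
            bX[i, j] = T_XM * to_M + T_XX * to_X
            bY[i, j] = T_XM * to_M + T_XX * to_Y
    z = fM[n, m] + fX[n, m] + fY[n, m]   # total likelihood P(x, y)
    return fM[1:, 1:] * bM[1:, 1:] / z

pm = posterior_match_probs("ACGTACGT", "ACGACGT")
print(np.round(pm, 2))  # rows/columns with no high entry flag uncertain regions
```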
Posterior decoding can do more than annotate an existing alignment: it can also tell us the probability that any particular pair of nucleotides should be aligned, implicitly summing over the set of all possible alignments in which that particular pair is considered homologous. From this information, it is possible to construct an alignment, by selecting the alignment that maximises the product of the probabilities assigned to the individual columns. This approach differs from the maximum likelihood alignment by not considering the transition probabilities in the underlying hidden Markov model, but it can produce better alignments, in the sense that they match the true alignment more closely.
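Given such posterior matrices, constructing this alignment is a standard dynamic program -- structurally the same as Needleman-Wunsch, but scoring each column by its log posterior. A sketch under the same caveats as above; besides the match posteriors pm, it assumes per-position gap posteriors gx and gy, which can be obtained from the previous sketch's X and Y states in the same way pm is obtained from M (gx = (fX * bX / z)[1:, :] and gy = (fY * bY / z)[:, 1:]):

```python
import numpy as np

def decode_alignment(pm, gx, gy):
    """Find the alignment maximising the product of per-column posteriors.
    pm[i, j]: posterior that x[i] aligns to y[j] (shape (n, m));
    gx[i, j]: posterior that x[i] is gapped after j symbols of y (shape (n, m+1));
    gy[i, j]: posterior that y[j] is gapped after i symbols of x (shape (n+1, m)).
    Works in log space; returns columns as (i, j), (i, None) or (None, j)."""
    n, m = pm.shape
    tiny = 1e-300                                # avoid log(0)
    A = np.full((n + 1, m + 1), -np.inf)
    A[0, 0] = 0.0
    move = np.zeros((n + 1, m + 1), dtype=int)   # 1 = match, 2 = gap in y, 3 = gap in x
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                cand = A[i-1, j-1] + np.log(pm[i-1, j-1] + tiny)
                if cand > A[i, j]:
                    A[i, j], move[i, j] = cand, 1
            if i:
                cand = A[i-1, j] + np.log(gx[i-1, j] + tiny)
                if cand > A[i, j]:
                    A[i, j], move[i, j] = cand, 2
            if j:
                cand = A[i, j-1] + np.log(gy[i, j-1] + tiny)
                if cand > A[i, j]:
                    A[i, j], move[i, j] = cand, 3
    columns, i, j = [], n, m                     # trace back the best path
    while i or j:
        if move[i, j] == 1:
            columns.append((i - 1, j - 1)); i, j = i - 1, j - 1
        elif move[i, j] == 2:
            columns.append((i - 1, None)); i -= 1
        else:
            columns.append((None, j - 1)); j -= 1
    return columns[::-1]
```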
Lunter et al. expand on this idea by changing the posterior probability used for aligning nucleotides to gaps. Instead of weighting a gap column with the probability that the nucleotide aligns to a gap at that particular position, they weight it with the probability that it aligns to any gap. The alignment is then constructed in the same way as in the posterior decoding algorithm.
The intuition is that near gaps, all posteriors are low (compared with nucleotides well away from gaps), but with this re-weighting, a nucleotide becomes more likely to be aligned against a gap when it really should align to a gap.
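In terms of the sketches above, this amounts to a small preprocessing step before running the same decoding dynamic program: replace each position-specific gap posterior with the marginal probability that the nucleotide is unaligned at all. Again a hedged paraphrase of the idea, not the authors' implementation:

```python
import numpy as np

def marginalise_gap_posteriors(pm, gx, gy):
    """Marginalized posterior decoding reweighting: score a gap column by the
    probability that the nucleotide aligns to *any* gap, i.e. one minus the
    probability that it aligns to some nucleotide in the other sequence."""
    mgx = 1.0 - pm.sum(axis=1)   # P(x[i] is unaligned), one value per i
    mgy = 1.0 - pm.sum(axis=0)   # P(y[j] is unaligned), one value per j
    # Broadcast the marginals back to the per-position shapes the DP expects.
    gx_m = np.repeat(mgx[:, None], gx.shape[1], axis=1)
    gy_m = np.repeat(mgy[None, :], gy.shape[0], axis=0)
    return gx_m, gy_m

# mpd_alignment = decode_alignment(pm, *marginalise_gap_posteriors(pm, gx, gy))
```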
They then show that this change improves the alignment by increasing the sensitivity (S) -- the ratio of correctly aligned columns to all homologous columns -- while reducing both the false-positive fraction (FPF) -- wrongly aligned nucleotides over all non-gapped columns -- and the non-homologous fraction -- the fraction of aligned columns that are not truly homologous. The figure on the left compares the maximum likelihood alignment (calculated by the Viterbi algorithm) with the posterior decoding algorithm and their new marginalized posterior decoding algorithm.
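For concreteness, here is a small sketch of how these three measures can be computed when alignments are represented as sets of aligned nucleotide pairs. The exact split between "wrongly aligned" and "non-homologous" pairs is my reading of the descriptions above, not the paper's code:

```python
def alignment_metrics(inferred, true):
    """Sensitivity, false-positive fraction and non-homologous fraction for an
    inferred and a true alignment, each given as a set of (i, j) pairs."""
    true_x = {i for i, j in true}   # x nucleotides that have a true partner
    true_y = {j for i, j in true}   # y nucleotides that have a true partner
    sensitivity = len(inferred & true) / len(true)
    # A wrongly aligned pair joins nucleotides that have other true partners;
    # a non-homologous pair involves a nucleotide with no true partner at all.
    wrong = sum(1 for i, j in inferred - true if i in true_x or j in true_y)
    nonhom = sum(1 for i, j in inferred - true if i not in true_x and j not in true_y)
    return sensitivity, wrong / len(inferred), nonhom / len(inferred)

# Toy example: one misplaced pair, one pair involving a truly gapped nucleotide.
print(alignment_metrics({(0, 0), (1, 2), (3, 3)}, {(0, 0), (1, 1), (2, 2)}))
```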
They also compare with other popular alignment algorithms and show improvements, especially as measured by sensitivity and the non-homologous fraction (figure on the left). The comparison is slightly misleading, since the statistical model used with the Viterbi algorithm is simpler than the one used with marginalized posterior decoding, but the paper shows that the real gain in accuracy is due to the algorithm and not to the underlying model:
We found that more accurate modelling resulted in only very marginal improvements of the alignment accuracy. Indeed, in our simulation study of sequences at human–mouse divergence, the modeling of indel lengths using a mixed geometric distribution resulted in the single largest improvement in sensitivity, from 85.3% to 85.6% using Viterbi decoding, and from 87.8% to 88.2% using MPD. The geometric mixture model helps to align sequences across large indels, which are relatively infrequent, explaining the relatively modest improvement. Modeling the variation in GC content reduces the false-positive fraction (from 15.2% to 13.6% using MPD), but has little effect on sensitivity. Surprisingly, accurate modeling of indel and substitution rate variation has little, if any, effect. This robustness to misparameterization is supported by our simulations under the Jukes–Cantor model, where substantial variations in the rate parameters resulted in very little difference.
What is the consequence of the biases introduced by trusting incorrect alignments?
It is not completely obvious to me.
If we move indels around to achieve higher sequence similarity, we end up underestimating the number of substitutions, of course, which means we will tend to underestimate divergence times. The size of the effect depends, naturally, on the number of indels between the sequences, since the bias only shows up close to indels; if the alignment mainly consists of nucleotides well away from gaps, the effect is small. That, however, essentially means closely related sequences.
Improving the inferred alignment, using methods such as those introduced here, helps, of course, but we are still in the situation where we infer an alignment and then treat it as "truth" in the downstream analysis.
It seems to me that we would be better off carrying the uncertainty over to the downstream analysis, either by incorporating parameter estimation and the like into the statistical alignment algorithms, or by weighting alignment columns by their posterior probability in the downstream analysis.
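As a toy illustration of the second option, here is a hedged sketch of a posterior-weighted Jukes-Cantor distance: each aligned column contributes to the mismatch count in proportion to its posterior probability of being correct, so uncertain columns near gaps are down-weighted rather than trusted blindly. This is my own example of the idea, not a method from the paper:

```python
import numpy as np

def weighted_jukes_cantor(x, y, columns, pm):
    """Jukes-Cantor distance with each match column (i, j) weighted by its
    posterior probability pm[i, j]; gap columns are ignored. A toy sketch of
    posterior-weighted downstream inference, not a method from the paper."""
    pairs = [(i, j) for i, j in columns if i is not None and j is not None]
    weights = np.array([pm[i, j] for i, j in pairs])
    mismatches = np.array([x[i] != y[j] for i, j in pairs], dtype=float)
    p = (weights * mismatches).sum() / weights.sum()  # weighted mismatch fraction
    return -0.75 * np.log(1 - 4 * p / 3)              # Jukes-Cantor correction

# e.g. weighted_jukes_cantor(x, y, decode_alignment(pm, gx, gy), pm)
```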
Details left to the reader, of course ;-)
Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A., Hein, J. (2008). Uncertainty in homology inferences: Assessing and improving genomic sequence alignment. Genome Research, 18(2), 298-309. DOI: 10.1101/gr.6725608