Posts Tagged ‘alignment’

This week in the blogs

Sunday, January 25th, 2009

Well, everyone else seems to summarise the posts they found interesting during the week, so it is only fair that I get to as well.  Even with my New Year’s resolution of posting, on average, a post per day, I cannot cover all the posts I find interesting, so it also gives me an opportunity to simply list a lot of links and perhaps group related posts so you have a chance of reading them together.

In this first installment, though, I’m going to go back a little further into this month, since I collected a few interesting links there. Anyway, here goes:

Genetics

  1. Sequences from first settlers reveal rapid evolution in Icelandic mtDNA pool (PLoS Genetics)
    1. Genetic variation in space & time – Iceland (Gene Expression)
    2. The genetic history of Iceland (Genetic Future)
    3. Ancient DNA analysis of the Icelandic settlers (Me!)
    4. Genetic drift eliminated rare mtDNA haplotypes from Iceland (John Hawks)
    5. mtDNA selection in Iceland? (John Hawks)
  2. Pervasive Hitchhiking at coding and regulatory sites in humans (PLoS Genetics)
    1. Humans have adapted on genome-wide level? (Gene Expression)
    2. How much selection is going on in humans? (Me!)
  3. A genome-wide genetic signature of Jewish ancestry perfectly separates individuals with and without full Jewish ancestry in a large random sample of European Americans (Genome Biology)
    1. How Ashkenazi Jewish are you? (Gene Expression)
    2. Another paper on Ashkenazi Jewish distinctiveness (Dienekes)

Sequences and alignments

  1. Phylogenetic inference under recombination using Bayesian stochastic topology selection (Bioinformatics)
    1. Phylogenetic inference under recombination using Bayesian stochastic topology selection (Me!)
  2. The experts agree (Finchtalk)

Programming

  1. Dynamic languages: Not just for scripting any more (CIO)
  2. Emacs 23 (emacs-fu)

Teaching

  1. Making classes interactive: better learning or just more fun? (Discovering Biology in a Digital World)
  2. TeacherTube: YouTube for teachers (Discovering Biology in a Digital World)
  3. Students know what physicists believe, but they don’t agree: A study using the CLASS survey (Phys. Rev. ST Phys. Educ.)
    1. Students know what physicists believe, but they don’t agree (Uncertain Principles)

Peer reviewing

  1. How are the mighty fallen (Michael Nielsen)
  2. Three myths about scientific peer review (Michael Nielsen)

Heads or tails and reliable alignments

Sunday, March 23rd, 2008

I have on several occasions written about the uncertainty inherent in inferred alignments and how this is a potential problem. I hadn’t really thought it would be quite as serious as the results in the paper I just read suggest:

Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments
Giddy Landan and Dan Graur
Molecular Biology and Evolution 2007 24(6):1380-1383; doi:10.1093/molbev/msm060

Abstract

The question of multiple sequence alignment quality has received much attention from developers of alignment methods. Less forthcoming, however, are practical measures for addressing alignment quality issues in real life settings. Here, we present a simple methodology to help identify and quantify the uncertainties in multiple sequence alignments and their effects on subsequent analyses. The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such “ideal” alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol. We demonstrate the utility of HoT for phylogenetic reconstruction for the case of 130 sequences belonging to the chemoreceptor superfamily in Drosophila melanogaster, and by analysis of the BaliBASE alignment database. Surprisingly, Neighbor-Joining methods of phylogenetic reconstruction turned out to be less affected by alignment errors than maximum likelihood and Bayesian methods.

In this paper they analyse the quality of multiple sequence alignments in an extremely simple manner: They first align the sequences left to right, then reverse them to essentially align them right to left. Unless the alignment algorithm has a preferred order of symbols, you’d expect to get the same alignment going left to right as right to left.

Not always, of course: if the algorithm is based on oligonucleotides or such, then the order matters, but in many cases it doesn’t.

Comparing head and tail alignments

When the order shouldn’t matter, the left-to-right and right-to-left alignments (head and tail alignments in the paper) should be similar, so comparing them should give an indication of how much faith you can have in the inferred alignment.
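To make the procedure concrete, here is a minimal Python sketch of the head and tail comparison for a pair of sequences. It is not the code from the paper: the aligner is a plain Needleman-Wunsch implementation with made-up scores (match 1, mismatch -1, gap -1), and the names `needleman_wunsch` and `heads_and_tails` are only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment with a linear gap cost (illustrative scores).

    Ties between equally good moves are broken in a fixed order:
    diagonal first, then a gap in b, then a gap in a.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append('-'); i -= 1
        else:
            row_a.append('-'); row_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(row_a)), ''.join(reversed(row_b))


def heads_and_tails(a, b, align=needleman_wunsch):
    """Align as given (head), then align the reversed sequences and
    reverse the result back (tail)."""
    head = align(a, b)
    tail = tuple(row[::-1] for row in align(a[::-1], b[::-1]))
    return head, tail


head, tail = heads_and_tails("AAG", "AG")
print(head)  # ('AAG', '-AG')
print(tail)  # ('AAG', 'A-G')
```

For `AAG` and `AG` the head alignment puts the gap in the first column and the tail alignment puts it in the second, although both alignments have the same score. The only thing that changed is the direction in which the tie was broken.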

They try this out on a family of 130 amino acid sequences of length around 400 using three different alignment tools. This is the result:

                 ClustalW   MUSCLE   ProbCons
Columns            18.0%      8.7%      6.7%
Residue pairs      52.1%     53.7%     60.8%
Shared splits      64.6%     65.4%     59.1%

Here Columns denotes the fraction of identical columns in the two alignments, Residue pairs denotes the fraction of identical aligned pairs (in a “sum of pairs” kind of way), and Shared splits denotes the fraction of identical splits (edges) in the BioNJ trees inferred from the two alignments.

Very few alignment columns are shared between the two alignments, but that is not that much of a problem. With 130 sequences you wouldn’t expect to match many columns exactly. I’m more surprised that the resulting pairwise alignments (the pairwise alignments you get by extracting two rows from the alignment, the identity given in Residue pairs) were so different.

It is also a bit shocking that the trees inferred from the two alignments were so different.
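The first two measures can be sketched in a few lines of Python. This is only my reading of the definitions: I take the head alignment as the reference for the denominators, which may not be exactly how the paper normalises, and the function names are invented for the example.

```python
from itertools import combinations


def columns(alignment):
    """Turn alignment rows into columns of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for chars in zip(*alignment):
        col = []
        for k, ch in enumerate(chars):
            if ch == '-':
                col.append(None)
            else:
                col.append(counters[k])
                counters[k] += 1
        cols.append(tuple(col))
    return cols


def residue_pairs(alignment):
    """All pairs of residues, from two different rows, that share a column."""
    pairs = set()
    for col in columns(alignment):
        for s, t in combinations(range(len(col)), 2):
            if col[s] is not None and col[t] is not None:
                pairs.add((s, col[s], t, col[t]))
    return pairs


def column_agreement(head, tail):
    """Fraction of head columns that also occur in the tail alignment."""
    head_cols, tail_cols = set(columns(head)), set(columns(tail))
    return len(head_cols & tail_cols) / len(head_cols)


def pair_agreement(head, tail):
    """Fraction of aligned residue pairs in head that are also aligned in tail."""
    head_pairs, tail_pairs = residue_pairs(head), residue_pairs(tail)
    return len(head_pairs & tail_pairs) / len(head_pairs)


head = ("AAG", "-AG")
tail = ("AAG", "A-G")
print(column_agreement(head, tail))  # one of three columns is shared
print(pair_agreement(head, tail))    # 0.5
```

Both functions take the alignment as a tuple of equally long rows, so they work for any number of sequences, not just two.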

What is causing this?

There is uncertainty in inferring alignments, but why would the same algorithm give different results when running left-to-right compared to right-to-left?

As far as I can see, there are two different things going on here. One having to do with there being more than one optimal alignment (also discussed in the paper), and one having to do with heuristics in searching for optimal alignments.

When there is more than one optimal alignment (which is often the case), even algorithms guaranteed to find an optimal alignment will give you an arbitrary one (though usually a deterministic arbitrary choice). The arbitrary choice can easily differ between running left-to-right and right-to-left.
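How often ties occur is easy to check for pairwise alignments. The sketch below extends the usual dynamic programming recursion to count the optimal alignments as well as score them. The scoring scheme (match 1, mismatch -1, linear gap -1) is an assumption picked for the example, and alignments that differ only in the order of adjacent gap columns are counted separately.

```python
def count_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Return (optimal score, number of distinct optimal global alignments)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    count = [[0] * (m + 1) for _ in range(n + 1)]
    count[0][0] = 1
    for i in range(1, n + 1):
        score[i][0] = i * gap
        count[i][0] = 1
    for j in range(1, m + 1):
        score[0][j] = j * gap
        count[0][j] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + s, count[i - 1][j - 1]),  # (mis)match
                (score[i - 1][j] + gap, count[i - 1][j]),        # gap in b
                (score[i][j - 1] + gap, count[i][j - 1]),        # gap in a
            ]
            best = max(value for value, _ in options)
            score[i][j] = best
            # Every move that reaches the best score contributes its paths.
            count[i][j] = sum(c for value, c in options if value == best)
    return score[n][m], count[n][m]


print(count_optimal_alignments("AAG", "AG"))    # (1, 2)
print(count_optimal_alignments("ACGT", "ACGT"))  # (4, 1)
```

Even the three-letter example has two optimal alignments, and an aligner has to pick one of them.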

For multiple sequence alignments, it is computationally infeasible to guarantee an optimal alignment, so heuristics are used to search for (near or locally) optimal alignments. This is often some variation on a greedy strategy, and each choice there will potentially lead to a different alignment. It is easy to see how left-to-right and right-to-left alignments can be different with such a strategy.

In any case, the take home message is, once again: don’t trust alignments!


Landan, G., Graur, D. (2007). Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments. Molecular Biology and Evolution, 24(6), 1380-1383. DOI: 10.1093/molbev/msm060

Uncertainty in inferred alignments

Monday, March 3rd, 2008

Here’s yet another paper addressing the uncertainty in inferred alignments that is typically ignored when doing comparative genomics. For two others, see my reviews: Alignment bias in genomics and Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes.

Uncertainty in homology inferences: Assessing and improving genomic sequence alignment

Lunter et al.

Genome Res. 18:298-309, 2008

Abstract

Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.

The paper itself actually has a funny story that I witnessed when I worked in Oxford, but I’ll keep that story out of here. Those who know it will nod (or shake their head, as the case might be), and those who do not probably should hear it from the authors rather than me ;-)

The problem with alignments

Most comparative genomic analyses rely on having an alignment between the genomes being compared. The problem is that we never have such an alignment, but need to infer it. For highly similar sequences, this is not much of a problem, but for even relatively closely related species, such as men and mice, we only have really closely related sequences for conserved bits of the genome. We are relatively good at aligning genes, but to really analyse genomes, we cannot rely on only the genes. This is especially so when we want to infer, for example, divergence times, where we want to look at neutrally evolving sites and where genes will give us a biased sample, as there is usually selection against changes there.

If we align genomes anyway, we need to take into account the uncertainty there is in the alignment, but we typically don’t! Once we have inferred an alignment, we treat it as absolute truth. With any other parameter we infer we are expected to report the uncertainty of the estimate together with our estimate, but for alignments we do not.

Probably because this is a lot more difficult to do, but still, completely ignoring the problem just because it is difficult is probably not the way to go.

It might not be such a big problem if the errors in alignment were unbiased, and we based our further inference on large alignments (and thus a large number of alignment columns), but it seems like there is a certain bias in most alignment algorithms.

The source of this bias is to be found in the approach underlying most (if not all) alignment algorithms: optimising some alignment score (or minimising some alignment penalty). Searching for an “optimal” alignment typically means finding an alignment with as few changes as possible, with varying definitions of “few changes”, and this strategy will tend to infer alignments with fewer indels than in the true alignment.

[Figure: Alignment biases]

Lunter et al. consider the case of pairwise alignment and identify the typical alignment biases (essentially the same biases identified in Lunter 2007). These are shown in the figure, where the left-hand side shows the true alignment and the right-hand side the alignment that will typically be inferred. In the two top-most cases, the inferred alignment places the indels incorrectly because (A) moving the indel aligns columns with more conservation, or (B) two independent indels can be replaced by a single longer indel. In the two other cases, the indels are misplaced because the resulting alignment introduces fewer gaps.

[Figure: Results of alignment bias]

There are two expected consequences of these biases: alignment accuracy decreases close to indels, and indels tend to be merged if they are near each other. At the same time, the proportion of identity (columns with no substitutions) increases near indels. In a simulation study, Lunter et al. demonstrate that this is indeed the case. The figure shows the accuracy and proportional sequence identity as a function of the distance to the nearest gap (A) and the distribution of inter-gap lengths (B).

Fixing the problem

Using statistical alignment methods (an application of hidden Markov models), it is possible to capture not only the optimal alignment — the maximum likelihood alignment, in this case — but also the uncertainty in the inferred alignment. Using a technique called “posterior decoding” it is possible to assign the probability that a given alignment column is correct to the individual columns. This way, problematic areas of an alignment can be identified.

Not only can posterior decoding annotate an existing alignment, it can also tell us the probability that any particular pair of nucleotides should be aligned, implicitly considering the set of all possible alignments where that particular pair is considered homologous. It is possible to construct an alignment from this information by selecting the alignment that maximises the product of the probabilities assigned to each column in the alignment. This approach differs from the maximum likelihood alignment by not considering the transition probabilities in the underlying hidden Markov model, but it can produce better alignments, in the sense that they more closely match the true alignment.

Lunter et al. expand on this idea by changing the posterior probability for aligning nucleotides to gaps. Instead of weighting a column with the probability that a given nucleotide matches a particular gap, they weight it with the probability that it matches any gap. The alignment is then constructed the same way as in the posterior decoding algorithm.

The intuition is that around gaps, any posterior is low (compared to nucleotides well away from gaps), but by re-weighting this way, a nucleotide is more likely to align up against a gap when it really should align to a gap.
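The two decoding schemes described above can be sketched as follows. This is a toy illustration of the description in this post, not the algorithm from the paper: the posteriors are taken as given (in practice they would come from a forward-backward pass over a pair HMM), the table layout and the names `decode` and `marginalise_gaps` are my own assumptions, and the numbers in the example are invented.

```python
import math


def decode(p_match, p_gap_a, p_gap_b):
    """Pick the alignment maximising the product of per-column posteriors.

    p_match[i][j]: posterior that a[i] is aligned to b[j]          (n x m)
    p_gap_a[i][j]: posterior that a[i] faces a gap placed after
                   j residues of b                                 (n x (m+1))
    p_gap_b[i][j]: posterior that b[j] faces a gap placed after
                   i residues of a                                 ((n+1) x m)
    Returns columns as (i, j), (i, None) or (None, j).
    """
    n, m = len(p_match), len(p_match[0])

    def log(p):
        # Floor at a tiny value so impossible columns stay comparable.
        return math.log(max(p, 1e-300))

    best = [[float('-inf')] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i > 0 and j > 0:
                moves.append((best[i - 1][j - 1] + log(p_match[i - 1][j - 1]), 'M'))
            if i > 0:
                moves.append((best[i - 1][j] + log(p_gap_a[i - 1][j]), 'A'))
            if j > 0:
                moves.append((best[i][j - 1] + log(p_gap_b[i][j - 1]), 'B'))
            for value, move in moves:
                if value > best[i][j]:
                    best[i][j], back[i][j] = value, move

    cols, i, j = [], n, m
    while i > 0 or j > 0:
        if back[i][j] == 'M':
            cols.append((i - 1, j - 1)); i -= 1; j -= 1
        elif back[i][j] == 'A':
            cols.append((i - 1, None)); i -= 1
        else:
            cols.append((None, j - 1)); j -= 1
    return cols[::-1]


def marginalise_gaps(p_gap_a, p_gap_b):
    """Replace each gap posterior by the probability of facing any gap."""
    marg_a = [[sum(row)] * len(row) for row in p_gap_a]
    totals = [sum(row[j] for row in p_gap_b) for j in range(len(p_gap_b[0]))]
    marg_b = [totals[:] for _ in p_gap_b]
    return marg_a, marg_b


# Invented posteriors for one residue in a and two residues in b.
p_match = [[0.34, 0.0]]
p_gap_a = [[0.22, 0.22, 0.22]]
p_gap_b = [[0.44, 0.22],
           [0.22, 0.78]]

print(decode(p_match, p_gap_a, p_gap_b))
print(decode(p_match, *marginalise_gaps(p_gap_a, p_gap_b)))
```

In this example a[0] is more likely to face a gap (0.66 in total) than to match b[0] (0.34), but the gap probability is spread over three positions. Plain posterior decoding therefore keeps the match, while the marginalised version aligns a[0] against a gap.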

[Figure: Comparison of Viterbi (maximum likelihood), posterior decoding, and marginalized posterior decoding]

They then show that this change improves the alignment by increasing the sensitivity (S), the ratio of correctly aligned columns to all homologous columns, reducing the false-positive fraction (FPF), wrongly aligned nucleotides over non-gapped columns, and reducing the non-homologous fraction, the fraction of aligned columns that are not truly homologous. The figure compares the maximum likelihood alignment (calculated by the Viterbi algorithm) with the posterior decoding algorithm and their new marginalized posterior decoding algorithm.
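For pairwise alignments where the true alignment is known (as in a simulation), the first two measures can be computed roughly like this. It is a sketch based on the definitions as summarised here, with invented function names; the paper's exact bookkeeping may differ.

```python
def aligned_pairs(row_a, row_b):
    """Set of (i, j) index pairs that sit in the same non-gapped column."""
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != '-' and y != '-':
            pairs.add((i, j))
        if x != '-':
            i += 1
        if y != '-':
            j += 1
    return pairs


def sensitivity_and_fpf(true_alignment, inferred_alignment):
    """Sensitivity: correctly aligned pairs over all truly homologous pairs.
    FPF: wrongly aligned pairs over all non-gapped inferred columns."""
    truth = aligned_pairs(*true_alignment)
    inferred = aligned_pairs(*inferred_alignment)
    sensitivity = len(truth & inferred) / len(truth)
    fpf = len(inferred - truth) / len(inferred)
    return sensitivity, fpf


true_alignment = ("AAG", "A-G")
inferred_alignment = ("AAG", "-AG")
print(sensitivity_and_fpf(true_alignment, inferred_alignment))  # (0.5, 0.5)
```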

[Figure: Comparison with other algorithms]

They also compare with other popular alignment algorithms and show improvements, especially measured by sensitivity and non-homologous fraction (see the figure). The figure is slightly misleading, since the statistical model used in the Viterbi algorithm is simpler than the one in the marginalized posterior decoding, but the paper shows that the real gain in accuracy is due to the algorithm and not to the underlying model:

We found that more accurate modelling resulted in only very marginal improvements of the alignment accuracy. Indeed, in our simulation study of sequences at human–mouse divergence, the modeling of indel lengths using a mixed geometric distribution resulted in the single largest improvement in sensitivity, from 85.3% to 85.6% using Viterbi decoding, and from 87.8% to 88.2% using MPD. The geometric mixture model helps to align sequences across large indels, which are relatively infrequent, explaining the relatively modest improvement. Modeling the variation in GC content reduces the false-positive fraction (from 15.2% to 13.6% using MPD), but has little effect on sensitivity. Surprisingly, accurate modeling of indel and substitution rate variation has little, if any, effect. This robustness to misparameterization is supported by our simulations under the Jukes–Cantor model, where substantial variations in the rate parameters resulted in very little difference.

So what?

What is the consequence of the biases introduced by trusting incorrect alignments?

It is not completely obvious to me.

If we move indels around to achieve higher sequence similarity, we end up underestimating the number of substitutions, of course, which means we will tend to underestimate divergence time. The effect depends on the number of indels between the sequences, since the bias only shows close to indels: if the alignment mainly consists of nucleotides well away from gaps, the effect should be small. That means closely related sequences, though.

Improving the inferred alignment, using methods such as those introduced here, is a help, of course, but we are still in the situation where we infer an alignment and then treat it as “truth” in the further analysis.

It seems to me that we would be better off carrying the uncertainty over to the further analysis, either by incorporating parameter estimation and such in the statistical alignment algorithms, or by weighing alignment columns by their posterior probability in the further analysis.
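As a very rough illustration of the second option, here is how column weights could enter something as simple as the fraction of mismatching sites. The per-column posteriors are assumed to come from a statistical aligner; the function name and the numbers are made up for the example.

```python
def weighted_mismatch_fraction(row_a, row_b, posteriors):
    """Fraction of mismatching columns, each non-gapped column weighted
    by the posterior probability that it is correctly aligned."""
    mismatch_weight = 0.0
    total_weight = 0.0
    for x, y, w in zip(row_a, row_b, posteriors):
        if x == '-' or y == '-':
            continue  # gap columns carry no substitution information
        total_weight += w
        if x != y:
            mismatch_weight += w
    return mismatch_weight / total_weight if total_weight > 0 else float('nan')


row_a = "ACG-T"
row_b = "ACTAT"
trusting = [1.0, 1.0, 1.0, 1.0, 1.0]   # treat the alignment as truth
cautious = [1.0, 1.0, 0.5, 0.6, 0.9]   # doubt the columns next to the gap

print(weighted_mismatch_fraction(row_a, row_b, trusting))  # 0.25
print(weighted_mismatch_fraction(row_a, row_b, cautious))
```

The mismatch next to the gap counts for less in the second call, because that is exactly the kind of column the alignment is least sure about. A real analysis would of course need a proper substitution model on top of this.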

Details left to the reader, of course ;-)


Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A., Hein, J. (2008). Uncertainty in homology inferences: Assessing and improving genomic sequence alignment. Genome Research, 18(2), 298-309. DOI: 10.1101/gr.6725608

Alignment bias in genomics

Tuesday, January 29th, 2008

I have previously written a bit about how optimal alignment algorithms introduce an alignment bias and even done some work on it myself (currently submitted for publication, so I cannot link to it yet). Today I saw a paper in the current issue of Science addressing the same problem.

A summary can be found in

Lining Up to Avoid Bias

Antonis Rokas

Science Vol. 319. no. 5862, pp. 416 – 417

and the full paper (probably requires a subscription) is

Alignment Uncertainty and Genomic Analysis

Karen M. Wong, Marc A. Suchard, and John P. Huelsenbeck

Science Vol. 319. no. 5862, pp. 473 – 476

The problem with alignments

I’ve already described the problem in the previous post, where I used the examples from Gerton Lunter’s paper

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

G. A. Lunter

Bioinformatics 2007; DOI: 10.1093/bioinformatics/btm185

although there the focus was on the problems with indels. Of course, without indels there simply isn’t any problem with alignment, so that is not as unreasonable as it might sound.

Essentially, the problem is that we use algorithms to infer optimal alignments and then treat these alignments as absolute truth, ignoring the uncertainty in the inference.

Wong et al. compare seven different alignment algorithms and consider typical evolutionary analyses (inference of phylogenies and detection of selection) based on the inferred alignments, and they see a large variability in the results of the analyses depending on the alignment method.

The solution proposed in Wong et al. is the same as Gerton proposes: statistical alignment methods. Quoting Wong et al.:

The problem of alignment uncertainty in genomic studies, identified here, is not a problem of sloppy analysis. Many comparative genomics studies are carefully performed and reasonable in design. However, even carefully designed and carried out analyses can suffer from these types of problems because the methods used in the analysis of the genomic data do not properly accommodate alignment uncertainty in the first place.

In a comparative genomics study, we advocate that alignment be treated as a random variable, and inferences of parameters of interest to the genomicist, such as the amount of nonsynonymous divergence or the phylogeny, consider the different possible alignments in proportion to their probability.

Of course, this is what the statistical alignment people in Oxford have been trying for years and it is not quite as easy as it sounds.


Citations, for Research Blogging:
Rokas, A. (2008). GENOMICS: Lining Up to Avoid Bias. Science, 319(5862), 416-417. DOI: 10.1126/science.1153156
Wong, K.M., Suchard, M.A., Huelsenbeck, J.P. (2008). Alignment Uncertainty and Genomic Analysis. Science, 319(5862), 473-476. DOI: 10.1126/science.1151532

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

Wednesday, January 9th, 2008

Today, while preparing for a thesis meeting with Ricky, I read Gerton’s paper

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

G. A. Lunter

Bioinformatics 2007; DOI: 10.1093/bioinformatics/btm185

Abstract

Motivation: The two mutation processes that have the largest impact on genome evolution at small scales are substitutions, and sequence insertions and deletions (indels). While the former have been studied extensively, indels have received less attention, and in particular, the problem of inferring indel rates between pairs of divergent sequence remains unsolved. Here, I describe a novel and accurate method for estimating neutral indel rates between divergent pairs of genomes.

Results: Simulations suggest that new method for estimating indel rates is accurate to within 2%, at divergences corresponding to that of human and mouse. Applying the method to these species, I show that indel rates are up to twice higher than is apparent from alignments, and depend strongly on the local G + C content. These results indicate that at these evolutionary distances, the contribution of indels to sequence divergence is much larger than hitherto appreciated. In particular, the ratio of substitution to indel rates between human and mouse appears to be around gamma = 8, rather than the currently accepted value of about gamma = 14.

I knew the results before, from discussions with Gerton, but this is the first time I’ve actually read it. The paper concerns the biases in placing gaps in alignment algorithms (whether probabilistic or parsimony based) and how these tend to make us underestimate the number of indels in the true alignment and thus the indel rate.

Gap errors

The problem with gaps is that it is almost always better to have a few extra substitutions than a few extra gaps, since indels are less frequent and so their occurrence is less likely. When maximising the likelihood of the alignment, we therefore tend to remove gaps that should be there (even unlikely events do occur from time to time) and instead add substitutions that should not be there.
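A tiny worked example shows the effect. Suppose the true history is a deletion followed by an insertion at neighbouring positions. Under a simple scoring scheme (match 1, mismatch -1, gap -2; these numbers are an assumption for illustration, not taken from the paper) the gapless alignment with one substitution beats the true alignment.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score a given pairwise alignment column by column."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == '-' or y == '-':
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


true_alignment = ("AC-GT", "A-TGT")   # one deletion and one insertion
inferred_alignment = ("ACGT", "ATGT")  # explained as a single substitution

print(alignment_score(*true_alignment))      # -1
print(alignment_score(*inferred_alignment))  # 2
```

An aligner that maximises this score will report zero indels where there were two, which is the kind of undercounting the paper is about.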

Unbiased estimator

Using statistical alignment and posterior decoding Gerton derives another estimator for the indel rate and shows that this essentially removes the bias. The essential idea is that when the alignment is derived through the statistical alignment algorithm, areas where gaps are misplaced will have a lower posterior certainty. The optimal alignment that is derived is not significantly more likely than several others, so the posterior probability of that exact alignment is less than it would be if placement of the gaps was more certain.

The new estimator is the red line on the plot on the right. The blue is what you would get if you just trusted the most likely alignment. The green line you get by fitting the neutral indel model from his earlier paper, Genome-Wide Identification of Human Functional DNA Using a Neutral Indel Model (Lunter, Ponting and Hein, PLoS Computational Biology 2006).

The reason the bias only shows when the substitution rate is rather high is, of course, that you are less likely to mistake non-homologous sequences for homologous ones when misplacing a gap if the true alignment has a high sequence identity, i.e. when you have a low substitution rate, than if it has a low sequence identity.


The citation, for Research Blogging:
Lunter, G. (2007). Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics, 23(13), i289-i296. DOI: 10.1093/bioinformatics/btm185