Posts Tagged ‘statistical alignment’

StatAlign: a new statistical alignment tool

Wednesday, October 8th, 2008

ResearchBlogging.orgThere’s an application note in the current issue of Bioinformatics that describes a new tool for statistical alignment, StatAlign, developed in my old group in Oxford.

StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees

Ádám Novák , István Miklós, Rune Lyngsø and Jotun Hein

Bioinformatics 2008 24(20):2403-2404

Motivation: Bayesian analysis is one of the most popular methods in phylogenetic inference. The most commonly used methods fix a single multiple alignment and consider only substitutions as phylogenetically informative mutations, though alignments and phylogenies should be inferred jointly as insertions and deletions also carry informative signals. Methods addressing these issues have been developed only recently and there has not been so far a user-friendly program with a graphical interface that implements these methods.

Results: We have developed an extendable software package in the Java programming language that samples from the joint posterior distribution of phylogenies, alignments and evolutionary parameters by applying the Markov chain Monte Carlo method. The package also offers tools for efficient on-the-fly summarization of the results. It has a graphical interface to configure, start and supervise the analysis, to track the status of the Markov chain and to save the results. The background model for insertions and deletions can be combined with any substitution model. It is easy to add new substitution models to the software package as plugins. The samples from the Markov chain can be summarized in several ways, and new postprocessing plugins may also be installed.

I am personally a firm believer in statistical alignment.  I think it is the way to go, to deal with the uncertainty in inferred alignments and to avoid the artefacts they can create.

For a good introduction to the problems (and how statistical approaches to alignment can help), you should read Lunter et al. Uncertainty in homology inferences: Assessing and improving genomic sequence alignment Genome Res. 18:298-309, 2008 (or my summary of it here).

StatAlign, the tool in the application note, looks like a nice way to attack alignments. Unlike previous approaches I’ve blogged about — and unlike my own small work in statistical alignment — it deals with multiple sequences (where MCMC is needed besides just HMMs).

It samples over both alignments and phylogenies, which is nice if there is any uncertainty in the phylogeny inference (which is typically based on alignments in the first place).

I can imagine that integrating over the phylogenies in the MCMC is the main time-killer, though, so it could be nice if you can turn that part of the state space exploration off in case you have a reasonable idea about the phylogeny but you are uncertain about some parts of the alignment…


A. Novak, I. Miklos, R. Lyngso, J. Hein (2008). StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees Bioinformatics, 24 (20), 2403-2404 DOI: 10.1093/bioinformatics/btn457

Statistical alignment and virus selection paper now online

Monday, July 21st, 2008

The paper I described in a previous post: Investigating selection on viruses: a statistical alignment approach, just got published online today.  Yeah us!

Investigating Selection on Viruses: A Statistical Alignment Approach

Wednesday, June 18th, 2008

ResearchBlogging.org
Woohoo, we just got a paper accepted. Although it is at BMC Bioinformatics, it isn’t one of the papers I’ve been bitching about — this one we got very helpful reviews on.

It is work from when I was in Oxford. Saskia de Groot did analysis of virus genomes for her PhD (see papers here and here) but for viruses that are relatively far divergent, getting good alignments is a bit of a problem, so I suggested we took a statistical alignment approach to integrate over the uncertainty. So we got together with Gerton Lunter — who does work with this — and came up with this:

Investigating Selection on Viruses: A Statistical Alignment Approach

S. de Groot, T. Mailund, G.A. Lunter and J. Hein

To appear in BMC Bioinformatics

Abstract

Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.

I’ll add a link to the paper as soon as it is up at the journal.

What’s the problem?

We were trying to figure out selection in viruses where genes can have overlapping reading frames. In such cases, figuring out the neutral substitution rate is a bit of a problem, ’cause a synonymous substitution in one gene can be a non-synonymous substitution in an overlapping gene. Using dN/dS to figure out selection won’t work.

Instead we took and extended a method by Hein and Støvlbæk to explicitly model substitutions with selection in overlapping reading frames. We ought to consider the neighbour dependent substitutions you get when you are modelling codon changes (which again is complicated by overlapping genes), but methods for that can be very slow and won’t scale to whole genomes. Even virus genomes. Pedersen and Jensen tried that in an MCMC approach. Hobolth’s recent approach might have worked — it is the paper I blogged about a little back — but we didn’t know about it at the time.

Anyway, we essentially have a method for modelling the evolution over overlapping genes, but we cannot trust the alignment of viruses because they are too divergent, and if we infer an optimal alignment it is almost certainly wrong. An optimal alignment will often have too few substitutions compared to the real alignment.

What did we do?

Since we cannot trust a single alignment, we instead sum over all possible alignments. Using hidden Markov models, we can do that, and at the same time calculate the probability of any single one of them.

We can then consider the substitutions in each of the alignments and weight the observed substitutions with the probability of the alignment. That way, the more likely alignment weigh in more when we consider substitutions than less likely.

It is similar to what Rahul Satija, Lior Pachter and Jotun Hein were doing for phylogenetic footprinting in the neighbour office at the samme time…

Using this approach, we show that we alleviate a systematic bias in using optimal alignments and get better estimates of selection factors.

We only handle pair-wise alignments but hack our way out of using more sequences to get better estimates still. It isn’t really the best approach and we should probably try a Gibbs sampler to handle multiple sequence alignments, but that is left for future work…


de Groot, S., Mailund, T., Lunter, G.A., Hein, J. (2008). Investigating Selection on Viruses: A Statistical Alignment Approach . BMC Bioinformatics

Uncertainty in inferred alignments

Monday, March 3rd, 2008

ResearchBlogging.org
Here’s yet another paper addressing the uncertainty in inferred alignments that is typically ignored when doing comparative genomics. For two others, see my reviews: Alignment bias in genomics and Probabillistic whole-genome alignments reveal high indel rates in the human and mouse genomes.

Uncertainty in homology inferences: Assessing and improving genomic sequence alignment

Lunter et al.

Genome Res. 18:298-309, 2008

Abstract

Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.

The paper itself actually has a funny story that I was witness when I worked in Oxford, but I’ll keep that story out of here. Those who know it will nod — or shake their head, as the case might be — and those who do not probably should hear it from the authors rather than me ;-)

The problem with alignments

Most comparative genomic analysis rely on having an alignment between the genomes being compared. The problem is that we never have such an alignment, but need to infer it. For highly similar sequences, this is not much of a problem, but for even relatively closely related species — such as men and mice — we only have really closely related sequences for conserved bits of the genome. We are relatively good at aligning genes, but to really analyse genomes, we cannot rely on only the genes. Especially when we want to infer for example divergence times, where we want to look at neutrally evolving sites and where genes will give us a biased sample as there is usually selection against changes there.

If we align genomes anyway, we need to take into account the uncertainty there is in the alignment, but we typically don’t! Once we have inferred an alignment, we treat it as absolute truth. With any other parameter we infer we are expected to report the uncertainty of the estimate together with our estimate, but for alignments we do not.

Probably because this is a lot more difficult to do, but still, completely ignoring the problem just because it is difficult is probably not the way to go.

It might not be such a big problem if the errors in alignment were unbiased, and we based our further inference on large alignments (and thus a large number of alignment columns), but it seems like there is a certain bias in most alignment algorithms.

The source of this bias should be found in the the approach underlying most (if not all) alignment algorithms: optimising some alignment score (or minimising some alignment penalty). Searching for an “optimal” alignment typically means finding an alignment with as few changes as possible — with varying definitions of “few changes” — and this strategy will tend to infer alignments with fewer indels than in the true alignment.

Alignment biasesLunter et al. considers the case of pair-wise alignment, and identifies the typical alignment biases (essentially the same biases identified in Lunter 2007). These are shown on the left, where the left-hand side shows the true alignment and the right-hand side the alignment that will typically be inferred. In the two top-most cases, the inferred alignment places the indels incorrectly because (A) moving the indel aligns columns with a more consertation, or (B) two independent indels can be replaced by a single longer indel. In the two other cases, the indels are misplaced because the resulting alignment this way introduces fewer gaps.

Results of alignment biasThere are two expected consequences of these biases: Alignment accuracy decreases close to indels, and indels tend to be merged if near to each other. At the same time, the proportion of identity (columns with no substitutions) increases near indels. In a simulation study, Lunter et al. demonstrates that this is indeed the case. The figure on the left shows the accuracy and proportional sequence identity as a function of the distance to the nearest gap (A) and the distribution of inter-gap lengths (B).

Fixing the problem

Using statistical alignment methods (an application of hidden Markov models), it is possible to capture not only the optimal alignment — the maximum likelihood alignment, in this case — but also the uncertainty in the inferred alignment. Using a technique called “posterior decoding” it is possible to assign the probability that a given alignment column is correct to the individual columns. This way, problematic areas of an alignment can be identified.

Not only can posterior decoding annotate an existing alignment, posterior decoding can also tell us the probability that any particular pair of nucleotides should be aligned, implicitly considering the set of all possible alignments where that particular pair is considered homologue. It is possible to construct an alignment from this information, by selecting the alignment that maximises the product of the probabilities assigned to each column in the alignment. This approach differs from the maximum likelihood alignment by not considering the transition probabilities in the underlying hidden Markov model, but can produce better alignments, in the sense that they closer match the true alignment.

Lunter et al. expands on this idea by changing the posterior probability for aligning nucleotides to gaps. Instead of weighting a column with the probability that a given nucleotide matches a particular gap, they weight it with the probability that it matches any gaps. The alignment is then constructed the same way as the posterior decoding algorithm.

The intuition is that around gaps, any posterior is low (compared to nucleotides well away from gaps), but by re-weighting this way, a nucleotide is more likely to align up against a gap when it really should align to a gap.

Comparison of Viterby (maximum likelihood), posterior decoding, and marginal posterior decodingThey then show that this change improves the alignment by both increasing the sensitivity (S) — the ratio of correctly alignment columns to all homologous colums — and reducing the false-positive fraction (FPF) — wrongly aligned nucleotides over non-gapped column — and reducing the non-homologous fraction — the fraction of aligned columns that are not truly homologous. The figure on the left compares the maximum likelihood alignment (calculated by the Viterbi algorithm), with the posterior decoding algorithm and their new marginal posterior decoding algorithm.

Comparison with other algorithms.They also compare with other popular alignment algorithms and show improvements, especially measured by sensitivity and non-homologous fraction (figure on the left). The figure is slightly misleading, since the statistical model used in the Viterbi algorithm is simpler than the one in the marginal posterior decoding, but the paper shows that the real gain in accuracy is due to the algorithm and not to the underlying model:

We found that more accurate modelling resulted in only very marginal improvements of the alignment accuracy. Indeed, in our simulation study of sequences at human–mouse divergence, the modeling of indel lengths using a mixed geometric distribution resulted in the single largest improvement in sensitivity, from 85.3% to 85.6% using Viterbi decoding, and from 87.8% to 88.2% using MPD. The geometric mixture model helps to align sequences across large indels, which are relatively infrequent, explaining the relatively modest improvement. Modeling the variation in GC content reduces the false-positive fraction (from 15.2% to 13.6% using MPD), but has little effect on sensitivity. Surprisingly, accurate modeling of indel and substitution rate variation has little, if any, effect. This robustness to misparameterization is supported by our simulations under the Jukes–Cantor model, where substantial variations in the rate parameters resulted in very little difference.

So what?

What is the consequence of the biases introduced by trusting incorrect alignments?

It is not completely obvious to me.

If we move indels around to achieve higher sequence similarity, we end up underestimating the number of substitutions, of course, which means we will tend to underestimate divergence time. The effect depends, of course, on the number of indels between the sequences, since the bias only shows close to indels and if the alignment mainly consists of nucleotides well away from gaps. This means closely related sequences, though.

Improving the inferred alignment, using methods as those introduced here, is a help, of course, but we are still in the situation where we infer an alignment and then treat it as “truth” in the further analysis.

It seems to me that we would be better off carrying the uncertainty over to the further analysis, either by incorporating parameter estimation and such in the statistical alignment algorithms, or by weighing alignment columns by their posterior probability in the further analysis.

Details left to the reader, of course ;-)


Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A., Hein, J. (2008). Uncertainty in homology inferences: Assessing and improving genomic sequence alignment. Genome Research, 18(2), 298-309. DOI: 10.1101/gr.6725608

Alignment bias in genomics

Tuesday, January 29th, 2008

I have previously written a bit about how optimal alignment algorithms introduce an alignment bias and even done some work on it myself (currently submitted for publication, so I cannot link to it yet). Today I saw a paper in the current issue of Science addressing the same problem.

A summary can be found in

Lining Up to Avoid Bias

Antonis Rokas

Science Vol. 319. no. 5862, pp. 416 – 417

and the full paper (probably requires a subscription) is

Alignment Uncertainty and Genomic Analysis

Karen M. Wong, Marc A. Suchard, and John P. Huelsenbech

Science Vol. 319. no. 5862, pp. 473 – 476

The problem with alignments

I’ve already described the problem in the previous post, where I used the examples from Gerton Lunter’s paper

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

G. A. Lunter

Bioinformatics 2007; DOI: 10.1093/bioinformatics/btm185

although there the focus was on the problems with indels. Of course, without indels there simply isn’t any problem with alignment, so that is not as unreasonable as it might sound.

Essentially, the problem is that we use algorithms to infer optimal alignments and then treat these alignments as absolute truth, ignoring the uncertainty in the inference.

In Wong et al. they compare seven different alignment algorithms and consider typical evolutionary analysis — inference of phylogenies and detecting selection — based on the inferred alignments, and see a large variability of analysis result dependent on inference method.

The solution proposed in Wong et al. is the same as Gerton proposes: statistical alignmentet methods. Quoting Wong et al.:

The problem of alignment uncertainty in genomic studies, identified here, is not a problem of sloppy analysis. Many comparative genomics studies are carefully performed and reasonable in design. However, even carefully designed and carried out analyses can suffer from these types of problems because the methods used in the analysis of the genomic data do not properly accommodate alignment uncertainty in the first place.

In a comparative genomics study, we advocate that alignment be treated as a random variable, and inferences of parameters of interest to the genomicist, such as the amount of nonsynonymous divergence or the phylogeny, consider the different possible alignments in proportion to their probability.

Of course, this is what the statistical alignment people in Oxford have been trying for years and it is not quite as easy as it sounds.


Citations, for Research Blogging:Rokas, A. (2008). GENOMICS: Lining Up to Avoid Bias. Science, 319(5862), 416-417. DOI: 10.1126/science.1153156Wong, K.M., Suchard, M.A., Huelsenbeck, J.P. (2008). Alignment Uncertainty and Genomic Analysis. Science, 319(5862), 473-476. DOI: 10.1126/science.1151532