Archive for May 7th, 2008

Combining alignment inference and phylogenetic footprinting

Wednesday, May 7th, 2008

ResearchBlogging.org

Since constructing an alignment is really a statistical inference problem, we shouldn't pretend that we can infer it flawlessly and completely ignore the uncertainty in the inference. Especially since doing so can bias the downstream inference. See a list of previous posts where I discuss these problems to get an idea about the problem.

At the very least, we should take into account the uncertainty associated with alignments. In many cases, however, we are not really interested in the alignment in the first place. It is a nuisance parameter in our model, necessary for extracting whatever information we are really interested in.

For example, if we want to annotate a sequence and want to use related sequences for this annotation, we are not really interested in alignments but in the annotation. It is just that we need alignments to get the annotation.

Or do we? Couldn't we just integrate / sum over all alignments, with some suitable prior, and get rid of the nuisance parameter that way?

At least in theory, you can, and in many cases you can in practise as well. I have, myself, a paper under review doing this, but some colleagues in Oxford beat me to it (despite them starting their project long after we had the results for our paper ... we have been a bit slow on this thing):

Combining statistical alignment and phylogenetic footprinting to detect regulatory elements
Rahul Satija, Lior Pachter and Jotun Hein

Bioinformatics 2008 24(10):1236-1242; doi:10.1093/bioinformatics/btn104

Abstract

Motivation: Traditional alignment-based phylogenetic footprinting approaches make predictions on the basis of a single assumed alignment. The predictions are therefore highly sensitive to alignment errors or regions of alignment uncertainty. Alternatively, statistical alignment methods provide a framework for performing phylogenetic analyses by examining a distribution of alignments.

Results: We developed a novel algorithm for predicting functional elements by combining statistical alignment and phylogenetic footprinting (SAPF). SAPF simultaneously performs both alignment and annotation by combining phylogenetic footprinting techniques with an hidden Markov model (HMM) transducer-based multiple alignment model, and can analyze sequence data from multiple sequences. We assessed SAPF's predictive performance on two simulated datasets and three well-annotated cis-regulatory modules from newly sequenced Drosophila genomes. The results demonstrate that removing the traditional dependence on a single alignment can significantly augment the predictive performance, especially when there is uncertainty in the alignment of functional regions.

The idea, quite simply, is this: You can express the alignment problem statistically as a hidden Markov model (HMM), and you can express the footprinting problem as a hidden Markov model, and it is rather straight forward to combine two hidden Markov models into a combined model (at least as long as you do not worry too much about the probabilities and only consider the topology). Running the combined HMM, you can look at the marginal states can consider them solutions to the two original problems. Again, there are some problems with interpreting the marginal states as runs of the marginal HMMs because of the structure imposed on them by the joint HMM, but in most cases it doesn't matter much.

In this paper, they happily just combine the alignment HMM with the footprinting HMM and thereby get footprinting annotations while summing away the alignments as a nuisance parameter.

In my opinion, this is the way to go.

There is one complication, though, and that is the running time. The way the HMM is constructed here, the number of sequences in the data is a problem. This is a general problem, regardless of whether you consider statistical or "plain old" alignment: the running time to get optimal alignments (or sum over them all) grows exponential in the number of sequences.

The way out of this is sampling methods, and there is an obvious Gibbs sampler solution. Using the forward and backward algorithm from HMM theory, you can sample the alignment of two sequences with the correct probability. So, in a multiple sequence alignment, you fix all but one sequence, you can use this trick to sample the conditional alignment of one sequence at a time. That is all you need for a Gibbs sampler.

Unfortunately, the Gibbs approach is left for future work, but I guess we will see something soon ...


Satija, R., Pachter, L., Hein, J. (2008). Combining statistical alignment and phylogenetic footprinting to detect regulatory elements. Bioinformatics, 24(10), 1236-1242. DOI: 10.1093/bioinformatics/btn104