Posts Tagged ‘HMM’

Designing a hidden Markov model framework

Wednesday, March 11th, 2009

Hidden Markov models (HMMs) are very powerful statistical models for sequence analysis, and are widely used in bioinformatics.

The main benefits of HMMs are that they are 1) usually a pretty easy way of modelling whatever underlying biology you are interested in, and 2) provide a very efficient way of computing the likelihood of your model.

Both are relative, of course, but compared to, say, MCMC methods — who also provide very powerful ways of modelling — the simplifying Markov assumption needed for HMMs simplify the modelling (at some cost in model fitting, of course) and inference in HMMs is usually many orders of magnitude faster than inference in MCMCs.

Still, even with very efficient inference algorithms, genome scale data can be a problem.  Analysing alignments of giga-base long genomes can still take CPU days.

So it makes sense to speed up the HMMs if we can.

An algorithmic approach to speeding up the basic HMM algorithms is not immediately obvious.  The basic algorithms simply fill in table entries in O(1) time per cell, so there is little that can be done here.

A more immediate solution is to exploit the parallelisation available on modern hardware, such as single instruction, multiple data (SIMD) operations and multiple cores.

At BiRC we had a Masters student, Asbjørn Brask, who recently wrote a thesis on this: Optimering af generelle skjulte Markovmodeller på multikerne CPUer og GPUer.  The focus was on “dense” HMMs, that is, HMMs where the transition probability matrix contains few if any zero probabilities.  These are common enough to be interesting, and relatively easy to deal with for SIMD instructions since it is easy to pack the entries in the transition matrix into machine words for parallel calculations.

We (myself and a student programmer, Andreas Sand) are now working on a HMM framework to make these ideas available in an easy-to use library.

This turns out to be a bit more complicated than we had hoped.

Not so much implementing the HMM algorithms.  For that we have Asbjørn’s code that we can use with a few changes here and there.  The problem is that there is a number of different, orthogonal, ways of representing the data and manipulating it in parallel, which if not handled properly leads to an exponential explosion in the number of classes.  And it is not quite as orthogonal after all, since data representation affects how we can manipulate the data with SIMD instructions.

Data representation

Take the simple table type:

template <typename floattype,
          typename simd_floattype = floattype,
          typename Allocator = BasicAllocator<floattype> >
class HMMTable
{ /* ... */ }

We have a type for the entries in the table. This is important for the precision of the floating point operations, but also has an impact on the speedup we get from SIMD instructions (the smaller the type, the more we can pack into a computer word).

Then we have the SIMD type.  For the most basic algorithms — with no SIMD acceleration — this would just be the same as the basic type, but for SIMD instructions it would be a larger computer word into which we can back several of the basic types for parallel manipulation.

Then there is allocation issues.  For SIMD instructions we need the right alignment of types, so we cannot simply use new and delete (where for a solution without SIMD of course we can).  So we need some sort of trait class for dealing with allocation and de-allocation of memory.

On top of this, of course, there’s issues such as how the basic type is packed into the SIMD type (which is both important for just accessing the table cells and also for the parallel operations on the table).  Yet another trait class might be needed here.

There is some orthogonality and some dependencies here.

Any basic type for the table cells should work with any SIMD type (as long as the SIMD type can contain the basic type, but that is not going to be a problem on any sensible hardware), but the allocation/deallocation of course depends on the SIMD type as does the packing of cell values into SIMD words.

HMM Algorithms

There’s only a few HMM algorithms needed for most analyses.  The Forward and Backward algorithms for maximizing the likelihood and the Viterbi and Posterior Decoding algorithms for analysing the states afterwards.

Still, with SIMD instructions thrown in, there is a lot of problems.

Ideally, we just want one implementation of each algorithm that should work with any type of table (and transition and emission matrices).  This doesn’t work, however, since the SIMD instructions and how we can use them in speeding up the algorithms depends on the data representation.

So besides parameterising the algorithms with the table and HMM types, we need a bunch of trait classes for doing all the basic operations, and these depends on the SIMD types in the table and the HMM transition and emission matrices.

Template meta programming?

For putting the different data representations together with the algorithms, we are now thinking about some template meta programming approach.

This way we can weave together the different data representations with the optimal algorithms for working on them, and the user needs only provide the basic type of the matrix cells.  The latter is best left to the user, who hopefully knows the precision he needs for the final results.

I have some experience with this from previous projects, but I don’t think I’ve ever gotten it “just right”, so it will be interesting to see how it turns out for this project…

70-87=-17

Phylogenetic inference under recombination using Bayesian stochastic topology selection

Friday, January 16th, 2009

There’s an interesting paper in the current issue of Bioinformatics that I’ve just finished reading:

Phylogenetic inference under recombination using Bayesian stochastic topology selection

Webb et al. Bioinformatics 25(2) 197-203

Abstract

Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths.

Results: We demonstrate our approach on simulated data and also to the genome of a suspected HIV recombinant strain as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.

The paper presents a new method for analysing sequences that have undergone recombination.

When sequences have not undergone recombination, a nice methodology for analysing them is the PhyloHMM (PDF). With this method, you have a hidden Markov model where the emission probability is determined by a phylogeny, and usually computed using Felsenstein’s pruning algorithm.

When there is recombination, the problem is that there are more than one topology for the underlying phylogeny, and if you do not know the topologies you cannot immediately calculate the emission probabilities.

You can instead model the unknown topologies as hidden states. This approach was taken by Husmeier and McGuire (2003) and is also the approach we take in our CoalHMM method (Hobolth et al 2007).

This approach doesn’t scale, however, since the number of possible toplogies grows super-exponential with the number of species.

In this paper the solve the problem by using only a few topologies as states in the HMM, but sampling over all possible topologies to be used, in an MCMC approach.  Ideally the number of topologies should be variable, but that requires a reversible jump MCMC and they haven’t implemented that.  Still, it seems to work very well.

I remember discussing the problem with both Alex and Chris when I was last in Oxford, but back then it didn’t work so well, so I am happy to read that they’ve solved the problems. Properly handling recombination and changing topologies is important for accurate sequence analysis.


A. Webb, J. M. Hancock, C. C. Holmes (2008). Phylogenetic inference under recombination using Bayesian stochastic topology selection Bioinformatics, 25 (2), 197-203 DOI: 10.1093/bioinformatics/btn607
16-29=-13

Trying out Boost.Test

Wednesday, October 8th, 2008

I’ve just started a new programming project — a library for dense HMMs that uses parallel hardware for its computations, if you want to know — and I decided to use Boost.Test for my unit testing.

Normally, I just write my unit tests with asserts and maybe a few home-made macros, but since I am going to use Boost heavily in the code anyway (I do more and more these days) I figured I might as well try out its unit testing framework.

Problems with the documentation

To my great surprise, I had some problems with the documentation of the framework.

Usually, the documentation for the boost libraries is excellent — at least compared to most libraries I use — and if you just read the documentation for Boost.Test it looks great.

There is a lot of it, with detailed descriptions of this and that and with tutorials to get you started.

It’s just that the examples there do not work.

Take for example this program from the tutorial:

#define BOOST_TEST_MODULE MyTest
#include <boost/test/unit_test.hpp>

int add( int i, int j ) { return i+j; }

BOOST_AUTO_TEST_CASE( my_test )
{
    // seven ways to detect and report the same error:
    BOOST_CHECK( add( 2,2 ) == 4 );        // #1 continues on error

    BOOST_REQUIRE( add( 2,2 ) == 4 );      // #2 throws on error

    if( add( 2,2 ) != 4 )
      BOOST_ERROR( "Ouch..." );            // #3 continues on error

    if( add( 2,2 ) != 4 )
      BOOST_FAIL( "Ouch..." );             // #4 throws on error

    if( add( 2,2 ) != 4 ) throw "Ouch..."; // #5 throws on error

    BOOST_CHECK_MESSAGE( add( 2,2 ) == 4,  // #6 continues on error
                         "add(..) result: " << add( 2,2 ) );

    BOOST_CHECK_EQUAL( add( 2,2 ), 4 );	  // #7 continues on error
}

The BOOST_AUTO_TEST_CASE() macro should create a test function and plug it into the framework, and after compiling the file (and linking with -lboost_unit_test_framework) you should have a test program.

Well, you can compile the program, but you cannot link it.  There is no main() function.

Oh well, if you read the header file boost/test/unit_test.hpp you find these lines:

#if defined(BOOST_TEST_DYN_LINK) && defined(BOOST_TEST_MAIN) && !defined(BOOST_TEST_NO_MAIN)
int BOOST_TEST_CALL_DECL
main( int argc, char* argv[] )
{
    return ::boost::unit_test::unit_test_main( &init_unit_test, argc, argv );
}

so it seems that the framework will define main, if only you have defined the right symbols, and yes, adding

#define BOOST_TEST_DYN_LINK
#define BOOST_TEST_MAIN

to the top of the program makes it run.

There were a few other cases like this, where I couldn’t figure out how to use the framework.  Like testing template functions with a list of different template parameters, but I just worked my way around that problem with my own macros.

Using the framework

There’s a lot of different things you can do with Boost.Test, but so far I’ve just used the very basic functionality.

I use the BOOST_AUTO_TEST_CASE() macro for my test functions.  The different cases are automatically grouped into test suites — one per file (compilation unit) — so I don’t worry about the larger framework.  I write a few test cases per code unit I need to test and the rely on the default behaviour of the framework.

The actual testing is done through various BOOST_CHECK_* macros like in the program above.

Among the macros are tests for floating point numbers that lets you test that two numbers are equal up to a certain accuracy.  This is what you want to check, since testing equality of floating point numbers is rarely a good idea.

So far I’m happy with Boost.Test, and I’m going to try out some of the more advanced features as my project progresses, I think.

Investigating Selection on Viruses: A Statistical Alignment Approach

Wednesday, June 18th, 2008

ResearchBlogging.org
Woohoo, we just got a paper accepted. Although it is at BMC Bioinformatics, it isn’t one of the papers I’ve been bitching about — this one we got very helpful reviews on.

It is work from when I was in Oxford. Saskia de Groot did analysis of virus genomes for her PhD (see papers here and here) but for viruses that are relatively far divergent, getting good alignments is a bit of a problem, so I suggested we took a statistical alignment approach to integrate over the uncertainty. So we got together with Gerton Lunter — who does work with this — and came up with this:

Investigating Selection on Viruses: A Statistical Alignment Approach

S. de Groot, T. Mailund, G.A. Lunter and J. Hein

To appear in BMC Bioinformatics

Abstract

Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.

I’ll add a link to the paper as soon as it is up at the journal.

What’s the problem?

We were trying to figure out selection in viruses where genes can have overlapping reading frames. In such cases, figuring out the neutral substitution rate is a bit of a problem, ’cause a synonymous substitution in one gene can be a non-synonymous substitution in an overlapping gene. Using dN/dS to figure out selection won’t work.

Instead we took and extended a method by Hein and Støvlbæk to explicitly model substitutions with selection in overlapping reading frames. We ought to consider the neighbour dependent substitutions you get when you are modelling codon changes (which again is complicated by overlapping genes), but methods for that can be very slow and won’t scale to whole genomes. Even virus genomes. Pedersen and Jensen tried that in an MCMC approach. Hobolth’s recent approach might have worked — it is the paper I blogged about a little back — but we didn’t know about it at the time.

Anyway, we essentially have a method for modelling the evolution over overlapping genes, but we cannot trust the alignment of viruses because they are too divergent, and if we infer an optimal alignment it is almost certainly wrong. An optimal alignment will often have too few substitutions compared to the real alignment.

What did we do?

Since we cannot trust a single alignment, we instead sum over all possible alignments. Using hidden Markov models, we can do that, and at the same time calculate the probability of any single one of them.

We can then consider the substitutions in each of the alignments and weight the observed substitutions with the probability of the alignment. That way, the more likely alignment weigh in more when we consider substitutions than less likely.

It is similar to what Rahul Satija, Lior Pachter and Jotun Hein were doing for phylogenetic footprinting in the neighbour office at the samme time…

Using this approach, we show that we alleviate a systematic bias in using optimal alignments and get better estimates of selection factors.

We only handle pair-wise alignments but hack our way out of using more sequences to get better estimates still. It isn’t really the best approach and we should probably try a Gibbs sampler to handle multiple sequence alignments, but that is left for future work…


de Groot, S., Mailund, T., Lunter, G.A., Hein, J. (2008). Investigating Selection on Viruses: A Statistical Alignment Approach . BMC Bioinformatics