Posts Tagged ‘hmmlib’

Multi-core HMMs

Thursday, June 11th, 2009

I’m working on a project where it looks like I could end up with hidden Markov models with hundreds if not thousands of states, and with transition matrices where all entries are non-zero.  Since, for such models, the running time for just calculating the likelihood (not even optimising it) is O(m\cdot n^2) where m is the length of the input and n is the number of states, this could be a problem.  Especially since I’m working on whole genome data, so m can get pretty large as well.

It is part of our CoalHMM work, but I’ll tell you more about that later.  For this post, the actual application isn’t all that important, only the size of the model.

So, anyway, needless to say I need a fast implementation if this is ever going to work.

HMM algorithms

The basic hidden Markov model algorithms are essentially just filling in dynamic programming tables.

The “Forward” algorithm, for example, looks something like

for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        }
        F[i,j] = sum;
    }
}

There’s a few more details in a real implementation, ’cause this version will have underflows in a few steps, but this is the essential idea.

SIMD speedups

We have developed a small library with these algorithms here at BiRC (mainly implemented by one of my student programmers, Andreas) that speeds up the algorithms using single instruction, multiple data (SIMD) instructions.

The idea here is that we can calculate the sums as above by adding several entries in parallel.  For doubles we can handle two numbers in parallel, for floats we can handle four, so ideally we can speed up the algorithms by a factor of two or four.

We don’t exactly achieve that – there is some overhead involved – but it is not too bad.  The figure below compares the running time with and without SIMD for varying number of states, n, with a fixed input length m=10,000.

The number type – float or double – doesn’t matter for the plain algorithm since the same number of operations is used, so I’ve only plotted the time for doubles.

This is all good and well, but with really large HMMs I think we will need even more of a boost.

Multi-core speedups

It is not hard to get your hands on a multi-core machine these days – if anything it is hard to get a single core machine.  Not massively multi-core machines, although I expect to see those in a few years, but at least dual or quad core machines. So I was hoping to get another factor of two or four from running two or four threads.

In general, parallelising the instructions in the program is harder than parallelising the data operations as we do with the SIMD.  For one thing, there is a lot of synchronisation issues when running multiple threads, and this leads to a lot of overhead.

When filling in the dynamic programming table as above, each column in the table depends on the previous column, but each cell in the column is independent from the other cells in the column.  Thus, we should be able to process each cell in a column independently, but synchronise the threads after each column so we ensure that the previous column is always fully processed.

Since the synchronisation isn’t free, we expect that we need a large number of rows in the table to make it worthwhile, but in my project I expect hundreds to thousands of states, so perhaps there is some gain to it.

I decided to try it out, anyway.

Setting up threads and synchronising them and all that used to require some programming effort, but now that we have OpenMP it is very easy to add parallelism to a program.  To parallelise the algorithm above, you simply add a pragma:

for (int i = 0; i < m; ++i) {
#pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        }
        F[i,j] = sum;
    }
}

and by magic the middle loop is parallelised so the body of it is run in different threads.  Setting up the threads, assigning ranges of the sum to the different threads, synchronising them after the loop, etc. is all taken care of by the compiler.

Neat.

The results are plotted below (on my dual core iMac):

As is evident, there is a large overhead in the threaded version, so for anything below a few hundred states it is slowing the algorithm down instead of speeding it up, but eventually the parallelism does kick in.

Well, for the version without SIMD at least, which is the one I’ve plotted above.

For double SIMD, it is just worth it for the large number of states

while for floats it isn’t really worthwhile for this range of number of states

Of course, this is just the very first attempt at parallelising the algorithms, so there might still be some tricks to try.  For one thing, I have a feeling I can get rid of some of the thread overhead by adding a pragma outside the outer loop. I recall seeing something like that before but I don’t remember quite how it worked.

Anyway, I’ll play with it some more later.

162-165=-3

Trying out Boost.Test

Wednesday, October 8th, 2008

I’ve just started a new programming project — a library for dense HMMs that uses parallel hardware for its computations, if you want to know — and I decided to use Boost.Test for my unit testing.

Normally, I just write my unit tests with asserts and maybe a few home-made macros, but since I am going to use Boost heavily in the code anyway (I do more and more these days) I figured I might as well try out its unit testing framework.

Problems with the documentation

To my great surprise, I had some problems with the documentation of the framework.

Usually, the documentation for the boost libraries is excellent — at least compared to most libraries I use — and if you just read the documentation for Boost.Test it looks great.

There is a lot of it, with detailed descriptions of this and that and with tutorials to get you started.

It’s just that the examples there do not work.

Take for example this program from the tutorial:

#define BOOST_TEST_MODULE MyTest
#include <boost/test/unit_test.hpp>

int add( int i, int j ) { return i+j; }

BOOST_AUTO_TEST_CASE( my_test )
{
    // seven ways to detect and report the same error:
    BOOST_CHECK( add( 2,2 ) == 4 );        // #1 continues on error

    BOOST_REQUIRE( add( 2,2 ) == 4 );      // #2 throws on error

    if( add( 2,2 ) != 4 )
      BOOST_ERROR( "Ouch..." );            // #3 continues on error

    if( add( 2,2 ) != 4 )
      BOOST_FAIL( "Ouch..." );             // #4 throws on error

    if( add( 2,2 ) != 4 ) throw "Ouch..."; // #5 throws on error

    BOOST_CHECK_MESSAGE( add( 2,2 ) == 4,  // #6 continues on error
                         "add(..) result: " << add( 2,2 ) );

    BOOST_CHECK_EQUAL( add( 2,2 ), 4 );	  // #7 continues on error
}

The BOOST_AUTO_TEST_CASE() macro should create a test function and plug it into the framework, and after compiling the file (and linking with -lboost_unit_test_framework) you should have a test program.

Well, you can compile the program, but you cannot link it.  There is no main() function.

Oh well, if you read the header file boost/test/unit_test.hpp you find these lines:

#if defined(BOOST_TEST_DYN_LINK) && defined(BOOST_TEST_MAIN) && !defined(BOOST_TEST_NO_MAIN)
int BOOST_TEST_CALL_DECL
main( int argc, char* argv[] )
{
    return ::boost::unit_test::unit_test_main( &init_unit_test, argc, argv );
}

so it seems that the framework will define main, if only you have defined the right symbols, and yes, adding

#define BOOST_TEST_DYN_LINK
#define BOOST_TEST_MAIN

to the top of the program makes it run.

There were a few other cases like this, where I couldn’t figure out how to use the framework.  Like testing template functions with a list of different template parameters, but I just worked my way around that problem with my own macros.

Using the framework

There’s a lot of different things you can do with Boost.Test, but so far I’ve just used the very basic functionality.

I use the BOOST_AUTO_TEST_CASE() macro for my test functions.  The different cases are automatically grouped into test suites — one per file (compilation unit) — so I don’t worry about the larger framework.  I write a few test cases per code unit I need to test and the rely on the default behaviour of the framework.

The actual testing is done through various BOOST_CHECK_* macros like in the program above.

Among the macros are tests for floating point numbers that lets you test that two numbers are equal up to a certain accuracy.  This is what you want to check, since testing equality of floating point numbers is rarely a good idea.

So far I’m happy with Boost.Test, and I’m going to try out some of the more advanced features as my project progresses, I think.