Posts Tagged ‘C++’

Designing a hidden Markov model framework

Wednesday, March 11th, 2009

Hidden Markov models (HMMs) are very powerful statistical models for sequence analysis, and are widely used in bioinformatics.

The main benefits of HMMs are that they are 1) usually a pretty easy way of modelling whatever underlying biology you are interested in, and 2) provide a very efficient way of computing the likelihood of your model.

Both are relative, of course, but compared to, say, MCMC methods — who also provide very powerful ways of modelling — the simplifying Markov assumption needed for HMMs simplify the modelling (at some cost in model fitting, of course) and inference in HMMs is usually many orders of magnitude faster than inference in MCMCs.

Still, even with very efficient inference algorithms, genome scale data can be a problem.  Analysing alignments of giga-base long genomes can still take CPU days.

So it makes sense to speed up the HMMs if we can.

An algorithmic approach to speeding up the basic HMM algorithms is not immediately obvious.  The basic algorithms simply fill in table entries in O(1) time per cell, so there is little that can be done here.

A more immediate solution is to exploit the parallelisation available on modern hardware, such as single instruction, multiple data (SIMD) operations and multiple cores.

At BiRC we had a Masters student, Asbjørn Brask, who recently wrote a thesis on this: Optimering af generelle skjulte Markovmodeller på multikerne CPUer og GPUer.  The focus was on “dense” HMMs, that is, HMMs where the transition probability matrix contains few if any zero probabilities.  These are common enough to be interesting, and relatively easy to deal with for SIMD instructions since it is easy to pack the entries in the transition matrix into machine words for parallel calculations.

We (myself and a student programmer, Andreas Sand) are now working on a HMM framework to make these ideas available in an easy-to use library.

This turns out to be a bit more complicated than we had hoped.

Not so much implementing the HMM algorithms.  For that we have Asbjørn’s code that we can use with a few changes here and there.  The problem is that there is a number of different, orthogonal, ways of representing the data and manipulating it in parallel, which if not handled properly leads to an exponential explosion in the number of classes.  And it is not quite as orthogonal after all, since data representation affects how we can manipulate the data with SIMD instructions.

Data representation

Take the simple table type:

template <typename floattype,
          typename simd_floattype = floattype,
          typename Allocator = BasicAllocator<floattype> >
class HMMTable
{ /* ... */ }

We have a type for the entries in the table. This is important for the precision of the floating point operations, but also has an impact on the speedup we get from SIMD instructions (the smaller the type, the more we can pack into a computer word).

Then we have the SIMD type.  For the most basic algorithms — with no SIMD acceleration — this would just be the same as the basic type, but for SIMD instructions it would be a larger computer word into which we can back several of the basic types for parallel manipulation.

Then there is allocation issues.  For SIMD instructions we need the right alignment of types, so we cannot simply use new and delete (where for a solution without SIMD of course we can).  So we need some sort of trait class for dealing with allocation and de-allocation of memory.

On top of this, of course, there’s issues such as how the basic type is packed into the SIMD type (which is both important for just accessing the table cells and also for the parallel operations on the table).  Yet another trait class might be needed here.

There is some orthogonality and some dependencies here.

Any basic type for the table cells should work with any SIMD type (as long as the SIMD type can contain the basic type, but that is not going to be a problem on any sensible hardware), but the allocation/deallocation of course depends on the SIMD type as does the packing of cell values into SIMD words.

HMM Algorithms

There’s only a few HMM algorithms needed for most analyses.  The Forward and Backward algorithms for maximizing the likelihood and the Viterbi and Posterior Decoding algorithms for analysing the states afterwards.

Still, with SIMD instructions thrown in, there is a lot of problems.

Ideally, we just want one implementation of each algorithm that should work with any type of table (and transition and emission matrices).  This doesn’t work, however, since the SIMD instructions and how we can use them in speeding up the algorithms depends on the data representation.

So besides parameterising the algorithms with the table and HMM types, we need a bunch of trait classes for doing all the basic operations, and these depends on the SIMD types in the table and the HMM transition and emission matrices.

Template meta programming?

For putting the different data representations together with the algorithms, we are now thinking about some template meta programming approach.

This way we can weave together the different data representations with the optimal algorithms for working on them, and the user needs only provide the basic type of the matrix cells.  The latter is best left to the user, who hopefully knows the precision he needs for the final results.

I have some experience with this from previous projects, but I don’t think I’ve ever gotten it “just right”, so it will be interesting to see how it turns out for this project…

70-87=-17

Why you need typename

Tuesday, January 20th, 2009

One of our PhD students had problems with C++ today and asked me for help.

He had a piece of code that, for this post, can be reduced to this:

#include <vector>

template <typename T>
class C {
    void foo(std::vector<t>::iterator i);
};

If you are an experienced C++ programmer, it doesn’t take you long to see the problem here, but if you are not, this piece of code looks perfectly valid and the error message you get from g++

test.cc:6: error: ‘class std::vector<T, std::allocator<_CharT> >::iterator’ is not a type
test.cc:7: error: expected unqualified-id at end of input

is of no help at all.

Why isn’t vector<T>::iterator a type? It always is!

Well, the thing is, that is not actually guaranteed.

Another example that shows the problem might make it clearer.

template <typename T>
struct C {
    int p; 

    void foo()
    {
	T::t * p;
    }
};

In foo, are we defining p as a pointer to a T::t, or are we just multiplying a static member T::t with the instance variable p?

You cannot know that before you know T, and to get around that problem, the C++ standard requires that you make types explicit using the typename keyword.  If you don’t, it will assume that you mean a static member.  So you should change the example to this:

template <typename T>
struct C {
    int p; 

    void foo()
    {
	typename T::t * p;
    }
};

What about the vector example, then?  Surely a vector<T>::iterator is always a type?

No, actually not.  You can always specialize a template like vector<T> and for example say that a vector of class C has a static member named iterator, instead of a typedef.

Think about vector<bool>.  That is a specialization, and is different from what the generic vector<T> would be, if it had been instantiated with T=bool.

Until the point where a template is instantiated, you really know very little about it, so you need to help the compiler.

The fix to the original problem is just adding the typename keyword:

#include <vector>

template <typename T>
class C {
    void foo(typename std::vector<t>::iterator i);
};

20-36=-16

Trying out Boost.Test

Wednesday, October 8th, 2008

I’ve just started a new programming project — a library for dense HMMs that uses parallel hardware for its computations, if you want to know — and I decided to use Boost.Test for my unit testing.

Normally, I just write my unit tests with asserts and maybe a few home-made macros, but since I am going to use Boost heavily in the code anyway (I do more and more these days) I figured I might as well try out its unit testing framework.

Problems with the documentation

To my great surprise, I had some problems with the documentation of the framework.

Usually, the documentation for the boost libraries is excellent — at least compared to most libraries I use — and if you just read the documentation for Boost.Test it looks great.

There is a lot of it, with detailed descriptions of this and that and with tutorials to get you started.

It’s just that the examples there do not work.

Take for example this program from the tutorial:

#define BOOST_TEST_MODULE MyTest
#include <boost/test/unit_test.hpp>

int add( int i, int j ) { return i+j; }

BOOST_AUTO_TEST_CASE( my_test )
{
    // seven ways to detect and report the same error:
    BOOST_CHECK( add( 2,2 ) == 4 );        // #1 continues on error

    BOOST_REQUIRE( add( 2,2 ) == 4 );      // #2 throws on error

    if( add( 2,2 ) != 4 )
      BOOST_ERROR( "Ouch..." );            // #3 continues on error

    if( add( 2,2 ) != 4 )
      BOOST_FAIL( "Ouch..." );             // #4 throws on error

    if( add( 2,2 ) != 4 ) throw "Ouch..."; // #5 throws on error

    BOOST_CHECK_MESSAGE( add( 2,2 ) == 4,  // #6 continues on error
                         "add(..) result: " << add( 2,2 ) );

    BOOST_CHECK_EQUAL( add( 2,2 ), 4 );	  // #7 continues on error
}

The BOOST_AUTO_TEST_CASE() macro should create a test function and plug it into the framework, and after compiling the file (and linking with -lboost_unit_test_framework) you should have a test program.

Well, you can compile the program, but you cannot link it.  There is no main() function.

Oh well, if you read the header file boost/test/unit_test.hpp you find these lines:

#if defined(BOOST_TEST_DYN_LINK) && defined(BOOST_TEST_MAIN) && !defined(BOOST_TEST_NO_MAIN)
int BOOST_TEST_CALL_DECL
main( int argc, char* argv[] )
{
    return ::boost::unit_test::unit_test_main( &init_unit_test, argc, argv );
}

so it seems that the framework will define main, if only you have defined the right symbols, and yes, adding

#define BOOST_TEST_DYN_LINK
#define BOOST_TEST_MAIN

to the top of the program makes it run.

There were a few other cases like this, where I couldn’t figure out how to use the framework.  Like testing template functions with a list of different template parameters, but I just worked my way around that problem with my own macros.

Using the framework

There’s a lot of different things you can do with Boost.Test, but so far I’ve just used the very basic functionality.

I use the BOOST_AUTO_TEST_CASE() macro for my test functions.  The different cases are automatically grouped into test suites — one per file (compilation unit) — so I don’t worry about the larger framework.  I write a few test cases per code unit I need to test and the rely on the default behaviour of the framework.

The actual testing is done through various BOOST_CHECK_* macros like in the program above.

Among the macros are tests for floating point numbers that lets you test that two numbers are equal up to a certain accuracy.  This is what you want to check, since testing equality of floating point numbers is rarely a good idea.

So far I’m happy with Boost.Test, and I’m going to try out some of the more advanced features as my project progresses, I think.

Newick C++ parser in Boost.Spirit

Friday, October 3rd, 2008

Here’s the Newick parser for C++ I mentioned in a previous post.  It is written in Boost.Spirit, so it is a recursive decend parser.  This might give you problems with the stack if you are parsing really deep trees, but for moderately sized trees (a few thousands of taxa) it shouldn’t be a problem.

It is code I pulled out of a tool for coalescence trees, so it only handles rooted binary trees.  To make it a little more useful I’ll add code for re-rooting the tree.  I have that code already, but in a different tool that only works on the tree topology, so I need to handle the branch lengths before I can add it here.

Let me know what you think.

Code rot?

Tuesday, January 22nd, 2008

I’ve just released a new version of QuickJoin. I just needed to add a tiny little feature, so it wasn’t that much of a deal, but I was horrified by the code. QuickJoin is from 2003 and one of the first applications I wrote in C++. The first two were QDist and SplitDist, and I dare not look at the code in those.

Did someone go in and change my code, or was I really that bad at C++ back then? Had I even heard of std::auto_ptr<>?