Posts Tagged ‘software development’

Designing a hidden Markov model framework

Wednesday, March 11th, 2009

Hidden Markov models (HMMs) are very powerful statistical models for sequence analysis, and are widely used in bioinformatics.

The main benefits of HMMs are that they are 1) usually a pretty easy way of modelling whatever underlying biology you are interested in, and 2) provide a very efficient way of computing the likelihood of your model.

Both are relative, of course, but compared to, say, MCMC methods -- who also provide very powerful ways of modelling -- the simplifying Markov assumption needed for HMMs simplify the modelling (at some cost in model fitting, of course) and inference in HMMs is usually many orders of magnitude faster than inference in MCMCs.

Still, even with very efficient inference algorithms, genome scale data can be a problem.  Analysing alignments of giga-base long genomes can still take CPU days.

So it makes sense to speed up the HMMs if we can.

An algorithmic approach to speeding up the basic HMM algorithms is not immediately obvious.  The basic algorithms simply fill in table entries in O(1) time per cell, so there is little that can be done here.

A more immediate solution is to exploit the parallelisation available on modern hardware, such as single instruction, multiple data (SIMD) operations and multiple cores.

At BiRC we had a Masters student, Asbjørn Brask, who recently wrote a thesis on this: Optimering af generelle skjulte Markovmodeller på multikerne CPUer og GPUer.  The focus was on "dense" HMMs, that is, HMMs where the transition probability matrix contains few if any zero probabilities.  These are common enough to be interesting, and relatively easy to deal with for SIMD instructions since it is easy to pack the entries in the transition matrix into machine words for parallel calculations.

We (myself and a student programmer, Andreas Sand) are now working on a HMM framework to make these ideas available in an easy-to use library.

This turns out to be a bit more complicated than we had hoped.

Not so much implementing the HMM algorithms.  For that we have Asbjørn's code that we can use with a few changes here and there.  The problem is that there is a number of different, orthogonal, ways of representing the data and manipulating it in parallel, which if not handled properly leads to an exponential explosion in the number of classes.  And it is not quite as orthogonal after all, since data representation affects how we can manipulate the data with SIMD instructions.

Data representation

Take the simple table type:

template <typename floattype,
          typename simd_floattype = floattype,
          typename Allocator = BasicAllocator<floattype> >
class HMMTable
{ /* ... */ }

We have a type for the entries in the table. This is important for the precision of the floating point operations, but also has an impact on the speedup we get from SIMD instructions (the smaller the type, the more we can pack into a computer word).

Then we have the SIMD type.  For the most basic algorithms -- with no SIMD acceleration -- this would just be the same as the basic type, but for SIMD instructions it would be a larger computer word into which we can back several of the basic types for parallel manipulation.

Then there is allocation issues.  For SIMD instructions we need the right alignment of types, so we cannot simply use new and delete (where for a solution without SIMD of course we can).  So we need some sort of trait class for dealing with allocation and de-allocation of memory.

On top of this, of course, there's issues such as how the basic type is packed into the SIMD type (which is both important for just accessing the table cells and also for the parallel operations on the table).  Yet another trait class might be needed here.

There is some orthogonality and some dependencies here.

Any basic type for the table cells should work with any SIMD type (as long as the SIMD type can contain the basic type, but that is not going to be a problem on any sensible hardware), but the allocation/deallocation of course depends on the SIMD type as does the packing of cell values into SIMD words.

HMM Algorithms

There's only a few HMM algorithms needed for most analyses.  The Forward and Backward algorithms for maximizing the likelihood and the Viterbi and Posterior Decoding algorithms for analysing the states afterwards.

Still, with SIMD instructions thrown in, there is a lot of problems.

Ideally, we just want one implementation of each algorithm that should work with any type of table (and transition and emission matrices).  This doesn't work, however, since the SIMD instructions and how we can use them in speeding up the algorithms depends on the data representation.

So besides parameterising the algorithms with the table and HMM types, we need a bunch of trait classes for doing all the basic operations, and these depends on the SIMD types in the table and the HMM transition and emission matrices.

Template meta programming?

For putting the different data representations together with the algorithms, we are now thinking about some template meta programming approach.

This way we can weave together the different data representations with the optimal algorithms for working on them, and the user needs only provide the basic type of the matrix cells.  The latter is best left to the user, who hopefully knows the precision he needs for the final results.

I have some experience with this from previous projects, but I don't think I've ever gotten it "just right", so it will be interesting to see how it turns out for this project...

--

70-87=-17

Do you remember Guile?

Tuesday, January 20th, 2009

I used the Guile (Scheme) language as both a configuration language and embedded script language in a number of applications I wrote several years ago (e.g. GeneRecon and CoaSim).

I'm a firm believer in adding extension languages to programs.  As soon as a program is too complex to be run as a simple shell command (without too many command-line options) it needs at least a configuration file, and once you add that, the configuration options will start to get more and more complicated.

You don't want to write your own parser for such a configuration file. Ad hoc solutions will give you an awful format -- just think of sendmail -- and it really isn't worth the effort.  A much better solution is to use an existing programming language and just embed it in your program.

That was the idea behind Guile.  The inspiration was Emacs with its Lisp configuration and extension language.  Guile is a Scheme interpreter and is designed to be easy to embed in your own programs.

To use it, you write a few handles into your application that can be called from Guile, then in your main program you initiate an interpreter, read the configuration file and let the interpreter evaluate it, and viola you have added a lot of flexibility to how your program can be configured.

For even more flexibility, you can read Scheme code interactively while your application runs, or you can define hooks in your configuration script that can be called by the application to customise it.  I use both in CoaSim to get maximal flexibility in my simulator.

I don't use Guile any more. These days, I use Python for my extension language. Well, actually it is usually the other way around: I let Python be in charge and build packages that extends Python.  I find Python easier to work with for most applications than I find Scheme, and there is no doubt that Python is much better known in the bioinformatics field than Scheme is.

Still, I fondly remember Guile.  If you are interested in a little history lesson, check out A Brief History of Guile.

--

20-34=-14

Yet another Newick parser

Monday, January 19th, 2009

Whenever I try out a new framework for writing parsers, I always write a Newick parser.  It is the "Hello, World!" of parsers.

A little while back, I wrote a Newick parser for C++ using Boost.Spirit, and last month I needed to update my Python parser for Newick, so I tried out the Toy Parser Generator framework.

Some students had a problem with my old parser.  It wouldn't accept un-quoted labels with names starting with a number, such as 12b_foobar.  Those labels are, strictly speaking, perfectly valid in the Newick format, but my parser would break the label into two tokens, a number, 12, and a label b_foobar, and there were no easy way to fix that the way the parser is designed.

So I had an opportunity to try out one of the many Python parser frameworks.

I picked the Toy Parser Generator becaused it looked to be very easy to use, and it was.

You write a parser by writing a class that inherits from tpg.Parser, and you then write your tokenizer and parser as the documentation string of this class.  So, to ignore white space and consider every string not containing "^", ",", ":", ";", "(" and ")" a label -- as per the Newick standard -- you write the tokenizer like this:

class Parser(tpg.Parser):
    r"""
    separator space	'\s+' ;
    token label		'[^,:;() \t\n]+' ;
    ...
    """

Similarly, you write the parser in the documentation string using essentially a BNF grammar, but annotated with the actions you want the parser to take when it has parsed a given part of the input.

For example, if we say that a Newick tree consists of a "sub-tree" followed by ";" where a "sub-tree" can be either a leaf or an inner node, we write:

class Parser(tpg.Parser):
    r"""
    ...

    START/tree ->
    	Subtree/tree ';'
    ;
    Subtree/tree ->
    	Internal/tree
      | Leaf/tree
    ;

    ..."""

Here START is the root of the grammar -- and will be the result of the parser call -- while Subtree, Internal and Leaf are non-terminals in the grammar.

The name after the slash in START/tree, Subtree/tree, Internal/tree and Leaf/tree is an identifier used for the actions.

The name after the non-terminal before the "->" will be the result of parsing the sub-expression, so START will result in tree and so will Subtree, where tree is whatever the Python variable tree contains.

On the right hand of the "->" the names after the (non-)terminals will be the result of parsing a sub-expression, so for the START rule, START will return tree that is whatever the Subtree sub-expression returns.

For the Subtree rule, the parser will again return tree, but tree is either the result of parsing an Internal rule or parsing a Leaf rule; the result of both will be named tree (because of the Internal/tree and Leaf/tree).

For a more interesting example, we can look at the way we handle nodes in the tree:

class Parser(tpg.Parser):
    r"""
    ...
    Leaf/leaf ->
    	Name/name                        $ leaf = Leaf(name)
    ;
    Internal/tree ->
    	'\(' BranchList/st '\)' Name/n   $ tree = Internal(n,st)
    ;
    ..."""

A Leaf is just a Name, where Name is a rule I've defined elsewhere. A Name will return a value, name, (it will be a string, but that is only because Name is defined that way).  I translate that value into a Leaf using the expression after the $ where I set the value for "leaf" which is what the Leaf rule will return.

Internal will return a tree -- named "tree" -- and that consists of a list of branches in parenthesis followed by a name (the name of the inner node -- I have actually never seen Newick trees with named inner nodes, but it is in the standard).  The list of branches is parsed using the BranchList rule whos result is stored in the "st" variable (st for sub-trees).  The Name is stored in the "n" variable, and the result of Internal, "tree" is set in the expression after $.

All in all, it is a very easy way to write a parser.  Easier even that Boost.Spirit -- that is also a very nice framework -- and much easier than Yacc/Lex.

It doesn't help me at all with the problem I had with labels starting with numbers, though.

If you first tokenize and then parse, there is no easy way around that.  The string 12b_foobar really is two tokens, a number 12 and a string b_foobar.  Maybe with a framework that would go for the longest legal token first or something like that would help, but that isn't the case with this framework and I am not sure what other consequences that would have...

Anyway, I solved the problem by not solving it in the framework at all.  I just consider both labels and numbers as "labels" and handle the "labels" that should really be numbers -- the branch lengths -- in the expressions in the parser:

class Parser(tpg.Parser):
    r"""
    ...
    Name/name ->
    	label/name
      | Empty               $ name = ""
    ;

    Length/length ->
    	':' Number/length
      | Empty               $ length = 0
    ;

    Number/n ->
        label/l	            $ n = float(l)
    ;
    ..."""

The complete code for the parser can be downloaded here.

--

19-32=-13

Updating my Newick parser

Tuesday, September 30th, 2008

Back in 2003 I wrote a small parser for the Newick tree format.  It is pretty straightforward Python code, and basically just something I hacked up because I needed to manipulate some trees for a project.

Figuring that others might also find it useful I put it on my webpage and that's about it for the story of my Newick parser. I've used it in a few other projects, but haven't really developed it further from the initial code, and haven't really received much feedback on it.

Except for this weekend where I got three emails about it.  I might have received one email a year until now.

It was a few bug reports and some questions, and because of the bug reports I've now made a new release, version 1.3.

I also have a Newick parser for C++.  Actually, I have more than one, since there are two different parsers in QDist and SplitDist, but the one I have in mind is more stand-alone and can probably be used by others.

It is a recursive decent parser I wrote in Boost.Spirit as an exercise to learn the Spirit language.

I think I will clean it up a bit and put it up on the web...

A "biology" programming language?

Wednesday, July 23rd, 2008

According to this press release, some people at Harvard Medical School have developed a new programming language for biological modelling.  I guess protein modelling from the description, but I could be wrong.

"Through incorporating principles of engineering, we've developed a language that can describe biology in the same way a biologist would," says Jeremy Gunawardena, director of the Virtual Cell Program in Harvard Medical School's department of systems biology. "The potential here is enormous. This opens the door to actually performing discovery science, to look at things like drug interactions, right on the computer."

I'm not sure how much of a domain specific or how much of a general purpose programming language it is.  They are comparing it with LISP, so maybe it is a domain specific language with some meta-programming features?

I haven't been able to find the paper yet -- of course it isn't published until today and there might not be any preprints online -- but I'm curious about this...