Archive for October 8th, 2008

Some thoughts on grid computing…

Wednesday, October 8th, 2008

Earlier this week, the LHC Computing Grid went online.  A description of the system can be found here, and blog posts about it here, here and here.

This got me thinking about grid computing for small scale scientists like myself.

I’ve had some experience with grid computing (see an old post about it here) but mostly I have found it too much trouble to be worth the effort.

Our typical computer use

For large projects that require years of CPU time, it is well worth the effort to set up the infrastructure to run computations on grids.  You really need it to get your computations done, and the overhead is very small in comparison with the actual computation time.

Most of my projects — and most of the projects we do at BiRC — are a bit different.

We do need the computation power, but we are usually tinkering with our programs for most of a project — since we rarely know exactly how to analyse our data until we are mostly done with it — so we cannot just distribute a fixed version of our software and then start distributing the computations.

The typical workflow is that we write a program for our analysis, then we run the analysis, and when we look at the results we find something strange here and there. Then we extend the software, either to extract more information from the data or to fix a bug that caused the weird results.

We then need to run the analysis again, and repeat the process.

The analysis might take a few CPU days to a few CPU months — so it is small scale for grid applications — but between each analysis we spend a week or so modifying and testing our software.

We have a small cluster of Linux computers for this, and it is always in one of two states: completely overloaded or burning idle cycles.

This is the situation grid computing could fix.  Theoretically we should be able to get CPU cycles off the grid when we need them, and sell them to the grid when we are not running computations ourselves.

In practice, our work pattern makes this difficult.

The problems with small scale grid computing

If you are changing your software all the time, you need to distribute it together with the data you analyse.

This means you either send compiled binaries with the job submissions, or you compile the software as part of the job.

The former is fine if you have a program you can compile, and you’d better link it statically because there are no guarantees about the libraries you can find on the resources that will run it.

If you have a bunch of scripts, you are not so lucky.

There are no guarantees that the computer that will run the computations has the script interpreter — or if it does that it is a version that can run your script — and even if it does, what about the modules you need?

You don’t want to have to compile BioPython or SciPy on a grid machine just to run your scripts.  The overhead in CPU time is going to be several percent of your actual run time (at least if you parallelise your computations to a high enough degree to be worth the grid in the first place), and how can you even know that there is a compiler at the other end to compile it?  You can’t, and there probably isn’t unless you are very lucky.

It is a major pain to see your jobs aborted after slowly making their way through the job queue, just because the host computer cannot even set up the environment you need for your computations.

What can we do about it?

If we want to use the grid for even smaller scale computations, at the very least we need an easier way to distribute new versions of our programs.

I have an idea for this.

Some grids, at least, are already dealing with “runtime environments” where you can specify that your job needs to run in a certain runtime environment, and the scheduler will only send your jobs to resources that can provide that environment.

This sounds like just the thing, but the catch is that it is up to the resource administrators to set up these environments and to tell the grid system that they provide them.

For something like LHC, it is probably not a problem to convince administrators to provide the right environment, but for Thomas Mailund it is.

What we need is a way for the grid users to be able to install environments on the resources!

So how about this: we introduce the concept of “runtime environment packages” that we can upload to the grid system.  They consist of a setup script (configure ; make) and a test suite, for example.

When a resource is idle, it checks whether there are new environments available in the queue, downloads these, and tries to build and test them.  If it succeeds, it informs the grid system that it can run the new type of environment.  The scheduler only sends jobs to resources that have the right environments, so if your environment tests are working properly, you never end up on a resource that cannot run your jobs.
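To make the idea a bit more concrete, here is a minimal sketch of the bookkeeping in C++. Everything in it (Resource, report_build, eligible_resources) is a hypothetical name made up for illustration; it does not model any existing grid middleware, and the actual building and testing of a package is reduced to two booleans.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a grid resource: a name plus the runtime
// environments it has successfully built and tested.
struct Resource {
    std::string name;
    std::set<std::string> environments;
};

// Called after an idle resource has tried to build and test a package.
// The environment is only advertised if both steps succeeded.
void report_build(Resource& resource, const std::string& environment,
                  bool built, bool tests_passed) {
    assert(!environment.empty());
    if (built && tests_passed) {
        resource.environments.insert(environment);
    }
}

// The scheduler only considers resources that advertise the environment
// a job requires.
std::vector<std::string> eligible_resources(const std::vector<Resource>& resources,
                                            const std::string& required_environment) {
    std::vector<std::string> names;
    for (const Resource& resource : resources) {
        if (resource.environments.count(required_environment) > 0) {
            names.push_back(resource.name);
        }
    }
    return names;
}
```

A resource whose build or test step fails never shows up in eligible_resources(), which is the whole point: a job is never queued on a machine that cannot run it.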

We could even add environment requirements on the environment packages, so they don’t have to be self-contained.  E.g. to install SciPy, you don’t want to have to install Python itself, and there is no reason for resources without Python to try to install it only to give up.
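A rough sketch of that check, in C++, could look like the following. The names EnvironmentPackage and should_attempt_install are my own inventions for the example, and a real system would of course need version information as well.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of an uploaded runtime environment package.
struct EnvironmentPackage {
    std::string name;
    std::vector<std::string> prerequisites;  // environments that must already exist
};

// A resource should only spend idle cycles on a package if it does not
// already have it and every prerequisite is already installed.
bool should_attempt_install(const EnvironmentPackage& package,
                            const std::set<std::string>& installed) {
    assert(!package.name.empty());
    if (installed.count(package.name) > 0) {
        return false;  // nothing to do
    }
    for (const std::string& prerequisite : package.prerequisites) {
        if (installed.count(prerequisite) == 0) {
            return false;  // e.g. no Python, so do not even try SciPy
        }
    }
    return true;
}
```

With a package named "scipy" that lists "python" as its only prerequisite, a resource without Python skips the package immediately instead of starting a build that is bound to fail.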

To prevent resources from filling up with old environments, we can add a timeout period to environments, so they are deleted when they haven’t been used for a couple of days, weeks or months.
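The clean-up could be as simple as the sketch below. It assumes the resource keeps a table of when each environment was last used; the function name evict_stale and the table layout are assumptions for the example, not part of any real grid system.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::system_clock;

// Removes every environment that has not been used within `timeout` and
// returns the names of the removed environments. Actually deleting the
// installed files is left out of the sketch.
std::vector<std::string> evict_stale(std::map<std::string, Clock::time_point>& last_used,
                                     Clock::time_point now,
                                     Clock::duration timeout) {
    assert(timeout >= Clock::duration::zero());
    std::vector<std::string> removed;
    for (auto it = last_used.begin(); it != last_used.end();) {
        if (now - it->second > timeout) {
            removed.push_back(it->first);
            it = last_used.erase(it);  // erase returns the next valid iterator
        } else {
            ++it;
        }
    }
    return removed;
}
```

Calling it with a timeout of std::chrono::hours(24 * 14), for instance, drops everything that has been idle for more than two weeks and leaves the rest of the table untouched.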

It shouldn’t be that hard to implement.  I am sure I could do it, but I don’t have my own grid infrastructure to work with, so I guess I’ll have to intimidate, er, persuade someone else to do it…

StatAlign: a new statistical alignment tool

Wednesday, October 8th, 2008

There’s an application note in the current issue of Bioinformatics that describes a new tool for statistical alignment, StatAlign, developed in my old group in Oxford.

StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees

Ádám Novák, István Miklós, Rune Lyngsø and Jotun Hein

Bioinformatics 2008 24(20):2403-2404

Motivation: Bayesian analysis is one of the most popular methods in phylogenetic inference. The most commonly used methods fix a single multiple alignment and consider only substitutions as phylogenetically informative mutations, though alignments and phylogenies should be inferred jointly as insertions and deletions also carry informative signals. Methods addressing these issues have been developed only recently and there has not been so far a user-friendly program with a graphical interface that implements these methods.

Results: We have developed an extendable software package in the Java programming language that samples from the joint posterior distribution of phylogenies, alignments and evolutionary parameters by applying the Markov chain Monte Carlo method. The package also offers tools for efficient on-the-fly summarization of the results. It has a graphical interface to configure, start and supervise the analysis, to track the status of the Markov chain and to save the results. The background model for insertions and deletions can be combined with any substitution model. It is easy to add new substitution models to the software package as plugins. The samples from the Markov chain can be summarized in several ways, and new postprocessing plugins may also be installed.

I am personally a firm believer in statistical alignment.  I think it is the way to go, to deal with the uncertainty in inferred alignments and to avoid the artefacts they can create.

For a good introduction to the problems (and how statistical approaches to alignment can help), you should read Lunter et al., “Uncertainty in homology inferences: Assessing and improving genomic sequence alignment”, Genome Res. 18:298-309, 2008 (or my summary of it here).

StatAlign, the tool in the application note, looks like a nice way to attack alignments. Unlike previous approaches I’ve blogged about — and unlike my own small work in statistical alignment — it deals with multiple sequences (where MCMC is needed besides just HMMs).

It samples over both alignments and phylogenies, which is nice if there is any uncertainty in the phylogeny inference (which is typically based on alignments in the first place).

I can imagine that integrating over the phylogenies in the MCMC is the main time-killer, though, so it could be nice if you can turn that part of the state space exploration off in case you have a reasonable idea about the phylogeny but you are uncertain about some parts of the alignment…


A. Novak, I. Miklos, R. Lyngso, J. Hein (2008). StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics, 24 (20), 2403-2404. DOI: 10.1093/bioinformatics/btn457

Python 2.6 is out

Wednesday, October 8th, 2008

I just saw that Python version 2.6 came out a few days ago.  See the list of changes here.

I haven’t upgraded yet, and I don’t think I am going to right now.  I didn’t spot any new features I just have to have.  Not like 2.5 where generator expressions were something I’ve missed. And still miss on our cluster at BiRC where we are still running 2.4 :-(

Trying out Boost.Test

Wednesday, October 8th, 2008

I’ve just started a new programming project — a library for dense HMMs that uses parallel hardware for its computations, if you want to know — and I decided to use Boost.Test for my unit testing.

Normally, I just write my unit tests with asserts and maybe a few home-made macros, but since I am going to use Boost heavily in the code anyway (I do more and more these days) I figured I might as well try out its unit testing framework.

Problems with the documentation

To my great surprise, I had some problems with the documentation of the framework.

Usually, the documentation for the Boost libraries is excellent, at least compared to most libraries I use, and if you just read the documentation for Boost.Test it looks great.

There is a lot of it, with detailed descriptions of this and that and with tutorials to get you started.

It’s just that the examples there do not work.

Take for example this program from the tutorial:

#define BOOST_TEST_MODULE MyTest
#include <boost/test/unit_test.hpp>

int add( int i, int j ) { return i+j; }

BOOST_AUTO_TEST_CASE( my_test )
{
    // seven ways to detect and report the same error:
    BOOST_CHECK( add( 2,2 ) == 4 );        // #1 continues on error

    BOOST_REQUIRE( add( 2,2 ) == 4 );      // #2 throws on error

    if( add( 2,2 ) != 4 )
      BOOST_ERROR( "Ouch..." );            // #3 continues on error

    if( add( 2,2 ) != 4 )
      BOOST_FAIL( "Ouch..." );             // #4 throws on error

    if( add( 2,2 ) != 4 ) throw "Ouch..."; // #5 throws on error

    BOOST_CHECK_MESSAGE( add( 2,2 ) == 4,  // #6 continues on error
                         "add(..) result: " << add( 2,2 ) );

    BOOST_CHECK_EQUAL( add( 2,2 ), 4 );    // #7 continues on error
}

The BOOST_AUTO_TEST_CASE() macro should create a test function and plug it into the framework, and after compiling the file (and linking with -lboost_unit_test_framework) you should have a test program.

Well, you can compile the program, but you cannot link it.  There is no main() function.

Oh well, if you read the header file boost/test/unit_test.hpp you find these lines:

#if defined(BOOST_TEST_DYN_LINK) && defined(BOOST_TEST_MAIN) && !defined(BOOST_TEST_NO_MAIN)
int BOOST_TEST_CALL_DECL
main( int argc, char* argv[] )
{
    return ::boost::unit_test::unit_test_main( &init_unit_test, argc, argv );
}
#endif

so it seems that the framework will define main, if only you have defined the right symbols, and yes, adding

#define BOOST_TEST_DYN_LINK
#define BOOST_TEST_MAIN

to the top of the program makes it run.

There were a few other cases like this where I couldn’t figure out how to use the framework, such as testing template functions with a list of different template parameters, but I just worked my way around that problem with my own macros.
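For what it is worth, the template case can also be handled without macros and without Boost. The sketch below is not the code from my project and not the Boost.Test way of doing it (the Boost.Test documentation describes a BOOST_AUTO_TEST_CASE_TEMPLATE macro for this purpose); it is just a plain C++11 illustration, and the names add, check_add and CountFailures are made up for the example.

```cpp
#include <cassert>

// Function under test (hypothetical): a generic add.
template <typename T>
T add(T a, T b) { return a + b; }

// The check is written once and parameterised by the type.
template <typename T>
bool check_add() { return add<T>(T(2), T(2)) == T(4); }

// Runs check_add<T>() for every type in the list and counts the failures.
template <typename... Types>
struct CountFailures;

// Base case: an empty type list has no failures.
template <>
struct CountFailures<> {
    static int value() { return 0; }
};

// Recursive case: test the first type, then the rest of the list.
template <typename First, typename... Rest>
struct CountFailures<First, Rest...> {
    static int value() {
        return (check_add<First>() ? 0 : 1) + CountFailures<Rest...>::value();
    }
};

// Usage from a plain assert-based test. The extra parentheses keep the
// commas in the type list from confusing the assert macro.
inline void test_add_for_common_types() {
    assert((CountFailures<int, long, float, double>::value() == 0));
}
```

Adding a type to the test is then a matter of adding it to the list in one place.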

Using the framework

There are a lot of different things you can do with Boost.Test, but so far I’ve just used the very basic functionality.

I use the BOOST_AUTO_TEST_CASE() macro for my test functions.  The different cases are automatically grouped into test suites, one per file (compilation unit), so I don’t worry about the larger framework.  I write a few test cases per code unit I need to test and then rely on the default behaviour of the framework.

The actual testing is done through various BOOST_CHECK_* macros like in the program above.

Among the macros are tests for floating point numbers that let you test that two numbers are equal up to a certain accuracy.  This is what you want to check, since testing exact equality of floating point numbers is rarely a good idea.
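In Boost.Test this is what BOOST_CHECK_CLOSE is for; it takes the two values and a tolerance given in percent. To show the kind of comparison involved, here is a small stand-alone sketch in plain C++. The function close_percent is my own illustration and not the framework’s implementation, which may differ in its details.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true when a and b differ by at most tolerance_percent percent of
// the smaller of their magnitudes. Two exact zeros compare as close; a zero
// and a non-zero value never do.
bool close_percent(double a, double b, double tolerance_percent) {
    assert(tolerance_percent >= 0.0);
    const double difference = std::fabs(a - b);
    const double scale = std::min(std::fabs(a), std::fabs(b));
    return difference <= scale * tolerance_percent / 100.0;
}
```

With this, close_percent(0.1 + 0.2, 0.3, 1e-9) holds even though the two sides are not bit-for-bit equal in double precision, while close_percent(1.0, 1.1, 1.0) fails because the values differ by far more than one percent.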

So far I’m happy with Boost.Test, and I’m going to try out some of the more advanced features as my project progresses, I think.