Archive for October, 2008

Python 2.6 is out

Wednesday, October 8th, 2008

I just saw that Python version 2.6 came out a few days ago.  See the list of changes here.

I haven’t upgraded yet, and I don’t think I am going to right now.  I didn’t spot any new features I just have to have.  Not like 2.5 where generator expressions were something I’ve missed. And still miss on our cluster at BiRC where we are still running 2.4 :-(

Trying out Boost.Test

Wednesday, October 8th, 2008

I’ve just started a new programming project — a library for dense HMMs that uses parallel hardware for its computations, if you want to know — and I decided to use Boost.Test for my unit testing.

Normally, I just write my unit tests with asserts and maybe a few home-made macros, but since I am going to use Boost heavily in the code anyway (I do more and more these days) I figured I might as well try out its unit testing framework.

Problems with the documentation

To my great surprise, I had some problems with the documentation of the framework.

Usually, the documentation for the boost libraries is excellent — at least compared to most libraries I use — and if you just read the documentation for Boost.Test it looks great.

There is a lot of it, with detailed descriptions of this and that and with tutorials to get you started.

It’s just that the examples there do not work.

Take for example this program from the tutorial:

#define BOOST_TEST_MODULE MyTest
#include <boost/test/unit_test.hpp>

int add( int i, int j ) { return i+j; }

BOOST_AUTO_TEST_CASE( my_test )
{
    // seven ways to detect and report the same error:
    BOOST_CHECK( add( 2,2 ) == 4 );        // #1 continues on error

    BOOST_REQUIRE( add( 2,2 ) == 4 );      // #2 throws on error

    if( add( 2,2 ) != 4 )
      BOOST_ERROR( "Ouch..." );            // #3 continues on error

    if( add( 2,2 ) != 4 )
      BOOST_FAIL( "Ouch..." );             // #4 throws on error

    if( add( 2,2 ) != 4 ) throw "Ouch..."; // #5 throws on error

    BOOST_CHECK_MESSAGE( add( 2,2 ) == 4,  // #6 continues on error
                         "add(..) result: " << add( 2,2 ) );

    BOOST_CHECK_EQUAL( add( 2,2 ), 4 );	  // #7 continues on error
}

The BOOST_AUTO_TEST_CASE() macro should create a test function and plug it into the framework, and after compiling the file (and linking with -lboost_unit_test_framework) you should have a test program.

Well, you can compile the program, but you cannot link it.  There is no main() function.

Oh well, if you read the header file boost/test/unit_test.hpp you find these lines:

#if defined(BOOST_TEST_DYN_LINK) && defined(BOOST_TEST_MAIN) && !defined(BOOST_TEST_NO_MAIN)
int BOOST_TEST_CALL_DECL
main( int argc, char* argv[] )
{
    return ::boost::unit_test::unit_test_main( &init_unit_test, argc, argv );
}

so it seems that the framework will define main, if only you have defined the right symbols, and yes, adding

#define BOOST_TEST_DYN_LINK
#define BOOST_TEST_MAIN

to the top of the program makes it run.

There were a few other cases like this, where I couldn’t figure out how to use the framework.  Like testing template functions with a list of different template parameters, but I just worked my way around that problem with my own macros.

Using the framework

There’s a lot of different things you can do with Boost.Test, but so far I’ve just used the very basic functionality.

I use the BOOST_AUTO_TEST_CASE() macro for my test functions.  The different cases are automatically grouped into test suites — one per file (compilation unit) — so I don’t worry about the larger framework.  I write a few test cases per code unit I need to test and the rely on the default behaviour of the framework.

The actual testing is done through various BOOST_CHECK_* macros like in the program above.

Among the macros are tests for floating point numbers that lets you test that two numbers are equal up to a certain accuracy.  This is what you want to check, since testing equality of floating point numbers is rarely a good idea.

So far I’m happy with Boost.Test, and I’m going to try out some of the more advanced features as my project progresses, I think.

A short introduction to the Human Genome

Tuesday, October 7th, 2008

At BiRC we have a small study group where we are reading Michael Lynch’s book, The Origins of Genome Architecture.  We take turns presenting a chapter, and last time it was my turn and the chaper (chapter 3) is on the Human Genome.

Below are my notes.  I’ve tried to translate basepairs and percentages into meters, ’cause it helps me visualise the relative sizes.  Even though it still boggles my mind a bit that the human genome is about 2m if you stretch it out.  Well, that can’t be helped.  Three billion is a lot, even if it is three billion very tiny things…

The human genome

The human genome is about 3Gbp which is about 2m if you stretch it out.  Of this, about 1%, or 2cm, is coding. That doesn’t mean that only 1% is genes, ’cause most of genes (even protein coding genes) are not actually coding.  We’ll get to that later.

There is nothing unusual about the human genome.  Not compared to other multicelular eukariotes, at least.  It is not the largest or the smallest and it does not have an unual number of genes.  In fact, it is common as mud.

The genes are grouped into families, where the families are phylgenetically related.  Some people have argued that the number of genes in a family tells you something about the importance of the family, but the distribution of family sizes could just as easily be explained by a stochastic birth/death process, at least if the gene families have different birth/death ratios.  (This begs the question of why they should have that, but the chapter doesn’t say and I don’t know).

Introns and exons

The protein coding genes consist of introns and exons.  The exons are what is left after the transcript is spliced and the introns is what is thrown out.

A personal comment: remember that exon is not synonymous with coding.  I’ve seen people confused about this (and made the mistake myself in a script or two).  The coding sequence starts somewhere along the sequence of exons and stops somewhere before the exons end. The coding sequence is a sub-sequence of the exons.

Back to the chapter…

The average exon length is about 0.15kbp while the average intron length is about 4.66kbp. So introns are on average about 40 times as long as exons.  If we imagine (not to scale this time) that the exons are 1 cm long, then we have 40 cm of introns between them.

That is a lot of intron…

Since complex organisms doesn’t seem to have dramatically more coding genes than simpler organisms, alternative splicing has been suggested to explain the complexity.

On average we have 2.6 different splice products per gene.  As far as we know, there is less alternative splicing in C. elegans and D. melanogaster, but then we know a lot more about alternative splicing in humans and mice that we do in any other organism, so we are boundt to have seen more rare splice variants here and the extra splicing we are seeing might easily be a selection bias.

We don’t really know how much of the alternative splicing is functionally important and how much is random “noise”.

Regulatory DNA

How much regulatory DNA do we have?  Based on conserved regions in the genome (which is probably a very conservative estimate) we have a few percent of the genome.

It seems, however, that the fraction of the genome that is conserved increases with organism complexity, so perhaps complexity is all in regulation?

Everything and the kitchen sink seems to be transcribed, and most of it differentially expressed, but how much of this is spurious we do not know.

We do know about different non-coding RNA genes, such as miRNA.  Although miRNA is only found in multi-cellular organisms, RNA interference is found in all domains of life.

Even accounting for regulatory DNA, 95% of the genome — 1.9 m — has no function that we know of.  Know of being the key word here, of course.  Don’t call it junk before you know that it is really doing nothing…

Mobile elements

There are about 100 times as many mobile elements as there are coding genes, and 75% — 1.5 meter — of the genome is a product of some mobile element or other.

(I have a lot of notes about the different kinds, but quite frankly I find it a bit dull, so I am not going to mention it here…)

Pseudogenes

There are about half as many pseudogenes as coding genes.

There are various mechanisms for introducing pseudogenes in the genome, including re-insertion of cDNA, tandem duplications and just plain inactivation of a gene.

The first two cases are likely to introduce pseudogenes “dead on arrival”.  A re-inserted cDNA won’t have the regulatory mechanisms to get transcribed, and a duplication is likely to disrupt it as well.

Pseudogenes seem to have a halflife of ~37MYA (halflife of the time it takes till we don’t recognize it as a copy of another gene) compared to a halflife of ~7.5MYA for active duplicated genes.  This seems to indicate that there isn’t much selection working against dead-on-arrival genes compared to duplicating an active gene and thus potentially doubling its product.

Human evolution

The chapter closes with an analysis of the human lineage and our evolution.  (Can we finally find something that makes us special?)

It seems that we have seen an increase in substitution rate on our line, but Lynch argues that what we are seeing is not so much adaptive selection but rather a reduced negative selection.

There can be different explanations for this.  A reduced population size can have reduced the effects of selection. A change in behaviour could have changed our fitness landscape, so traits that would normally be selected strongly against changing are now free to vary.

Damn stupid bug

Tuesday, October 7th, 2008

The last week I’ve been running scripts to analyse a whole genome alignment, and I got some weird results.  In some parts of the alignment, I was seeing way too much sequence divergence than there should be.

Now, manually inspecting 2Gbp of alignment is not an option, so I wrote a program to plot various summaries of chunks of the alignment.

Still, it is a lot of alignment, so scanning the genome with the plots didn’t work either, although the plots clearly showed that something was wrong.

I’ve been scratching my head about this for a while, and it wasn’t until I was heading to bed yesterday that it hit me: parts of the sequences are repeat-masked, so the case varies in the alignment. I consider A != a so I am seeing more variation than is really there!

I feel so stupid now.  If only I had taken a look at the actual sequences, this would have been obvious! But I didn’t, I wrote a program to summaries the data instead.

Stupid stupid stupid…

How much is that in Apollos?

Sunday, October 5th, 2008

I just have to share this comparison of the finacial crisis bailout and large-scale scientific programs:

… For example, you could build 70 Large Hadron Colliders (or maybe just one Ultra-Humongous Hadron Collider).

By the way, the entire US space effort through Apollo was about 100 billion in current dollars,  so we could do seven Apollos for the amount of the bailout.

You know, a few hundred billions here and a fwe hundred billions there, all of a sudden we’re talking real money!