Advancing science through blogging

There’s a paper in the latest issue of PLoS Biology about blogging: Advancing Science through Conversations: Bridging the Gap between Blogs and the Academy

The topic is how blogs can be used for outreach from scientists to the public.  That sounds fair enough.  There’s also a bit about how this can be formalised.

I’m not sure I would appreciate that much.  I like my blogs informal.  For formal stuff, I’ll stick to journals.

There’s a detailed comment on the paper at DrugMonkey.  By and large I agree with the comments there…

Profiling Python

I’ve spent today writing some scripts to analyse a genome. I have about 2 gigabases of alignment, so it can be pretty slow to extract information from it.  And the cluster here at BiRC is in heavy use these days, so I cannot gain much from parallelising the analysis.

Psyco and JIT compiling

I had hoped that I could gain the speed I needed using Psyco.  I’ve been lucky with that in the past, getting about a 10x speedup.

It is a JIT (just-in-time) compiler that actually compiles the code (generates machine code) rather than interpreting it.

With languages such as C++ you need to compile the code more or less manually before you can run it.  The compiler translates the high-level code into machine code that can actually run on the machine.

In Java you also compile the high-level code, but there you compile it into something called byte code that cannot actually run on the machine.  Instead, the byte code is interpreted by a Java Virtual Machine (JVM).  This is how you get their “write once, run anywhere”.

This is actually also what Python does, you probably just haven’t noticed, because Python does it automatically. When it loads a script, it first translates it into byte code and then interprets it.  If you import a module, you might have noticed a .pyc file appearing next to it.  That is the byte code for that module.  But you only see these files for modules, ’cause Python doesn’t write the byte code to file for your scripts.
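You can actually peek at this byte code with the standard dis module.  A small sketch (add is just a made-up toy function for illustration):

```python
import dis

def add(a, b):
    return a + b

# Print the byte-code instructions the interpreter executes for add().
dis.dis(add)
```

Each line of the output is one byte-code instruction, which is exactly what the interpreter (or a JIT) works on.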

For modern virtual machines, like the JVM, the byte code is not just interpreted.  The machine will analyse how the program executes and compile important functions (the hot spots) into machine code for faster execution.  This is what is called just-in-time compilation (sometimes just too late, but that is a bit cruel).

This, by the way, is also the technology running under the hood of Google’s Chrome browser.  There, the virtual machine runs JavaScript, but the idea is essentially the same as for the JVM.

Lars Bak, the lead engineer behind that virtual machine, explains it here:

To stray a bit from the main point I can mention that I had a class some years ago taught by Lars here in Aarhus.  Many of the people now working with Lars had the same class, so that was a successful class.  He is going to give it again this year, so don’t you just wish you were studying computer science at our university right now? ;-)

Anyway, back to Python.  Here you can also get JIT compiling using the module psyco.  All you have to do is add these two lines to your program:

import psyco
psyco.full()

You can get finer control over the JIT compilation by doing a little more, but this is the simplest way to get it going, and usually it pays off nicely.  It is a bit memory hungry, so it isn’t always the thing to do, but more often than not, it is.
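Since psyco is not installed everywhere, a common idiom is to guard the import so the script still runs (just uncompiled) where psyco is missing.  A minimal sketch; count_gc is a made-up toy function standing in for real work:

```python
def count_gc(seq):
    """Count G and C characters; typical loop-heavy code that a JIT likes."""
    n = 0
    for c in seq:
        if c in 'GC':
            n += 1
    return n

try:
    import psyco
    psyco.full()   # compile what it can
except ImportError:
    pass           # no psyco available; run interpreted

print(count_gc('ACGTGGCC'))  # → 6
```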

You get the most out of it when your code is actually doing something serious.  Lots of loops and branches and such.  Typical algorithmic code.

If you are mainly calling built-in functions, you do not gain as much.  Those are typically already compiled, so they are pretty optimised as it is.
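You can see this with the standard timeit module.  Here loop_sum is a hypothetical hand-rolled loop, compared against the built-in sum, which is already implemented in C:

```python
import timeit

def loop_sum(n):
    # Pure-Python loop: lots of byte code per iteration for a JIT to chew on.
    total = 0
    for i in range(n):
        total += i
    return total

t_loop = timeit.timeit(lambda: loop_sum(10000), number=100)
t_builtin = timeit.timeit(lambda: sum(range(10000)), number=100)
print(t_loop, t_builtin)  # the built-in is typically several times faster
```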

For my scripts, I am mainly scanning through the alignment and updating statistics in a bunch of tables, and I didn’t gain more than maybe a factor of two from psyco.  Not really enough for the speed I need.

Profiling Python

Before you go ahead and paint “go faster stripes” on your code, you should always profile it.  You will be surprised at where the majority of time is spent.

In Python you have three modules for profiling: profile, cProfile and hotshot.  The first two have the same functionality, but cProfile is written in C and is faster.

I typically use cProfile.  The hotshot module is not maintained (according to the Python documentation) and the documentation certainly isn’t, so I have never quite figured out how to use it.  I don’t have to either, though, ’cause cProfile does all I need it to.

With cProfile you can load the module in and have fine control over what you profile, but I find it easiest just to profile the entire script.  You can do this by loading the module directly when you call your script, like this:

python -m cProfile yourscript.py

After your program has finished, cProfile will write a summary of where the time was spent in the run.
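If you do want the finer in-process control, it can look like this; slow_sum is a made-up stand-in for real work:

```python
import cProfile

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

# Profile just this one call and print the summary to stdout.
profiler = cProfile.Profile()
result = profiler.runcall(slow_sum, 100000)
profiler.print_stats()
```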

If you want to analyse the result in more detail, you can write the profiler’s analysis to a file

python -m cProfile -o yourscript.prof yourscript.py

that you can then read into Python using the module pstats

>>> import pstats
>>> p = pstats.Stats('yourscript.prof')

and you can then sort the report using various criteria.  For example, to get the top 10 runtime sinners, you can do:

>>> p.sort_stats('time').print_stats(10)
Tue Sep 23 15:09:35 2008
         7932274 function calls in 23.290 CPU seconds
   Ordered by: internal time
   List reduced from 119 to 10 due to restriction <10>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.303   10.303   23.276   23.276 …(<module>)
  1314684    7.036    0.000   11.282    0.000 …
  5258736    4.246    0.000    4.246    0.000 …(<genexpr>)
   377622    1.187    0.000    1.215    0.000 /usr/lib/python2.5/site-packages/bx_python-0.5.0-py2.5-linux-i686.egg/bx/align/…
   862137    0.391    0.000    0.391    0.000 …
     9991    0.028    0.000    0.028    0.000 {range}
    39960    0.022    0.000    0.022    0.000 /usr/lib/python2.5/site-packages/bx_python-0.5.0-py2.5-linux-i686.egg/bx/align/…
        1    0.014    0.014   23.290   23.290 {execfile}
    10989    0.014    0.000    0.014    0.000 …
     4995    0.010    0.000    0.013    0.000 …(<genexpr>)

Here the topmost line is the entire script, and that is using all the time.  Not really a surprise.  The next three or four lines tell me where I should put my optimisation efforts.

(If you are now wondering why I bother optimising something that takes 23 CPU seconds, then congratulations on paying attention to the report and being sceptical about when to bother optimising.  I am only running my scripts on a very tiny fraction of my data for the profiling (anything else would take much too long), so that is why.  For the real data, optimising really is needed.)
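One last note on the sort keys: 'time' ranks functions by time spent in their own body, while 'cumulative' counts callees towards their callers, which is often better for deciding where to attack first.  A self-contained sketch, with made-up toy functions and a scratch file name:

```python
import cProfile
import pstats

def inner(n):
    return sum(i * i for i in range(n))

def outer():
    return [inner(10000) for _ in range(50)]

# Profile into a file, then load it back and sort by cumulative time.
profiler = cProfile.Profile()
profiler.runcall(outer)
profiler.dump_stats('toy.prof')

p = pstats.Stats('toy.prof')
p.sort_stats('cumulative').print_stats(5)
```

With 'cumulative', outer would show up near the top even though almost all the actual work happens inside inner.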