Archive for September, 2008

Preventing a pandemic — Google style

Wednesday, September 24th, 2008

The latest post in Google’s “Google at 10” series concerns epidemics.

Well, it mainly concerns eradicating infectious diseases with smallpox given as an example.

In case you’ve forgotten, smallpox was eradicated in the late 1970s (last case in 1977) through a vaccination program.  My parents received the vaccine but with the last case in Denmark in 1970 I didn’t (I was born in 1975).

It is a bit cool to think about it.  A disease was important enough to cause a global vaccination program in my parents generation, but for me there was no point; the disease had been wiped out.  I was vaccinated against TB (Calmette vaccination) but my younger sister weren’t ’cause by that time there were so few cases in Denmark that it wasn’t worth it.

We are getting pretty good at this.

Now, in the Google blog post, Dr. Larry Brilliant compares smallpox to the Black Death and bird flu.  That is a bit dramatic, I think.  Well, maybe not the bird flu — we don’t know how deadly that will be — but the Black Death was a bit more deadly.  Maybe not in the long run — smallpox has killed its share of people — but in the short run a pandemic like the Black Death (and more so the Spanish flu) is a lot more worrying.

To identify potentially emerging epidemics, Larry mentions Google’s Predict and Prevent initiative.

The post is a bit short on visions, at least compared to the earlier Google at 10 posts.

Epidemics (and pandemics) is an important issue, and something like Predict and Prevent can be important.  But what will it do in the future? What are the visions?

Gene-geography video

Tuesday, September 23rd, 2008

A youtube video about this paper:

You might also want to check out this paper.

Advancing science through blogging

Tuesday, September 23rd, 2008

There’s a paper in the latest issue of PLoS Biology about blogging: Advancing Science through Conversations: Bridging the Gap between Blogs and the Academy

The topic is how blogs can be used as an outreach from scientists to the public.  That sounds fair enough.  There’s also a bit about how this can be formalised.

I’m not sure I would appreciate that much.  I like my blogs informal.  For formal stuff, I’ll stick to journals.

There’s a detailed comment on the paper at DrugMonkey.  By and large I agree with the comments there…

Profiling Python

Tuesday, September 23rd, 2008

I’ve spent today writing some scripts to analyse a genome. I have about 2 giga bases of alignment, so it can be pretty slow to extract information from it.  Especially since the cluster here at BiRC is in heavy use these days, so I cannot gain much from parallelising the analysis.

Psyco and JIT compiling

I had hoped that I could gain the speed I needed using Psyco.  I’ve been lucky with that in the past, getting about x10 in speed.

It is a JIT (just-in-time) compiler that actually compiles (generate machine code) the code rather than interpret it.

With languages such as C++ you need to compile the code more or less manually before you can run it.  That translate the high-level code into machine code that can actually run on the machine.

In Java you also compile the high-level code, but there you compile it into something called byte code that cannot actually run on the machine.  Instead the byte code can be interpreted in a Java Virtual Machine (JVM).  This is how you get their “compile once, run everywhere”.

This is actually also what Python does, you probably just haven’t noticed, because Python does it automatically. When it loads a script, it first translates it into byte code and then it interprets it.  If you load a module, you might have noticed a .pyc file.  That is the byte code for that module.  But you only see these files for modules, ’cause Python doesn’t write the byte code to file for your scripts.

For modern virtual machines, like JVM, the byte code is not just interpreted.  The machine will analyse how the program executes and compile important functions — the hot spots — into machine code for faster execution.  This is what is called just in time compilation (sometimes just too late, but that is a bit cruel).

This, by the way, is also the technology running under the hood of Google’s chrome browser.  There, the virtual machine runs Javascript, but the idea is essentially the same as for the JVM.

Lars Bak, the lead engineer behind that virtual machine, explains it here:

To stray a bit from the main point I can mention that I had a class some years ago taught by Lars here in Aarhus.  Many of the people now working with Lars had the same class, so that was a successful class.  He is going to give it again this year, so don’t you just wish you were studying computer science at our university right now? ;-)

Anyway, back to Python.  Here you can also get JIT compiling using the module psyco.  All you have to is add these two lines to your program:

import psyco
psyco.full()

You can get finer control over the JIT compilation by doing a little more, but this is the simplest way to get it going, and usually it pays off nicely.  It is a bit memory hungry, so it isn’t always the thing to do, but more often than not, it is.

You get the most out of it when your code is actually doing something serious.  Lots of loops and branches and such.  Typical algorithmic code.

If you are mainly calling built in functions, you do not gain as much.  Those are already compiled, typically, so they are pretty optimised as it is.

For my scripts, I am mainly scanning through the alignment and updating statistics in a bunch of tables, and I didn’t gain more than maybe a factor of two from psyco.  Not really enough for the speed I need.

Profiling Python

Before you go ahead and paint “go faster stripes” on your code, you should always profile it.  You will be surprised at where the majority of time is spent.

In Python you have three modules for profiling: profile, cProfile and hotshot.  The first two have the same functionality, but cProfile is written in C and is faster.

I typically use cProfile.  The hotshot module is not maintained (according to the Python documentation) and the documentation certainly isn’t, so I have never quite figured out how to use it.  I don’t have to either, though, ’cause cProfile does all I need it to.

With cProfile you can load the module in and have fine control over what you profile, but I find it the easiest just to profile the entire script.  You can do this by loading the module directly when you call your script, like this:

python -m cProfile your-script.py

After your program has finished, cProfile will write a summary of where the time was spent in the run.

If you want to analyse the result in more detail, you can write the profiler’s analysis to a file

python -m cProfile -o profile.report your-script.py

that you can then read into Python using the module pstats

>>> import pstats
>>> p = pstats.Stats('profile.report')

and you can then sort the report using various criteria.  For example, to get the top 10 runtime sinners, you can do:

>>> p.sort_stats('time').print_stats(10)
Tue Sep 23 15:09:35 2008    profile.report
         7932274 function calls in 23.290 CPU seconds
   Ordered by: internal time
   List reduced from 119 to 10 due to restriction <10>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.303   10.303   23.276   23.276 extract-statistics.py:2(<module>)
  1314684    7.036    0.000   11.282    0.000 extract-statistics.py:47(is_singleton)
  5258736    4.246    0.000    4.246    0.000 extract-statistics.py:50(<genexpr>)
   377622    1.187    0.000    1.215    0.000 /usr/lib/python2.5/site-packages/bx_python-0.5.0-py2.5-linux-i686.egg/bx/align/core.py:125(column_iter)
   862137    0.391    0.000    0.391    0.000 extract-statistics.py:40(is_informative)
     9991    0.028    0.000    0.028    0.000 {range}
    39960    0.022    0.000    0.022    0.000 /usr/lib/python2.5/site-packages/bx_python-0.5.0-py2.5-linux-i686.egg/bx/align/core.py:193(__init__)
        1    0.014    0.014   23.290   23.290 {execfile}
    10989    0.014    0.000    0.014    0.000 extract-statistics.py:53(windows)
     4995    0.010    0.000    0.013    0.000 extract-statistics.py:28(<genexpr>)

Here the topmost line is the entire script, and that is using all the time.  Not really a surprise.  The next three or four lines tells me where I should put my optimisation efforts.

(If you are now wondering why I bother optimising something that takes 23 CPU seconds then congratulations for paying attention to the report and being sceptical about when to bother optimising.  I am only running my scripts on a very tiny fraction of my data for the profiling — anything else would take much too long — so that is why.  For the real data, optimising really is needed).

Maybe I’m not crazy after all…

Sunday, September 21st, 2008

This evening I was reading in Pattern Recognition and Machine Learning, the book we use in our machine learning class.  We only use the first half of the book, but we are thinking about extending the class to cover two terms and then cover the entire book (or most of it, anyway) so I figured this was a good excuse to actually read the whole book.  So far, I’ve only read the chapters we actually use, plus a few pages here and there.

Anyway, I was reading chapter 6, on kernel methods, but I got stuck on the first figure.

It is supposed to illustrate kernel functions k(x,x’) as linear combinations of feature functions: k(x,x’)=Σφi(xi(x’). The top row shows the feature functions, φi(x), and the bottom row the kernel function, as a function of x with x’ fixed at 0.

That doesn’t make any sense at all to me.

On the left-most figure, the feature functions are all 0 for x’=0, so the kernel function is a sum of zeroes.  It should be constant zero, not the curvy blue line.

For the other two, the feature functions are all non-negative, so how can the kernel function ever be negative?  A product of non-negative values cannot be negative, and neither can the sum of non-negative numbers.

In short, the figure is all wrong.  There isn’t a single thing right about it.

That was my reasoning, in any case, but I wasn’t completely sure.  I could be missing something.

So I googled for the book, but then I found powerpoint presentations including the figure, with no mentioning of any errors.  Clearly someone was using the figure in their teaching, so maybe it wasn’t wrong after all.

It got me nervous.  I feel that I really need to understand something to teach it, so I expect other people to feel the same way, and someone had used this figure.

I am not mentioning names here, ’cause as you have probably guessed the figure is wrong.  There is nothing wrong with my reasoning above.

Well, another minutes Googling found me the errata list, and sure enough, the figure is fixed there.

I’m happy to find that I hadn’t completely misunderstood the topic and that I was right about the figure.

I am a little disappointed that a teacher would use the figure without at least checking that the figure actually makes sense.  Showing an example that makes no sense at all is doing a lot of harm to the students…