Archive for March, 2009

Unladen Swallow

Friday, March 27th, 2009

At Google they are trying to give Python a real boost.  See Ars Technica’s post on the project.

The project is called Unladen Swallow, a reference to Monty Python and the Holy Grail, and the goals are:

We want to make Python faster, but we also want to make it easy for large, well-established applications to switch to Unladen Swallow.

  1. Produce a version of Python at least 5x faster than CPython.
  2. Python application performance should be stable.
  3. Maintain source-level compatibility with CPython applications.
  4. Maintain source-level compatibility with CPython extension modules.
  5. We do not want to maintain a Python implementation forever; we view our work as a branch, not a fork.

The main approach is to use LLVM (an open source virtual machine) and JIT compilation to speed up the code.  This is probably a good idea.  JIT approaches (plus dynamic runtime optimisation) has done wonders for Java and is under the hood of Google’s Chrome browser in the virtual machien V8.

More interesting, though, in my opinion is that they want to support multi-core machines by getting rid of the global interpreter lock (GIL).  Because of global synchronisation issues, multi-threading in Python isn’t quite as parallel as you might think.  It is fine for system calls without blocking, but not really for exploiting multiple cores.  But see Multiprocessing with Python (I wanted to write a separate post on that, but probably won’t have time, so now I’ll just link to it here).

Moore’s law is dead.  Processors are not getting faster, they are just getting more cores. See also Herb Sutter’s The Free Lunch is Over.  Multi-core software is going to be essential for high performance in the future, and by handling this in the VM for Python, rather than running separate processes, we might see runtime parallelisation optimisation.  That would be really exciting!

86-108=-22

Stored procedures in my CoalHMM database

Thursday, March 26th, 2009

Ok, I managed to hide the spatial queries behind stored procedures.  Good news, since now I don’t have to write this functionality in the different languages I will access the database in.

If you recall, I store genomic positions as geometric points so I can efficiently query regions.  This means I have to store a “point” together with each genomic position, and that I need to create “rectangles” when I’m querying a genomic segment.

Storing a point together with a genomic position, I handle as this:

CREATE PROCEDURE insert_orangutan_position (
  chromosome TINYINT ,
  chromosome_pos INT ,
  alignment_pos INT )
BEGIN
  DECLARE point POINT;
  SET point = GeomFromText(CONCAT('POINT(', chromosome, ' ', chromosome_pos, ')'));
  INSERT INTO orangutan_genome (chromosome, chromosome_pos, alignment_pos, genomic_point)
  VALUES (chromosome, chromosome_pos, alignment_pos, point);
END|

and querying I handle like this:

CREATE PROCEDURE query_orangutan_segment (
  chromosome TINYINT ,
  chromosome_start INT ,
  chromosome_end INT)
BEGIN
  DECLARE region POLYGON;
  SET region = GeomFromText(CONCAT('POLYGON((',
                             chromosome-.1, ' ', chromosome_start-.1, ', ',
                             chromosome+.1, ' ', chromosome_start-.1, ', ',
                             chromosome+.1, ' ', chromosome_end+.1, ', ',
                             chromosome-.1, ' ', chromosome_end+.1, ', ',
                             chromosome-.1, ' ', chromosome_start-.1,
                        '))'));
  SELECT chromosome, chromosome_pos, HC1, HC2, HO, CO
  FROM orangutan_genome INNER JOIN posterior_probabilities USING(alignment_pos)
  WHERE CONTAINS(region, genomic_point)
  ORDER BY chromosome_pos;
END|

I still need to write some accessing functions in my script languages, but those are very simple wrappers now.

85-107=-22

Reading about MySQL stored procedures

Thursday, March 26th, 2009

I need to finish the database of CoalHMM results this week, so we can start the analyses next week, so today I will write the code for querying genomic position with spatial indices.

I’ll first try with stored procedures, something I know absolutely nothing about, so I’ve googled for tutorials and found:

If you know of any better, please let me know.

If all this fails, I’ll have to hack up a Python module and some R code for accessing the database.  I’ll probably need that anyway, but the more I can share between them by putting it in the database the better.

85-106=-21

Not really my kind of game, but interesting…

Thursday, March 26th, 2009

Check out Make My Head Grow!

The object is to grow your head (by slamming it into the floor) and then use your gigantic head to push the opponent out of the game (by slamming your head against the wall).

Weird, but it managed to win all five categories of the Nordic Game Jam.

It is not really my kind of game — I prefer strategy games — but since four of the seven authors are students here in Århus, I thought I should mention it anyway.

85-105=-20

Building a database of CoalHMM results

Wednesday, March 25th, 2009

In our CoalHMM work, we analyse whole genome alignments, and for each column in the alignment we essentially get the probability of being in different gene trees.  We want to look at the correlation of various genomic features (like gene density, conserved region, etc.) with the gene tree distribution.

To do this, we are building a MySQL database, so we can easily pick out the results for the alignment columns with various features.

One problem here is that we easily have 1-2 G of columns, and each column corresponds to different positions in the genomes we consider.  Plus, they are not contiguous in the genomes, even if they are in the aligment, because of various structural changes.

Since the genomic features are in genome coordinates (chromosomes plus positions on the chromosomes) we need a mapping from genomic coordinates to alignment columns.  With gigs of rows for both kinds of coordinate systems.

So we have a table with the probabilities for each column in the alignment, plus a table for each species mapping genomic coordinates to alignment columns.  The trick is now to have fast queries on the genomic coordinates to get the alignment columns.

I’ve been playing around with this, this afternoon.

Query times without indexing

With just plain tables, with no indexing on the columns, we have the worst case behaviour.  Before trying out any optimisations, I just wanted to figure out how slow queries would be.

So I tried out different alignment sizes, varying from 100K to 1M, and distributed these on chromosomes (of equal size) varying in size from 10K to 100K.

I then timed the queries for random chunks of 1K contiguous positions.  The results are plotted below:

The time doesn’t seem to depend much on the chromosome sizes, but depends significantly on the alignment size.

To get a feeling for the query time if I moved to a 2G alignment, I simply fitted a line to the query time (the dashed line in the top plot) and extrapolated from that.  The time came out as ~1400 seconds, which is clearly too slow for any serious analysis.

Query times with simple indexing

Clearly some indexing of columns is necessary, so I tried out the simple solution of putting an index on the chromosomes and on the chromosomal positions.

Results below:

This, not so surprisingly, improved the query time a lot, but extrapolating from the regression time still gives me ~12 seconds query times if the alignment is 2Gbp.

Query times with spatial queries

So I pulled out the old chestnut of spatial queries, and represented the genomic coordinates as two dimensional points (the chromosome on one axis and the chromosomal position on the other).

This time, the results came out as:

The regression line this time show no significant dependency on the alignment size (P=0.6), so if this is true I can extrapolate the query time for the 2Gbp alignment just from the mean query time from the smaller alignments.  Even if there is some dependency, I can still be luck and have a query time of fractions of a second.

Sweet.

Remaining problems

The only annoying thing with this solution is that I have to construct geometric structures as strings in scripts, like this:

def _region2poly(c,start,end):
    return 'GeomFromText("POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))")' \
            % (c-.1, start-.1,
               c+.1, start-.1,
               c+.1,end+.1,
               c-.1,end+.1,
               c-.1,start-.1)

This means that some of the logic in the database is not actually stored in the database, but in scripts accessing.  Especially annoying since I plan to access it from both Python and R (and possibly C++ where I’m thinking about writing a Qt application for interactive viewing of results).

That means a lot of duplicated code, which is never a good idea, and is bound to bite me in the bum at some later time.

Is this something that can be done with MySQL procedures?  I have never used those, and am really a novice when it comes to databases, so I really have no idea…

84-104=-20