Archive for March 25th, 2009

Building a database of CoalHMM results

Wednesday, March 25th, 2009

In our CoalHMM work, we analyse whole genome alignments, and for each column in the alignment we essentially get the probability of being in different gene trees.  We want to look at the correlation of various genomic features (like gene density, conserved region, etc.) with the gene tree distribution.

To do this, we are building a MySQL database, so we can easily pick out the results for the alignment columns with various features.

One problem here is that we easily have 1-2 G of columns, and each column corresponds to different positions in the genomes we consider.  Plus, they are not contiguous in the genomes, even if they are in the aligment, because of various structural changes.

Since the genomic features are in genome coordinates (chromosomes plus positions on the chromosomes) we need a mapping from genomic coordinates to alignment columns.  With gigs of rows for both kinds of coordinate systems.

So we have a table with the probabilities for each column in the alignment, plus a table for each species mapping genomic coordinates to alignment columns.  The trick is now to have fast queries on the genomic coordinates to get the alignment columns.

I’ve been playing around with this, this afternoon.

Query times without indexing

With just plain tables, with no indexing on the columns, we have the worst case behaviour.  Before trying out any optimisations, I just wanted to figure out how slow queries would be.

So I tried out different alignment sizes, varying from 100K to 1M, and distributed these on chromosomes (of equal size) varying in size from 10K to 100K.

I then timed the queries for random chunks of 1K contiguous positions.  The results are plotted below:

The time doesn’t seem to depend much on the chromosome sizes, but depends significantly on the alignment size.

To get a feeling for the query time if I moved to a 2G alignment, I simply fitted a line to the query time (the dashed line in the top plot) and extrapolated from that.  The time came out as ~1400 seconds, which is clearly too slow for any serious analysis.

Query times with simple indexing

Clearly some indexing of columns is necessary, so I tried out the simple solution of putting an index on the chromosomes and on the chromosomal positions.

Results below:

This, not so surprisingly, improved the query time a lot, but extrapolating from the regression time still gives me ~12 seconds query times if the alignment is 2Gbp.

Query times with spatial queries

So I pulled out the old chestnut of spatial queries, and represented the genomic coordinates as two dimensional points (the chromosome on one axis and the chromosomal position on the other).

This time, the results came out as:

The regression line this time show no significant dependency on the alignment size (P=0.6), so if this is true I can extrapolate the query time for the 2Gbp alignment just from the mean query time from the smaller alignments.  Even if there is some dependency, I can still be luck and have a query time of fractions of a second.

Sweet.

Remaining problems

The only annoying thing with this solution is that I have to construct geometric structures as strings in scripts, like this:

def _region2poly(c,start,end):
    return 'GeomFromText("POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))")' \
            % (c-.1, start-.1,
               c+.1, start-.1,
               c+.1,end+.1,
               c-.1,end+.1,
               c-.1,start-.1)

This means that some of the logic in the database is not actually stored in the database, but in scripts accessing.  Especially annoying since I plan to access it from both Python and R (and possibly C++ where I’m thinking about writing a Qt application for interactive viewing of results).

That means a lot of duplicated code, which is never a good idea, and is bound to bite me in the bum at some later time.

Is this something that can be done with MySQL procedures?  I have never used those, and am really a novice when it comes to databases, so I really have no idea…

84-104=-20

MySQLdb (Python module) on OSX

Wednesday, March 25th, 2009

I’m working on a database for our CoalHMM analysis results.  Right now, I’m playing around with different database designs, so I want to play around with the database on my local machine rather than our shared server.

I used to just do this on Linux, but this is the first time I’m playing with MySQL on OS X.

Installing MySQL was pretty painless as you can download a binary package for it.

MySQLdb, the Python module I use to access the database, is a different matter, however.  I couldn’t find any binary packages and the source distribution doesn’t compile out of the box.

Luckily, I am not the first to want to use MySQLdb on a Mac, and a bit of googling found this page.

Worked like a charm, so now I am back to hacking data base designs.

84-103=-19

Coolest paper in a while, and I feel left out…

Wednesday, March 25th, 2009

This paper seems to be the coolest population genetics out in a while:

I have downloaded it, but I just haven’t had time to read it yet, and probably won’t have time to read it until the weekend.  I’m swamped with work, and I have three journal clubs Thursday and Friday, so the reading I do have time for goes to the papers for those.

I just can’t believe that I am postponing reading this paper to read about fungi genomics, of all things…

84-102=-18

Everything you know is wrong

Wednesday, March 25th, 2009

A post at Discovering Biology in a Digital World reminded me of this Weird Al song:

Not that the post is a parody or anything.  It is actually quite serious.

It lists a few of the things we “knew” about the genome, that turned out to be wrong once we improved our measurement technology, such as the ubiquity of structural variation or RNA genes.

84-101=-17