Building a database of CoalHMM results
In our CoalHMM work, we analyse whole-genome alignments, and for each column in the alignment we essentially get the probability of being in each of the possible gene trees. We want to look at how various genomic features (gene density, conserved regions, etc.) correlate with the gene tree distribution.
To do this, we are building a MySQL database, so we can easily pick out the results for the alignment columns with various features.
One problem here is that we easily have 1-2 G of columns, and each column corresponds to different positions in the genomes we consider. Moreover, the positions are not contiguous in the genomes, even if they are in the alignment, because of various structural changes.
Since the genomic features are given in genome coordinates (a chromosome plus a position on that chromosome), we need a mapping from genomic coordinates to alignment columns, with gigs of rows in both kinds of coordinate systems.
So we have a table with the probabilities for each column in the alignment, plus a table for each species mapping genomic coordinates to alignment columns. The trick is now to have fast queries on the genomic coordinates to get the alignment columns.
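To make the layout concrete, here is a sketch of the two kinds of tables; the table and column names, and the number of gene trees, are placeholders of mine, not necessarily what the real database uses:

```python
def schema_ddl(species):
    """Build CREATE TABLE statements: one table of gene tree
    probabilities per alignment column, plus one table per species
    mapping genomic coordinates to alignment columns.
    All names here are assumptions."""
    stmts = ["""CREATE TABLE tree_probs (
    col_index INT UNSIGNED NOT NULL,  -- alignment column
    p_tree_1  FLOAT NOT NULL,         -- probability of gene tree 1
    p_tree_2  FLOAT NOT NULL,
    p_tree_3  FLOAT NOT NULL
)"""]
    for sp in species:
        stmts.append("""CREATE TABLE map_%s (
    chromosome SMALLINT UNSIGNED NOT NULL,  -- genomic coordinates...
    position   INT UNSIGNED NOT NULL,
    col_index  INT UNSIGNED NOT NULL        -- ...to alignment column
)""" % sp)
    return stmts
```

A genomic lookup then joins a map_&lt;species&gt; table to tree_probs on col_index, which is exactly the query that has to be fast.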
I spent this afternoon playing around with this.
Query times without indexing
With just plain tables, with no indexing on the columns, we have the worst case behaviour. Before trying out any optimisations, I just wanted to figure out how slow queries would be.
So I tried out different alignment sizes, varying from 100K to 1M columns, distributed over chromosomes of equal size, with the chromosome size varying from 10K to 100K.
I then timed the queries for random chunks of 1K contiguous positions. The results are plotted below:
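In code, each measurement is just a join query for a random window, wrapped in a timer; something like this sketch (table and column names are assumptions of mine):

```python
import random
import time

def chunk_query(map_table, chromosome, start, length=1000):
    """Fetch the gene tree probabilities for `length` contiguous
    chromosomal positions by joining the mapping table to the
    probability table (all names here are assumptions)."""
    return ("SELECT p.* FROM %s AS m JOIN tree_probs AS p "
            "ON p.col_index = m.col_index "
            "WHERE m.chromosome = %d AND m.position BETWEEN %d AND %d"
            % (map_table, chromosome, start, start + length - 1))

def timed(run_query, sql):
    """Time one query; run_query is any callable that executes SQL,
    e.g. a wrapper around a MySQLdb cursor's execute/fetchall."""
    t0 = time.perf_counter()
    run_query(sql)
    return time.perf_counter() - t0

# e.g. a random 1K window on a 100K chromosome:
sql = chunk_query('map_human', 1, random.randrange(1, 100000 - 999))
```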
The time doesn't seem to depend much on the chromosome sizes, but depends significantly on the alignment size.
To get a feeling for the query time if I moved to a 2G alignment, I simply fitted a line to the query time (the dashed line in the top plot) and extrapolated from that. The time came out as ~1400 seconds, which is clearly too slow for any serious analysis.
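The extrapolation itself is just a straight-line fit. A toy version, with made-up timings (the real measurements are in the plot; these placeholders are chosen to be exactly linear so the fit reproduces the ~1400 second figure):

```python
import numpy as np

# Placeholder timings, NOT the measured ones: exactly linear in the
# alignment size, so the extrapolation to 2G columns gives ~1400 s.
sizes = np.array([1e5, 2.5e5, 5e5, 7.5e5, 1e6])    # alignment columns
times = np.array([0.07, 0.175, 0.35, 0.525, 0.7])  # query time (s)

slope, intercept = np.polyfit(sizes, times, 1)  # least-squares line
predicted = slope * 2e9 + intercept             # evaluate at 2G columns
```

For these toy numbers, `predicted` comes out at roughly 1400 seconds.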
Query times with simple indexing
Clearly some indexing of columns is necessary, so I tried out the simple solution of putting an index on the chromosomes and on the chromosomal positions.
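In DDL terms this step amounts to something like the following (I sketch it as one composite index per mapping table; two separate single-column indexes would be the other reading, and the table names are assumptions):

```python
def index_ddl(species):
    """CREATE INDEX statements: one composite (chromosome, position)
    index per species' mapping table (table names are assumptions)."""
    return ["CREATE INDEX idx_%s_coords ON map_%s (chromosome, position)"
            % (sp, sp) for sp in species]
```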
This, not so surprisingly, improved the query time a lot, but extrapolating from the regression line still gives me query times of ~12 seconds if the alignment is 2 Gbp.
Query times with spatial queries
So I pulled out the old chestnut of spatial queries, and represented the genomic coordinates as two dimensional points (the chromosome on one axis and the chromosomal position on the other).
This time, the results came out as:
The regression line this time shows no significant dependency on the alignment size (P=0.6), so if this holds I can extrapolate the query time for the 2 Gbp alignment simply from the mean query time on the smaller alignments. Even if there is some dependency, I could still be lucky and get query times of fractions of a second.
The only annoying thing with this solution is that I have to construct geometric structures as strings in scripts, like this:
def _region2poly(c, start, end):
    return 'GeomFromText("POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))")' \
        % (c - .1, start - .1,
           c + .1, start - .1,
           c + .1, end + .1,
           c - .1, end + .1,
           c - .1, start - .1)
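For completeness, here is how such a polygon gets used in a lookup: the mapping table gets a spatial POINT column (called `coord` in this sketch, backed by a spatial index) and the query becomes an MBRContains test against the rectangle:

```python
def _region2poly(c, start, end):
    # the helper from above, repeated so this sketch is self-contained
    return 'GeomFromText("POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))")' \
        % (c - .1, start - .1, c + .1, start - .1, c + .1, end + .1,
           c - .1, end + .1, c - .1, start - .1)

def region_query(map_table, chromosome, start, end):
    """Select the alignment columns whose (chromosome, position) point
    lies inside the query rectangle; the table name and the spatial
    column name `coord` are assumptions."""
    return ("SELECT col_index FROM %s WHERE MBRContains(%s, coord)"
            % (map_table, _region2poly(chromosome, start, end)))
```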
This means that some of the logic of the database is not actually stored in the database, but in the scripts accessing it. That is especially annoying since I plan to access it from both Python and R (and possibly C++, since I'm thinking about writing a Qt application for interactive viewing of the results).
That means a lot of duplicated code, which is never a good idea, and is bound to bite me in the bum at some later time.
Is this something that can be done with MySQL procedures? I have never used those, and am really a novice when it comes to databases, so I really have no idea...
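From what I can tell, yes: a stored procedure can keep the polygon construction server-side, so every client just issues a CALL. A hedged sketch, kept as a Python string since that is how it would be shipped to the server; I haven't tested it against a live MySQL server, and the table/column names are the same kind of assumptions as above:

```python
# Untested sketch of a MySQL stored procedure that builds the query
# polygon with CONCAT on the server; table name `map_human` and
# spatial column `coord` are assumptions.
REGION_PROC = """
CREATE PROCEDURE region_columns(IN chrom INT, IN start_pos INT, IN end_pos INT)
BEGIN
    SELECT col_index FROM map_human
    WHERE MBRContains(
        GeomFromText(CONCAT(
            'POLYGON((',
            chrom - 0.1, ' ', start_pos - 0.1, ', ',
            chrom + 0.1, ' ', start_pos - 0.1, ', ',
            chrom + 0.1, ' ', end_pos + 0.1, ', ',
            chrom - 0.1, ' ', end_pos + 0.1, ', ',
            chrom - 0.1, ' ', start_pos - 0.1, '))')),
        coord);
END
"""
```

Python, R and a C++/Qt client would then all issue the same `CALL region_columns(3, 1000, 2000)`, and the geometry logic would live in exactly one place.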