Building a database of CoalHMM results
In our CoalHMM work, we analyse whole genome alignments, and for each column in the alignment we essentially get the probability of being in different gene trees. We want to look at the correlation of various genomic features (like gene density, conserved region, etc.) with the gene tree distribution.
To do this, we are building a MySQL database, so we can easily pick out the results for the alignment columns with various features.
One problem here is that we easily have 1-2 G of columns, and each column corresponds to different positions in the genomes we consider. Plus, they are not contiguous in the genomes, even if they are in the aligment, because of various structural changes.
Since the genomic features are in genome coordinates (chromosomes plus positions on the chromosomes) we need a mapping from genomic coordinates to alignment columns. With gigs of rows for both kinds of coordinate systems.
So we have a table with the probabilities for each column in the alignment, plus a table for each species mapping genomic coordinates to alignment columns. The trick is now to have fast queries on the genomic coordinates to get the alignment columns.
I've been playing around with this, this afternoon.
Query times without indexing
With just plain tables, with no indexing on the columns, we have the worst case behaviour. Before trying out any optimisations, I just wanted to figure out how slow queries would be.
So I tried out different alignment sizes, varying from 100K to 1M, and distributed these on chromosomes (of equal size) varying in size from 10K to 100K.
I then timed the queries for random chunks of 1K contiguous positions. The results are plotted below:
The time doesn't seem to depend much on the chromosome sizes, but depends significantly on the alignment size.
To get a feeling for the query time if I moved to a 2G alignment, I simply fitted a line to the query time (the dashed line in the top plot) and extrapolated from that. The time came out as ~1400 seconds, which is clearly too slow for any serious analysis.
Query times with simple indexing
Clearly some indexing of columns is necessary, so I tried out the simple solution of putting an index on the chromosomes and on the chromosomal positions.
Results below:
This, not so surprisingly, improved the query time a lot, but extrapolating from the regression time still gives me ~12 seconds query times if the alignment is 2Gbp.
Query times with spatial queries
So I pulled out the old chestnut of spatial queries, and represented the genomic coordinates as two dimensional points (the chromosome on one axis and the chromosomal position on the other).
This time, the results came out as:
The regression line this time show no significant dependency on the alignment size (P=0.6), so if this is true I can extrapolate the query time for the 2Gbp alignment just from the mean query time from the smaller alignments. Even if there is some dependency, I can still be luck and have a query time of fractions of a second.
Sweet.
Remaining problems
The only annoying thing with this solution is that I have to construct geometric structures as strings in scripts, like this:
def _region2poly(c,start,end):
return 'GeomFromText("POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))")' \
% (c-.1, start-.1,
c+.1, start-.1,
c+.1,end+.1,
c-.1,end+.1,
c-.1,start-.1)
This means that some of the logic in the database is not actually stored in the database, but in scripts accessing. Especially annoying since I plan to access it from both Python and R (and possibly C++ where I'm thinking about writing a Qt application for interactive viewing of results).
That means a lot of duplicated code, which is never a good idea, and is bound to bite me in the bum at some later time.
Is this something that can be done with MySQL procedures? I have never used those, and am really a novice when it comes to databases, so I really have no idea...
--
84-104=-20



March 25th, 2009 at 11:34 pm
Generally, you can get a lot of performance by picking the right storage engine (if you do few updates and a lot of reads, then pick MyISAM, else pick InnoDB). Also tweaking the parameters can help a lot (for example, ensure that your data / indexes can fit into memory).
It's also pretty important to pick the right data types with MySQL - for example, MySQL has shit loads of integer types, while a database like Oracle only has 1 :-p
Also, you can do "explain QUERY;" to see how MySQL uses the indexes - you can be surprised, or at least I am sometimes :-) This all said, doing "funky database schemas" is not recommended with MySQL.
It would have helped if you wrote how your structure looked like and what indexes you had tried with.
March 26th, 2009 at 7:43 am
Hi Amir, I did try both MyISAM and InnoDB (except for the spatial queries that only works with MyISAM last I checked, so I didn't try with InnoDB this time around).
I'll just dig up the schemas and post them in a min...
March 26th, 2009 at 7:51 am
The alignment table in all cases was:
and the indexed version of the genomic coordinates was
CREATE TABLE IF NOT EXISTS `orangutan_coalhmm`.`orangutan_genome` ( `chromosome` VARCHAR(2) NOT NULL , `position` INT UNSIGNED NOT NULL , `alignment_pos` INT UNSIGNED NOT NULL , PRIMARY KEY (`chromosome`, `position`) , CONSTRAINT `fk_orangutan_loci_posterior_probabilities` FOREIGN KEY (`alignment_pos` ) REFERENCES `orangutan_coalhmm`.`posterior_probabilities` (`alignment_pos` ) ON DELETE NO ACTION ON UPDATE NO ACTION)while the spatial queries tabel was
CREATE TABLE IF NOT EXISTS `orangutan_coalhmm`.`orangutan_genome` ( `chromosome` VARCHAR(2) NOT NULL , `position` INT UNSIGNED NOT NULL , `alignment_pos` INT UNSIGNED NOT NULL , `point` POINT NOT NULL , INDEX `fk_orangutan_loci_posterior_probabilities` (`alignment_pos` ASC) , PRIMARY KEY (`chromosome`, `position`) , SPATIAL INDEX `point` USING RTREE (`point` ASC) , CONSTRAINT `fk_orangutan_loci_posterior_probabilities` FOREIGN KEY (`alignment_pos` ) REFERENCES `orangutan_coalhmm`.`posterior_probabilities` (`alignment_pos` ) ON DELETE NO ACTION ON UPDATE NO ACTION)Of course the foreign key constraints are not checked with the MyISAM engine.
March 26th, 2009 at 10:50 am
...or you could use a binning scheme and index the bin column - which only requires a binning module to translate bins to/from coordinates.
I don't know if it will scale as good as the spatial queries though.
March 26th, 2009 at 11:04 am
The binning approach means I essentially have to implement my own interval trees. I doubt I can do that quite as efficiently as the MySQL guys have already done ;-)
Plus, I would still have to have a binning scheme per chromosome.
I think that solution would be a lot more complicated than the relatively simple spatial query trick.
March 26th, 2009 at 4:14 pm
[...] If you recall, I store genomic positions as geometric points so I can efficiently query regions. This means I have to store a "point" together with each genomic position, and that I need to create "rectangles" when I'm querying a genomic segment. [...]
April 2nd, 2009 at 10:30 pm
[...] just found out that the experiments I used to figure out the feasibility of analysing the CoalHMM runs through a database were buggy. I noticed that the query time once we started filling in actual [...]
August 19th, 2011 at 5:35 pm
I'm coming back to querying huge sets of features on the human genome... and am wondering if anyone has seen benchmarks between querying spatial indexed data vs. indexed VCF files?
Before I go and run these tests, figured I'd poke this old thread -- (and might have results to post soon)
john