Playing with spatial queries in MySQL

I’m working on a genome analysis project where we need to correlate a number of genome features.  We have a lot of alignment chunks that we’ve analysed with our CoalHMM method, and I need a fast way of looking up genomic positions for each chunk and combine these with other genomic features.

To do this, I am going to take John Major’s advice and use an SQL database.  We have a MySQL database rather than a PostgreSQL database, so I couldn’t just copy his hints, but I’ve played with doing the same thing with MySQL today.

To map each chunk to coordinates in each species in the alignment, I am making a table per species.  This will hold the chunk id together with the chunk locus (chromosome, start index and end index).

Explicit region representation

To try it out, I first made a table that explicitly stores this information:

and to test the performance I populated it with random chunks and timed the query time for varying number of queries using this Python script:

The results came out as this, when I ran it on my laptop:

Spatial queries

Next, I tried the spatial queries suggested by John.

The idea is to store the loci as 2D boxes and use MySQL’s geometric data structures to speed up the query.  If it knows that we are going to search for overlaps of regions, it can index the relevant data for faster look ups.  (If you know about geometric algorithms you know how this works, but it is a pleasant experience not having to implement R trees and such yourself but just rely on a database…)

Anyway, I represent genome regions as boxes where the x-coordinate is the base-pair coordinates and the y-axis is the chromosomes.

The table is defined as:

and the queries are based on intersections of the query box and the region boxes in the database (see populate and query below):

The time for this is much better than the explicit representation:

Especially considering that the number of entries in the database doesn’t seem to affect the query time much.  This is really good news, considering that I have lots of entries.

Getting genome coordinates back from the polygons?

Now the only problem I have left to solve is how to get the chromosome (y-coordinate) and the basepair coordinates (x-coordinates) back out of a MySQL polygon.  I haven’t figured that out, and I cannot find anything in the documentation about it…

Maybe I’ll just have to store the explicit information in the table, together with the polygon and only use the latter for queries… does anyone know how I can get it from the polygon?