Playing with spatial queries in MySQL
I'm working on a genome analysis project where we need to correlate a number of genome features. We have a lot of alignment chunks that we've analysed with our CoalHMM method, and I need a fast way of looking up genomic positions for each chunk and combine these with other genomic features.
To do this, I am going to take John Major's advice and use an SQL database. We have a MySQL database rather than a PostgreSQL database, so I couldn't just copy his hints, but I've played with doing the same thing with MySQL today.
To map each chunk to coordinates in each species in the alignment, I am making a table per species. This will hold the chunk id together with the chunk locus (chromosome, start index and end index).
Explicit region representation
To try it out, I first made a table that explicitly stores this information:
CREATE TABLE chunks_humans (
chunk INT PRIMARY KEY,
chromosome VARCHAR(2) NOT NULL,
chunk_start INT NOT NULL,
chunk_end INT NOT NULL,
INDEX (chromosome, chunk_start, chunk_end)
);
and to test the performance I populated it with random chunks and timed the query time for varying number of queries using this Python script:
import MySQLdb
import random
import time
import sys
_CONN = MySQLdb.connect(host='localhost',
user='mailund',
db='geo_test')
_CURSOR = _CONN.cursor()
_CHROMOSOMES = map(str, range(1,23)) + ['X']
def random_region(n):
'''Generate a random region of max length "n".'''
c = random.choice(_CHROMOSOMES)
start = random.randint(1,10000000)
end = start+random.randint(1,n)
return c, start, end
def populate(noRows):
'''Insert "noRows" random regions into the database.'''
for i in xrange(noRows):
c, start, end = random_region(10000)
_CURSOR.execute('''INSERT INTO chunks_humans
(chunk, chromosome, chunk_start, chunk_end)
VALUES (%s, %s, %s, %s)
''',
(i, c, start, end))
_CONN.commit()
def nuke():
'''Delete all the rows, to make ready for another experiment...'''
_CURSOR.execute('DELETE FROM chunks_humans;')
_CONN.commit()
def query(c, start, end):
'''Look up all chunks overlapping (c, start, end).'''
_CURSOR.execute('''SELECT * from chunks_humans
WHERE chromosome = "%s"
AND chunk_start < = %d
AND chunk_end >= %d
'''
% (c, end, start))
for hit in _CURSOR.fetchall():
yield hit
def random_queries(n):
'''Run "n" random queries...'''
for i in xrange(n):
c, start, end = random_region(100000)
for hit in query(c, start, end):
pass # don't do anything... just time it...
nuke()
populate(10000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
then = time.time()
random_queries(n)
now = time.time()
print n, now-then
sys.stdout.flush()
The results came out as this, when I ran it on my laptop:
Spatial queries
Next, I tried the spatial queries suggested by John.
The idea is to store the loci as 2D boxes and use MySQL's geometric data structures to speed up the query. If it knows that we are going to search for overlaps of regions, it can index the relevant data for faster look ups. (If you know about geometric algorithms you know how this works, but it is a pleasant experience not having to implement R trees and such yourself but just rely on a database...)
Anyway, I represent genome regions as boxes where the x-coordinate is the base-pair coordinates and the y-axis is the chromosomes.
The table is defined as:
CREATE TABLE chunks_humans (
chunk INT PRIMARY KEY,
locus GEOMETRY NOT NULL,
SPATIAL INDEX (locus)
);
and the queries are based on intersections of the query box and the region boxes in the database (see populate and query below):
import MySQLdb
import random
import time
import sys
_CONN = MySQLdb.connect(host='localhost',
user='mailund',
db='geo_test')
_CURSOR = _CONN.cursor()
_CHROMOSOMES = map(str, range(1,23)) + ['X']
_CHRMAP = dict((c,i) for i,c in enumerate(_CHROMOSOMES))
def random_region(n):
'''Generate a random region of max length "n".'''
c = random.choice(_CHROMOSOMES)
start = random.randint(1,10000000)
end = start+random.randint(1,n)
return c, start, end
def region2polygon(c, start, end):
'''Translates a region into a MySQL polygon...'''
c_as_y = _CHRMAP[c]
return "GeomFromText('POLYGON((%d %f, %d %f, %d %f, %d %f, %d %f))')" \
% (start, c_as_y,
end, c_as_y,
end, c_as_y+.1,
start, c_as_y+.1,
start, c_as_y)
def populate(noRows):
'''Insert "noRows" random regions into the database.'''
for i in xrange(noRows):
c, start, end = random_region(10000)
c_as_y = _CHRMAP[c]
_CURSOR.execute('''INSERT INTO chunks_humans
(chunk, locus)
VALUES (%s, %s)
''' %
(i, region2polygon(c, start, end)))
_CONN.commit()
def nuke():
'''Delete all the rows, to make ready for another experiment...'''
_CURSOR.execute('DELETE FROM chunks_humans;')
_CONN.commit()
def query(c, start, end):
'''Look up all chunks overlapping (c, start, end).'''
_CURSOR.execute('''SELECT * from chunks_humans
WHERE Intersects(locus, %s)
'''
% region2polygon(c, end, start))
for hit in _CURSOR.fetchall():
yield hit
def random_queries(n):
'''Run "n" random queries...'''
for i in xrange(n):
c, start, end = random_region(100000)
for hit in query(c, start, end):
pass # don't do anything... just time it...
nuke()
populate(1000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
then = time.time()
random_queries(n)
now = time.time()
print n, now-then
sys.stdout.flush()
The time for this is much better than the explicit representation:
Especially considering that the number of entries in the database doesn't seem to affect the query time much. This is really good news, considering that I have lots of entries.
Getting genome coordinates back from the polygons?
Now the only problem I have left to solve is how to get the chromosome (y-coordinate) and the basepair coordinates (x-coordinates) back out of a MySQL polygon. I haven't figured that out, and I cannot find anything in the documentation about it...
Maybe I'll just have to store the explicit information in the table, together with the polygon and only use the latter for queries... does anyone know how I can get it from the polygon?
--
22-38=-26


January 22nd, 2009 at 10:22 pm
Thanks for the testing...this will most definately come in handy.
January 22nd, 2009 at 10:23 pm
great post and nice hack !
January 23rd, 2009 at 5:04 pm
FYI...as far as 'getting genome coordinates back from the polygons'...MySQL provides a set of functions to convert the geometry format. Namely AsText(). No need to be redundant with data storage...
http://dev.mysql.com/doc/refman/5.0/en/functions-to-convert-geometries-between-formats.html
Hope this helps.
January 23rd, 2009 at 5:09 pm
Yeah, I know about AsText(), but I don't really want to have to parse the text output...
For points, there is functions like X() and Y() that I can use, but I couldn't find any functions for getting the points of the bounding box of an object :(
January 25th, 2009 at 4:42 am
I'm glad this worked out for you, and happy to see the MYSQL spatial package works well!
As far as extracting the start/end/chromosmeID... I've generally stored the chromosome, start, and end positions in addition to a spatial entry, just to simplify getting at those coordinates when I need them. The start and end positions probably map directly to the polygon x(start) x(end) positions, but then converting the Yrange to a chromosome name is an extra step. I just prefer to have the fields in my result set.
One word of caution. In some implementations of the spatial queries, when you ask for overlapping features, this might include features that touch, but do not technically overlap. Make sure to double check how MYSQL handles these cases.
John
January 25th, 2009 at 6:49 am
Thanks, I'll keep that in mind.
I've also given up on extracting coordinates from the polygons and store the indices directly
January 25th, 2009 at 12:08 pm
[...] my recent post on spatial SQL queries, I included plots showing the results of my performace tests. I wanted to show the results, and a [...]
January 30th, 2009 at 8:50 am
Re: extracting coordiantes.
As far as I can read you can convert the stores geomtery back to text using "Astext()", e.g.
SELECT AsTexty(g) FROM geom;
...if you want to save space...
http://dev.mysql.com/doc/refman/5.1/en/functions-to-convert-geometries-between-formats.html#function_astext
January 30th, 2009 at 8:59 am
Palle: Yes I know, and I commented on that above :)
I just don't want to have to parse the text representation of a polygon later on...
January 30th, 2009 at 12:38 pm
Arhhhh.... well, you know reading 15 lines up can be a very hard task... ;)
March 25th, 2009 at 7:48 pm
[...] I pulled out the old chestnut of spatial queries, and represented the genomic coordinates as two dimensional points (the chromosome on one axis and [...]