Playing with spatial queries in MySQL

I’m working on a genome analysis project where we need to correlate a number of genome features.  We have a lot of alignment chunks that we’ve analysed with our CoalHMM method, and I need a fast way of looking up genomic positions for each chunk and combine these with other genomic features.

To do this, I am going to take John Major’s advice and use an SQL database.  We have a MySQL database rather than a PostgreSQL database, so I couldn’t just copy his hints, but I’ve played with doing the same thing with MySQL today.

To map each chunk to coordinates in each species in the alignment, I am making a table per species.  This will hold the chunk id together with the chunk locus (chromosome, start index and end index).

Explicit region representation

To try it out, I first made a table that explicitly stores this information:

CREATE TABLE chunks_humans (
       chunk INT PRIMARY KEY,
       chromosome VARCHAR(2) NOT NULL,
       chunk_start INT NOT NULL,
       chunk_end INT NOT NULL,
       INDEX (chromosome, chunk_start, chunk_end)
       );

and to test the performance I populated it with random chunks and timed the query time for varying number of queries using this Python script:

import MySQLdb
import random
import time
import sys
 
_CONN = MySQLdb.connect(host='localhost',
                        user='mailund',
                        db='geo_test')
_CURSOR = _CONN.cursor()
 
_CHROMOSOMES = map(str, range(1,23)) + ['X']
 
def random_region(n):
    '''Generate a random region of max length "n".'''
    c = random.choice(_CHROMOSOMES)
    start = random.randint(1,10000000)
    end = start+random.randint(1,n)
    return c, start, end
 
def populate(noRows):
    '''Insert "noRows" random regions into the database.'''
    for i in xrange(noRows):
        c, start, end = random_region(10000)
        _CURSOR.execute('''INSERT INTO chunks_humans
                           (chunk, chromosome, chunk_start, chunk_end)
                           VALUES (%s, %s, %s, %s)
                        ''',
                        (i, c, start, end))
    _CONN.commit()
 
def nuke():
    '''Delete all the rows, to make ready for another experiment...'''
    _CURSOR.execute('DELETE FROM chunks_humans;')
    _CONN.commit()
 
def query(c, start, end):
    '''Look up all chunks overlapping (c, start, end).'''
    _CURSOR.execute('''SELECT * from chunks_humans
                       WHERE chromosome = "%s"
                         AND chunk_start < = %d
                         AND chunk_end >= %d
                    '''
                    % (c, end, start))
    for hit in _CURSOR.fetchall():
        yield hit
 
def random_queries(n):
    '''Run "n" random queries...'''
    for i in xrange(n):
        c, start, end = random_region(100000)
        for hit in query(c, start, end):
            pass                # don't do anything... just time it...
 
nuke()
populate(10000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
    then = time.time()
    random_queries(n)
    now = time.time()
    print n, now-then
    sys.stdout.flush()

The results came out as this, when I ran it on my laptop:

Spatial queries

Next, I tried the spatial queries suggested by John.

The idea is to store the loci as 2D boxes and use MySQL’s geometric data structures to speed up the query.  If it knows that we are going to search for overlaps of regions, it can index the relevant data for faster look ups.  (If you know about geometric algorithms you know how this works, but it is a pleasant experience not having to implement R trees and such yourself but just rely on a database…)

Anyway, I represent genome regions as boxes where the x-coordinate is the base-pair coordinates and the y-axis is the chromosomes.

The table is defined as:

CREATE TABLE chunks_humans (
       chunk INT PRIMARY KEY,
       locus GEOMETRY NOT NULL,
       SPATIAL INDEX (locus)
       );

and the queries are based on intersections of the query box and the region boxes in the database (see populate and query below):

import MySQLdb
import random
import time
import sys
 
_CONN = MySQLdb.connect(host='localhost',
                        user='mailund',
                        db='geo_test')
_CURSOR = _CONN.cursor()
 
_CHROMOSOMES = map(str, range(1,23)) + ['X']
_CHRMAP = dict((c,i) for i,c in enumerate(_CHROMOSOMES))
 
def random_region(n):
    '''Generate a random region of max length "n".'''
    c = random.choice(_CHROMOSOMES)
    start = random.randint(1,10000000)
    end = start+random.randint(1,n)
    return c, start, end
 
def region2polygon(c, start, end):
    '''Translates a region into a MySQL polygon...'''
    c_as_y = _CHRMAP[c]
    return "GeomFromText('POLYGON((%d %f, %d %f, %d %f, %d %f, %d %f))')" \
            % (start, c_as_y,
               end, c_as_y,
               end, c_as_y+.1,
               start, c_as_y+.1,
               start, c_as_y)
 
def populate(noRows):
    '''Insert "noRows" random regions into the database.'''
    for i in xrange(noRows):
        c, start, end = random_region(10000)
        c_as_y = _CHRMAP[c]
        _CURSOR.execute('''INSERT INTO chunks_humans
                           (chunk, locus)
                           VALUES (%s, %s)
                        ''' %
                        (i, region2polygon(c, start, end)))
    _CONN.commit()
 
def nuke():
    '''Delete all the rows, to make ready for another experiment...'''
    _CURSOR.execute('DELETE FROM chunks_humans;')
    _CONN.commit()
 
def query(c, start, end):
    '''Look up all chunks overlapping (c, start, end).'''
    _CURSOR.execute('''SELECT * from chunks_humans
                       WHERE Intersects(locus, %s)
                    '''
                    % region2polygon(c, end, start))
    for hit in _CURSOR.fetchall():
        yield hit
 
def random_queries(n):
    '''Run "n" random queries...'''
    for i in xrange(n):
        c, start, end = random_region(100000)
        for hit in query(c, start, end):
            pass                # don't do anything... just time it...
 
nuke()
populate(1000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
    then = time.time()
    random_queries(n)
    now = time.time()
    print n, now-then
    sys.stdout.flush()

The time for this is much better than the explicit representation:

Especially considering that the number of entries in the database doesn’t seem to affect the query time much.  This is really good news, considering that I have lots of entries.

Getting genome coordinates back from the polygons?

Now the only problem I have left to solve is how to get the chromosome (y-coordinate) and the basepair coordinates (x-coordinates) back out of a MySQL polygon.  I haven’t figured that out, and I cannot find anything in the documentation about it…

Maybe I’ll just have to store the explicit information in the table, together with the polygon and only use the latter for queries… does anyone know how I can get it from the polygon?

22-38=-26