Posts Tagged ‘Python’

Longest ordered subsequence

Friday, March 13th, 2009

If you are into string algorithms (and typical bioinformatics applications) you will love these two posts:

Yeah, I know it doesn’t look like bioinformatics at first sight, but it is very similar to the kind of string algorithms we use there.

It is great posts with a nice mix of coding and algorithmcs.

72-90=-18

This week in the blogs

Sunday, February 8th, 2009

It is Sunday, so it is time for me to list the blog posts I found interesting this week.

This week I’ve had a bit of a fight with creationism, so I’ve read some blog posts on that, but I am not going to dignify the discussion with linking to them here.

Instead it will just be the usual list of programming and bioinformatics links.

Enjoy!

Bioinformatics

Programming

Statistics

Blogging

39-59=-20

Python threads

Monday, February 2nd, 2009

Jesse Noller has an excellent post on Python Threads and the Global Interpreter Lock that is well worth a read.

I, in particular, liked the use of decorators to get the “synchronized” feature known from Java and the long discussion about the GIL.

I haven’t used threads much in my own Python code — I usually split my scripts in separate processes to parallelise them on our cluster — so there were a lot in this post I didn’t know about.

33-53=-20

Playing with spatial queries in MySQL

Thursday, January 22nd, 2009

I’m working on a genome analysis project where we need to correlate a number of genome features.  We have a lot of alignment chunks that we’ve analysed with our CoalHMM method, and I need a fast way of looking up genomic positions for each chunk and combine these with other genomic features.

To do this, I am going to take John Major’s advice and use an SQL database.  We have a MySQL database rather than a PostgreSQL database, so I couldn’t just copy his hints, but I’ve played with doing the same thing with MySQL today.

To map each chunk to coordinates in each species in the alignment, I am making a table per species.  This will hold the chunk id together with the chunk locus (chromosome, start index and end index).

Explicit region representation

To try it out, I first made a table that explicitly stores this information:

CREATE TABLE chunks_humans (
       chunk INT PRIMARY KEY,
       chromosome VARCHAR(2) NOT NULL,
       chunk_start INT NOT NULL,
       chunk_end INT NOT NULL,
       INDEX (chromosome, chunk_start, chunk_end)
       );

and to test the performance I populated it with random chunks and timed the query time for varying number of queries using this Python script:

import MySQLdb
import random
import time
import sys

_CONN = MySQLdb.connect(host='localhost',
                        user='mailund',
                        db='geo_test')
_CURSOR = _CONN.cursor()

_CHROMOSOMES = map(str, range(1,23)) + ['X']

def random_region(n):
    '''Generate a random region of max length "n".'''
    c = random.choice(_CHROMOSOMES)
    start = random.randint(1,10000000)
    end = start+random.randint(1,n)
    return c, start, end

def populate(noRows):
    '''Insert "noRows" random regions into the database.'''
    for i in xrange(noRows):
        c, start, end = random_region(10000)
        _CURSOR.execute('''INSERT INTO chunks_humans
                           (chunk, chromosome, chunk_start, chunk_end)
                           VALUES (%s, %s, %s, %s)
                        ''',
                        (i, c, start, end))
    _CONN.commit()

def nuke():
    '''Delete all the rows, to make ready for another experiment...'''
    _CURSOR.execute('DELETE FROM chunks_humans;')
    _CONN.commit()

def query(c, start, end):
    '''Look up all chunks overlapping (c, start, end).'''
    _CURSOR.execute('''SELECT * from chunks_humans
                       WHERE chromosome = "%s"
                         AND chunk_start < = %d
                         AND chunk_end >= %d
                    '''
                    % (c, end, start))
    for hit in _CURSOR.fetchall():
        yield hit

def random_queries(n):
    '''Run "n" random queries...'''
    for i in xrange(n):
        c, start, end = random_region(100000)
        for hit in query(c, start, end):
            pass                # don't do anything... just time it...

nuke()
populate(10000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
    then = time.time()
    random_queries(n)
    now = time.time()
    print n, now-then
    sys.stdout.flush()

The results came out as this, when I ran it on my laptop:

Spatial queries

Next, I tried the spatial queries suggested by John.

The idea is to store the loci as 2D boxes and use MySQL’s geometric data structures to speed up the query.  If it knows that we are going to search for overlaps of regions, it can index the relevant data for faster look ups.  (If you know about geometric algorithms you know how this works, but it is a pleasant experience not having to implement R trees and such yourself but just rely on a database…)

Anyway, I represent genome regions as boxes where the x-coordinate is the base-pair coordinates and the y-axis is the chromosomes.

The table is defined as:

CREATE TABLE chunks_humans (
       chunk INT PRIMARY KEY,
       locus GEOMETRY NOT NULL,
       SPATIAL INDEX (locus)
       );

and the queries are based on intersections of the query box and the region boxes in the database (see populate and query below):

import MySQLdb
import random
import time
import sys

_CONN = MySQLdb.connect(host='localhost',
                        user='mailund',
                        db='geo_test')
_CURSOR = _CONN.cursor()

_CHROMOSOMES = map(str, range(1,23)) + ['X']
_CHRMAP = dict((c,i) for i,c in enumerate(_CHROMOSOMES))

def random_region(n):
    '''Generate a random region of max length "n".'''
    c = random.choice(_CHROMOSOMES)
    start = random.randint(1,10000000)
    end = start+random.randint(1,n)
    return c, start, end

def region2polygon(c, start, end):
    '''Translates a region into a MySQL polygon...'''
    c_as_y = _CHRMAP[c]
    return "GeomFromText('POLYGON((%d %f, %d %f, %d %f, %d %f, %d %f))')" \
            % (start, c_as_y,
               end, c_as_y,
               end, c_as_y+.1,
               start, c_as_y+.1,
               start, c_as_y)

def populate(noRows):
    '''Insert "noRows" random regions into the database.'''
    for i in xrange(noRows):
        c, start, end = random_region(10000)
        c_as_y = _CHRMAP[c]
        _CURSOR.execute('''INSERT INTO chunks_humans
                           (chunk, locus)
                           VALUES (%s, %s)
                        ''' %
                        (i, region2polygon(c, start, end)))
    _CONN.commit()

def nuke():
    '''Delete all the rows, to make ready for another experiment...'''
    _CURSOR.execute('DELETE FROM chunks_humans;')
    _CONN.commit()

def query(c, start, end):
    '''Look up all chunks overlapping (c, start, end).'''
    _CURSOR.execute('''SELECT * from chunks_humans
                       WHERE Intersects(locus, %s)
                    '''
                    % region2polygon(c, end, start))
    for hit in _CURSOR.fetchall():
        yield hit

def random_queries(n):
    '''Run "n" random queries...'''
    for i in xrange(n):
        c, start, end = random_region(100000)
        for hit in query(c, start, end):
            pass                # don't do anything... just time it...

nuke()
populate(1000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
    then = time.time()
    random_queries(n)
    now = time.time()
    print n, now-then
    sys.stdout.flush()

The time for this is much better than the explicit representation:

Especially considering that the number of entries in the database doesn’t seem to affect the query time much.  This is really good news, considering that I have lots of entries.

Getting genome coordinates back from the polygons?

Now the only problem I have left to solve is how to get the chromosome (y-coordinate) and the basepair coordinates (x-coordinates) back out of a MySQL polygon.  I haven’t figured that out, and I cannot find anything in the documentation about it…

Maybe I’ll just have to store the explicit information in the table, together with the polygon and only use the latter for queries… does anyone know how I can get it from the polygon?

22-38=-26

Yet another Newick parser

Monday, January 19th, 2009

Whenever I try out a new framework for writing parsers, I always write a Newick parser.  It is the “Hello, World!” of parsers.

A little while back, I wrote a Newick parser for C++ using Boost.Spirit, and last month I needed to update my Python parser for Newick, so I tried out the Toy Parser Generator framework.

Some students had a problem with my old parser.  It wouldn’t accept un-quoted labels with names starting with a number, such as 12b_foobar.  Those labels are, strictly speaking, perfectly valid in the Newick format, but my parser would break the label into two tokens, a number, 12, and a label b_foobar, and there were no easy way to fix that the way the parser is designed.

So I had an opportunity to try out one of the many Python parser frameworks.

I picked the Toy Parser Generator becaused it looked to be very easy to use, and it was.

You write a parser by writing a class that inherits from tpg.Parser, and you then write your tokenizer and parser as the documentation string of this class.  So, to ignore white space and consider every string not containing “^”, “,”, “:”, “;”, “(” and “)” a label — as per the Newick standard — you write the tokenizer like this:

class Parser(tpg.Parser):
    r"""
    separator space	'\s+' ;
    token label		'[^,:;() \t\n]+' ;
    ...
    """

Similarly, you write the parser in the documentation string using essentially a BNF grammar, but annotated with the actions you want the parser to take when it has parsed a given part of the input.

For example, if we say that a Newick tree consists of a “sub-tree” followed by “;” where a “sub-tree” can be either a leaf or an inner node, we write:

class Parser(tpg.Parser):
    r"""
    ...

    START/tree ->
    	Subtree/tree ';'
    ;
    Subtree/tree ->
    	Internal/tree
      | Leaf/tree
    ;

    ..."""

Here START is the root of the grammar — and will be the result of the parser call — while Subtree, Internal and Leaf are non-terminals in the grammar.

The name after the slash in START/tree, Subtree/tree, Internal/tree and Leaf/tree is an identifier used for the actions.

The name after the non-terminal before the “->” will be the result of parsing the sub-expression, so START will result in tree and so will Subtree, where tree is whatever the Python variable tree contains.

On the right hand of the “->” the names after the (non-)terminals will be the result of parsing a sub-expression, so for the START rule, START will return tree that is whatever the Subtree sub-expression returns.

For the Subtree rule, the parser will again return tree, but tree is either the result of parsing an Internal rule or parsing a Leaf rule; the result of both will be named tree (because of the Internal/tree and Leaf/tree).

For a more interesting example, we can look at the way we handle nodes in the tree:

class Parser(tpg.Parser):
    r"""
    ...
    Leaf/leaf ->
    	Name/name                        $ leaf = Leaf(name)
    ;
    Internal/tree ->
    	'\(' BranchList/st '\)' Name/n   $ tree = Internal(n,st)
    ;
    ..."""

A Leaf is just a Name, where Name is a rule I’ve defined elsewhere. A Name will return a value, name, (it will be a string, but that is only because Name is defined that way).  I translate that value into a Leaf using the expression after the $ where I set the value for “leaf” which is what the Leaf rule will return.

Internal will return a tree — named “tree” — and that consists of a list of branches in parenthesis followed by a name (the name of the inner node — I have actually never seen Newick trees with named inner nodes, but it is in the standard).  The list of branches is parsed using the BranchList rule whos result is stored in the “st” variable (st for sub-trees).  The Name is stored in the “n” variable, and the result of Internal, “tree” is set in the expression after $.

All in all, it is a very easy way to write a parser.  Easier even that Boost.Spirit — that is also a very nice framework — and much easier than Yacc/Lex.

It doesn’t help me at all with the problem I had with labels starting with numbers, though.

If you first tokenize and then parse, there is no easy way around that.  The string 12b_foobar really is two tokens, a number 12 and a string b_foobar.  Maybe with a framework that would go for the longest legal token first or something like that would help, but that isn’t the case with this framework and I am not sure what other consequences that would have…

Anyway, I solved the problem by not solving it in the framework at all.  I just consider both labels and numbers as “labels” and handle the “labels” that should really be numbers — the branch lengths — in the expressions in the parser:

class Parser(tpg.Parser):
    r"""
    ...
    Name/name ->
    	label/name
      | Empty               $ name = ""
    ;

    Length/length ->
    	':' Number/length
      | Empty               $ length = 0
    ;

    Number/n ->
        label/l	            $ n = float(l)
    ;
    ..."""

The complete code for the parser can be downloaded here.

19-32=-13