Archive for January, 2009

This week in the blogs

Sunday, January 25th, 2009

Well, everyone else seems to summarise the posts they found interesting during the week, so it is only fair that I get to as well.  Even with my new year resolution of posting on average a post per day, I cannot cover all the posts I find interesting, so it also gives me an opportunity to simply list a lot of links and perhaps group related posts so you have a chance of reading them together.

In this first installation, though, I’m going to go back a little further this month as well, though, since I collected a few interesting links there. Anyway, here goes:

Genetics

  1. Sequences from first settlers reveal rapid evolution in Icelandic mtDNA pool (PLoS Genetics)
    1. Genetic variation in space & time – Iceland (Gene Expression)
    2. The genetic history of Iceland (Genetic Future)
    3. Ancient DNA analysis of the Icelandic settlers (Me!)
    4. Genetic drift eliminated rare mtDNA haplotypes from Iceland (John Hawks)
    5. mtDNA selection in Iceland? (John Hawks)
  2. Pervasive Hitchhiking at coding and regulatory sites in humans (PLoS Genetics)
    1. Humans have adapted on genome-wide level? (Gene Expression)
    2. How much selection is going on in humans? (Me!)
  3. A genome-wide genetic signature of Jewish ancestry perfectly separates individuals with and without full Jewish ancestry in a large random sample of European Americans (Genome Biology)
    1. How Ashkenazi Jewish are you? (Gene Expression)
    2. Another paper on Ashkenazi Jewish distinctiveness (Dienekes)

Sequences and alignments

  1. Phylogenetic inference under recombination using Bayesian stochastic topology selection (Bioinformatics)
    1. Phylogenetic inference under recombination using Bayesian stochastic topology selection (Me!)
  2. The experts agree (Finchtalk)

Programming

  1. Dynamic languages: Not just for scripting any more (CIO)
  2. Emacs 23 (emacs-fu)

Teaching

  1. Making classes interactive: better learning or just more fun? (Discovering Biology in a Digital World)
  2. TeacherTube: YouTube for teachers (Discovering Biology in a Digital World)
  3. Students know what physicists believe, but they don’t agree: A study using the CLASS survey (Phys. Rev. ST Phys. Educ.)
    1. Students know what physicists belive, but they don’t agree (Uncertain Principles)

Peer reviewing

  1. How are the mighty fallen (Michael Nielsen)
  2. Three myths about scientific peer review (Michael Nielsen)

25-40=-15

Looking for new hardware…

Friday, January 23rd, 2009

After my laptop broke in Beijing, I’m looking for a new one.  This time it is going to be a mac (probably a macbook, since I don’t need the graphics capabilities of a macbook pro and I feel that the macbook air — while cool looking — has lousy specs compared to the macbook).

At the office I have two desktop computers.  An iMac that I use most of the time, for all my writing and lots of my data analysis, and a Linux box for all my heavy computations.

I write software on both.  I have essentially the same development tools on both (the setup is gcc based on both Linux and OS X) except for Xcode that I’ve had my problems with but am slowly getting used to.

Anyway, I’m thinking about getting a similar setup at home.  It is only the battery that is broken on my laptop, so I’ve essentially reduced it to a desktop machine, but it is still usable.  Not really a number cruncher as the box at the office, but I have a cluster and some scripts to deal with that…

Maybe I’ll get both an iMac and a laptop for home.  With the award I won last year, I’ve got some money to burn…

In any case, I’m getting a mac laptop.  Since I’ve started using Keynote to make presentations, I’m finding OpenOffice’s Impress too primitive, and I’m getting tired of having to convert my presentations to PDF to be able to use them when giving a talk.  Especially since it means that I have to prepare talks well in advance, and I’m unable to change them when I’m on the move.

23-39=-16

Playing with spatial queries in MySQL

Thursday, January 22nd, 2009

I’m working on a genome analysis project where we need to correlate a number of genome features.  We have a lot of alignment chunks that we’ve analysed with our CoalHMM method, and I need a fast way of looking up genomic positions for each chunk and combine these with other genomic features.

To do this, I am going to take John Major’s advice and use an SQL database.  We have a MySQL database rather than a PostgreSQL database, so I couldn’t just copy his hints, but I’ve played with doing the same thing with MySQL today.

To map each chunk to coordinates in each species in the alignment, I am making a table per species.  This will hold the chunk id together with the chunk locus (chromosome, start index and end index).

Explicit region representation

To try it out, I first made a table that explicitly stores this information:

CREATE TABLE chunks_humans (
       chunk INT PRIMARY KEY,
       chromosome VARCHAR(2) NOT NULL,
       chunk_start INT NOT NULL,
       chunk_end INT NOT NULL,
       INDEX (chromosome, chunk_start, chunk_end)
       );

and to test the performance I populated it with random chunks and timed the query time for varying number of queries using this Python script:

import MySQLdb
import random
import time
import sys

_CONN = MySQLdb.connect(host='localhost',
                        user='mailund',
                        db='geo_test')
_CURSOR = _CONN.cursor()

_CHROMOSOMES = map(str, range(1,23)) + ['X']

def random_region(n):
    '''Generate a random region of max length "n".'''
    c = random.choice(_CHROMOSOMES)
    start = random.randint(1,10000000)
    end = start+random.randint(1,n)
    return c, start, end

def populate(noRows):
    '''Insert "noRows" random regions into the database.'''
    for i in xrange(noRows):
        c, start, end = random_region(10000)
        _CURSOR.execute('''INSERT INTO chunks_humans
                           (chunk, chromosome, chunk_start, chunk_end)
                           VALUES (%s, %s, %s, %s)
                        ''',
                        (i, c, start, end))
    _CONN.commit()

def nuke():
    '''Delete all the rows, to make ready for another experiment...'''
    _CURSOR.execute('DELETE FROM chunks_humans;')
    _CONN.commit()

def query(c, start, end):
    '''Look up all chunks overlapping (c, start, end).'''
    _CURSOR.execute('''SELECT * from chunks_humans
                       WHERE chromosome = "%s"
                         AND chunk_start < = %d
                         AND chunk_end >= %d
                    '''
                    % (c, end, start))
    for hit in _CURSOR.fetchall():
        yield hit

def random_queries(n):
    '''Run "n" random queries...'''
    for i in xrange(n):
        c, start, end = random_region(100000)
        for hit in query(c, start, end):
            pass                # don't do anything... just time it...

nuke()
populate(10000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
    then = time.time()
    random_queries(n)
    now = time.time()
    print n, now-then
    sys.stdout.flush()

The results came out as this, when I ran it on my laptop:

Spatial queries

Next, I tried the spatial queries suggested by John.

The idea is to store the loci as 2D boxes and use MySQL’s geometric data structures to speed up the query.  If it knows that we are going to search for overlaps of regions, it can index the relevant data for faster look ups.  (If you know about geometric algorithms you know how this works, but it is a pleasant experience not having to implement R trees and such yourself but just rely on a database…)

Anyway, I represent genome regions as boxes where the x-coordinate is the base-pair coordinates and the y-axis is the chromosomes.

The table is defined as:

CREATE TABLE chunks_humans (
       chunk INT PRIMARY KEY,
       locus GEOMETRY NOT NULL,
       SPATIAL INDEX (locus)
       );

and the queries are based on intersections of the query box and the region boxes in the database (see populate and query below):

import MySQLdb
import random
import time
import sys

_CONN = MySQLdb.connect(host='localhost',
                        user='mailund',
                        db='geo_test')
_CURSOR = _CONN.cursor()

_CHROMOSOMES = map(str, range(1,23)) + ['X']
_CHRMAP = dict((c,i) for i,c in enumerate(_CHROMOSOMES))

def random_region(n):
    '''Generate a random region of max length "n".'''
    c = random.choice(_CHROMOSOMES)
    start = random.randint(1,10000000)
    end = start+random.randint(1,n)
    return c, start, end

def region2polygon(c, start, end):
    '''Translates a region into a MySQL polygon...'''
    c_as_y = _CHRMAP[c]
    return "GeomFromText('POLYGON((%d %f, %d %f, %d %f, %d %f, %d %f))')" \
            % (start, c_as_y,
               end, c_as_y,
               end, c_as_y+.1,
               start, c_as_y+.1,
               start, c_as_y)

def populate(noRows):
    '''Insert "noRows" random regions into the database.'''
    for i in xrange(noRows):
        c, start, end = random_region(10000)
        c_as_y = _CHRMAP[c]
        _CURSOR.execute('''INSERT INTO chunks_humans
                           (chunk, locus)
                           VALUES (%s, %s)
                        ''' %
                        (i, region2polygon(c, start, end)))
    _CONN.commit()

def nuke():
    '''Delete all the rows, to make ready for another experiment...'''
    _CURSOR.execute('DELETE FROM chunks_humans;')
    _CONN.commit()

def query(c, start, end):
    '''Look up all chunks overlapping (c, start, end).'''
    _CURSOR.execute('''SELECT * from chunks_humans
                       WHERE Intersects(locus, %s)
                    '''
                    % region2polygon(c, end, start))
    for hit in _CURSOR.fetchall():
        yield hit

def random_queries(n):
    '''Run "n" random queries...'''
    for i in xrange(n):
        c, start, end = random_region(100000)
        for hit in query(c, start, end):
            pass                # don't do anything... just time it...

nuke()
populate(1000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
    then = time.time()
    random_queries(n)
    now = time.time()
    print n, now-then
    sys.stdout.flush()

The time for this is much better than the explicit representation:

Especially considering that the number of entries in the database doesn’t seem to affect the query time much.  This is really good news, considering that I have lots of entries.

Getting genome coordinates back from the polygons?

Now the only problem I have left to solve is how to get the chromosome (y-coordinate) and the basepair coordinates (x-coordinates) back out of a MySQL polygon.  I haven’t figured that out, and I cannot find anything in the documentation about it…

Maybe I’ll just have to store the explicit information in the table, together with the polygon and only use the latter for queries… does anyone know how I can get it from the polygon?

22-38=-26

Trying out Google Reader

Wednesday, January 21st, 2009

Until now, I’ve used Evolution (on Linux) or Mail (on OSX) to read both mails and RSS feeds, but that means I’ve had to keep my list of RSS feeds synchronized on two machines, which is a bit of a drag.

Now that my Linux laptop is essentially a desktop machine (the battery is no longer working) I am planning on buying a new laptop, so I’ll have three machines where I’ll regularly read mails and RSS feeds.

Mails is not a problem — I don’t change mail servers regularly and I don’t add to the list of servers on a weekly basis — but the RSS feeds is.

So now I am considering switching to Google Reader.

Thanks Anders for the video.

So far, it looks like a fine reader to me, and with google gears it integrates nicely with Firefox.

I’m not getting the full benefit of it, ’cause I don’t get much benefit out of the “sharing” feature.  There’s only one in my friends list who uses Google Reader, so that’s not much fun.  I should get some more friends, I guess ;-)

My next project is figuring out FriendFeed…

21-37=-16

Why you need typename

Tuesday, January 20th, 2009

One of our PhD students had problems with C++ today and asked me for help.

He had a piece of code that, for this post, can be reduced to this:

#include <vector>

template <typename T>
class C {
    void foo(std::vector<t>::iterator i);
};

If you are an experienced C++ programmer, it doesn’t take you long to see the problem here, but if you are not, this piece of code looks perfectly valid and the error message you get from g++

test.cc:6: error: ‘class std::vector<T, std::allocator<_CharT> >::iterator’ is not a type
test.cc:7: error: expected unqualified-id at end of input

is of no help at all.

Why isn’t vector<T>::iterator a type? It always is!

Well, the thing is, that is not actually guaranteed.

Another example that shows the problem might make it clearer.

template <typename T>
struct C {
    int p; 

    void foo()
    {
	T::t * p;
    }
};

In foo, are we defining p as a pointer to a T::t, or are we just multiplying a static member T::t with the instance variable p?

You cannot know that before you know T, and to get around that problem, the C++ standard requires that you make types explicit using the typename keyword.  If you don’t, it will assume that you mean a static member.  So you should change the example to this:

template <typename T>
struct C {
    int p; 

    void foo()
    {
	typename T::t * p;
    }
};

What about the vector example, then?  Surely a vector<T>::iterator is always a type?

No, actually not.  You can always specialize a template like vector<T> and for example say that a vector of class C has a static member named iterator, instead of a typedef.

Think about vector<bool>.  That is a specialization, and is different from what the generic vector<T> would be, if it had been instantiated with T=bool.

Until the point where a template is instantiated, you really know very little about it, so you need to help the compiler.

The fix to the original problem is just adding the typename keyword:

#include <vector>

template <typename T>
class C {
    void foo(typename std::vector<t>::iterator i);
};

20-36=-16