Posts Tagged ‘genome analysis’

Everything you know is wrong

Wednesday, March 25th, 2009

A post at Discovering Biology in a Digital World reminded me of this Weird Al song:

Not that the post is a parody or anything.  It is actually quite serious.

It lists a few of the things we “knew” about the genome, that turned out to be wrong once we improved our measurement technology, such as the ubiquity of structural variation or RNA genes.

84-101=-17

Playing with spatial queries in MySQL

Thursday, January 22nd, 2009

I’m working on a genome analysis project where we need to correlate a number of genome features.  We have a lot of alignment chunks that we’ve analysed with our CoalHMM method, and I need a fast way of looking up genomic positions for each chunk and combine these with other genomic features.

To do this, I am going to take John Major’s advice and use an SQL database.  We have a MySQL database rather than a PostgreSQL database, so I couldn’t just copy his hints, but I’ve played with doing the same thing with MySQL today.

To map each chunk to coordinates in each species in the alignment, I am making a table per species.  This will hold the chunk id together with the chunk locus (chromosome, start index and end index).

Explicit region representation

To try it out, I first made a table that explicitly stores this information:

CREATE TABLE chunks_humans (
       chunk INT PRIMARY KEY,
       chromosome VARCHAR(2) NOT NULL,
       chunk_start INT NOT NULL,
       chunk_end INT NOT NULL,
       INDEX (chromosome, chunk_start, chunk_end)
       );

and to test the performance I populated it with random chunks and timed the query time for varying number of queries using this Python script:

import MySQLdb
import random
import time
import sys

_CONN = MySQLdb.connect(host='localhost',
                        user='mailund',
                        db='geo_test')
_CURSOR = _CONN.cursor()

_CHROMOSOMES = map(str, range(1,23)) + ['X']

def random_region(n):
    '''Generate a random region of max length "n".'''
    c = random.choice(_CHROMOSOMES)
    start = random.randint(1,10000000)
    end = start+random.randint(1,n)
    return c, start, end

def populate(noRows):
    '''Insert "noRows" random regions into the database.'''
    for i in xrange(noRows):
        c, start, end = random_region(10000)
        _CURSOR.execute('''INSERT INTO chunks_humans
                           (chunk, chromosome, chunk_start, chunk_end)
                           VALUES (%s, %s, %s, %s)
                        ''',
                        (i, c, start, end))
    _CONN.commit()

def nuke():
    '''Delete all the rows, to make ready for another experiment...'''
    _CURSOR.execute('DELETE FROM chunks_humans;')
    _CONN.commit()

def query(c, start, end):
    '''Look up all chunks overlapping (c, start, end).'''
    _CURSOR.execute('''SELECT * from chunks_humans
                       WHERE chromosome = "%s"
                         AND chunk_start < = %d
                         AND chunk_end >= %d
                    '''
                    % (c, end, start))
    for hit in _CURSOR.fetchall():
        yield hit

def random_queries(n):
    '''Run "n" random queries...'''
    for i in xrange(n):
        c, start, end = random_region(100000)
        for hit in query(c, start, end):
            pass                # don't do anything... just time it...

nuke()
populate(10000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
    then = time.time()
    random_queries(n)
    now = time.time()
    print n, now-then
    sys.stdout.flush()

The results came out as this, when I ran it on my laptop:

Spatial queries

Next, I tried the spatial queries suggested by John.

The idea is to store the loci as 2D boxes and use MySQL’s geometric data structures to speed up the query.  If it knows that we are going to search for overlaps of regions, it can index the relevant data for faster look ups.  (If you know about geometric algorithms you know how this works, but it is a pleasant experience not having to implement R trees and such yourself but just rely on a database…)

Anyway, I represent genome regions as boxes where the x-coordinate is the base-pair coordinates and the y-axis is the chromosomes.

The table is defined as:

CREATE TABLE chunks_humans (
       chunk INT PRIMARY KEY,
       locus GEOMETRY NOT NULL,
       SPATIAL INDEX (locus)
       );

and the queries are based on intersections of the query box and the region boxes in the database (see populate and query below):

import MySQLdb
import random
import time
import sys

_CONN = MySQLdb.connect(host='localhost',
                        user='mailund',
                        db='geo_test')
_CURSOR = _CONN.cursor()

_CHROMOSOMES = map(str, range(1,23)) + ['X']
_CHRMAP = dict((c,i) for i,c in enumerate(_CHROMOSOMES))

def random_region(n):
    '''Generate a random region of max length "n".'''
    c = random.choice(_CHROMOSOMES)
    start = random.randint(1,10000000)
    end = start+random.randint(1,n)
    return c, start, end

def region2polygon(c, start, end):
    '''Translates a region into a MySQL polygon...'''
    c_as_y = _CHRMAP[c]
    return "GeomFromText('POLYGON((%d %f, %d %f, %d %f, %d %f, %d %f))')" \
            % (start, c_as_y,
               end, c_as_y,
               end, c_as_y+.1,
               start, c_as_y+.1,
               start, c_as_y)

def populate(noRows):
    '''Insert "noRows" random regions into the database.'''
    for i in xrange(noRows):
        c, start, end = random_region(10000)
        c_as_y = _CHRMAP[c]
        _CURSOR.execute('''INSERT INTO chunks_humans
                           (chunk, locus)
                           VALUES (%s, %s)
                        ''' %
                        (i, region2polygon(c, start, end)))
    _CONN.commit()

def nuke():
    '''Delete all the rows, to make ready for another experiment...'''
    _CURSOR.execute('DELETE FROM chunks_humans;')
    _CONN.commit()

def query(c, start, end):
    '''Look up all chunks overlapping (c, start, end).'''
    _CURSOR.execute('''SELECT * from chunks_humans
                       WHERE Intersects(locus, %s)
                    '''
                    % region2polygon(c, end, start))
    for hit in _CURSOR.fetchall():
        yield hit

def random_queries(n):
    '''Run "n" random queries...'''
    for i in xrange(n):
        c, start, end = random_region(100000)
        for hit in query(c, start, end):
            pass                # don't do anything... just time it...

nuke()
populate(1000)
print 'no.queries time'
for n in xrange(500, 5500, 500):
    then = time.time()
    random_queries(n)
    now = time.time()
    print n, now-then
    sys.stdout.flush()

The time for this is much better than the explicit representation:

Especially considering that the number of entries in the database doesn’t seem to affect the query time much.  This is really good news, considering that I have lots of entries.

Getting genome coordinates back from the polygons?

Now the only problem I have left to solve is how to get the chromosome (y-coordinate) and the basepair coordinates (x-coordinates) back out of a MySQL polygon.  I haven’t figured that out, and I cannot find anything in the documentation about it…

Maybe I’ll just have to store the explicit information in the table, together with the polygon and only use the latter for queries… does anyone know how I can get it from the polygon?

22-38=-26

Photo session

Thursday, December 4th, 2008

Hi, sorry I haven’t written anything in ages.  I’ve been completely swamped in work on a genome project, but one I don’t think I can write about until the paper is out, and since I haven’t thought about much else the last couple of weeks I’ve been silent here.

Don’t worry, I’ll tell all about it when we are getting closer to publication.

Anyway, a few days ago I had a photo session related to the project (and a previous project) and you can see some of the pics below.

(Photographs by Jesper Voldgaard).

As you can no doubt guess from the pictures, some abes are involved in the project…

Framework for whole genome sequence analysis

Friday, October 10th, 2008

I’m working on a project right now where I am analysing a whole genome alignment of human, chimp, orangutan and macaque.

I’ve downloaded the alignment from the UCSC genome browser, it is in the MAF format, and I use the bx-python module from Penn State for the analysis. That is a very nice module, by the way.

For scanning through the alignment and collecting various statistics, MAF+bx-python is a fine combination, but now I have some interesting regions I want to analyse in more depth, and then the combination is not quite good enough.

The MAF files are essentially sequential chunks of alignments, and to find a particular region in the genome, I need to scan through a large file until I find the region.

I have solved this problem by extracting the regions and put them in a number of different files, that I can then easier access.  But wouldn’t it be nice if I could just seek() to the relevant position?  In any of the species coordinates just to make it even nicer?

The bx-python module lets me extract sub-alignments, so

s = a.slice(begin,end)

extracts the sub-alignment of a beginning at column begin and ending at end-1.

(I can also get a sub-set of species, if you are wondering, but the method is slightly different for that).

I would love to be able to get a sub-alignment based on species coordinates instead of column numbers, so

s = a.slice(begin,end,coordinates='hg18')

would give me the alignment that maches the region [begin,end[ in humans (hg18 assembly).  With a proper data representation, it should also give me this without scanning linearly through the file…

Does anyone know about a multiple alignment representation that lets me look up slices in files without scanning through all the alignment leading up to that location?

Surely genome browsers do not scan linearly through genomes to display their information — it would never be fast enough…

What do people use?

A short introduction to the Human Genome

Tuesday, October 7th, 2008

At BiRC we have a small study group where we are reading Michael Lynch’s book, The Origins of Genome Architecture.  We take turns presenting a chapter, and last time it was my turn and the chaper (chapter 3) is on the Human Genome.

Below are my notes.  I’ve tried to translate basepairs and percentages into meters, ’cause it helps me visualise the relative sizes.  Even though it still boggles my mind a bit that the human genome is about 2m if you stretch it out.  Well, that can’t be helped.  Three billion is a lot, even if it is three billion very tiny things…

The human genome

The human genome is about 3Gbp which is about 2m if you stretch it out.  Of this, about 1%, or 2cm, is coding. That doesn’t mean that only 1% is genes, ’cause most of genes (even protein coding genes) are not actually coding.  We’ll get to that later.

There is nothing unusual about the human genome.  Not compared to other multicelular eukariotes, at least.  It is not the largest or the smallest and it does not have an unual number of genes.  In fact, it is common as mud.

The genes are grouped into families, where the families are phylgenetically related.  Some people have argued that the number of genes in a family tells you something about the importance of the family, but the distribution of family sizes could just as easily be explained by a stochastic birth/death process, at least if the gene families have different birth/death ratios.  (This begs the question of why they should have that, but the chapter doesn’t say and I don’t know).

Introns and exons

The protein coding genes consist of introns and exons.  The exons are what is left after the transcript is spliced and the introns is what is thrown out.

A personal comment: remember that exon is not synonymous with coding.  I’ve seen people confused about this (and made the mistake myself in a script or two).  The coding sequence starts somewhere along the sequence of exons and stops somewhere before the exons end. The coding sequence is a sub-sequence of the exons.

Back to the chapter…

The average exon length is about 0.15kbp while the average intron length is about 4.66kbp. So introns are on average about 40 times as long as exons.  If we imagine (not to scale this time) that the exons are 1 cm long, then we have 40 cm of introns between them.

That is a lot of intron…

Since complex organisms doesn’t seem to have dramatically more coding genes than simpler organisms, alternative splicing has been suggested to explain the complexity.

On average we have 2.6 different splice products per gene.  As far as we know, there is less alternative splicing in C. elegans and D. melanogaster, but then we know a lot more about alternative splicing in humans and mice that we do in any other organism, so we are boundt to have seen more rare splice variants here and the extra splicing we are seeing might easily be a selection bias.

We don’t really know how much of the alternative splicing is functionally important and how much is random “noise”.

Regulatory DNA

How much regulatory DNA do we have?  Based on conserved regions in the genome (which is probably a very conservative estimate) we have a few percent of the genome.

It seems, however, that the fraction of the genome that is conserved increases with organism complexity, so perhaps complexity is all in regulation?

Everything and the kitchen sink seems to be transcribed, and most of it differentially expressed, but how much of this is spurious we do not know.

We do know about different non-coding RNA genes, such as miRNA.  Although miRNA is only found in multi-cellular organisms, RNA interference is found in all domains of life.

Even accounting for regulatory DNA, 95% of the genome — 1.9 m — has no function that we know of.  Know of being the key word here, of course.  Don’t call it junk before you know that it is really doing nothing…

Mobile elements

There are about 100 times as many mobile elements as there are coding genes, and 75% — 1.5 meter — of the genome is a product of some mobile element or other.

(I have a lot of notes about the different kinds, but quite frankly I find it a bit dull, so I am not going to mention it here…)

Pseudogenes

There are about half as many pseudogenes as coding genes.

There are various mechanisms for introducing pseudogenes in the genome, including re-insertion of cDNA, tandem duplications and just plain inactivation of a gene.

The first two cases are likely to introduce pseudogenes “dead on arrival”.  A re-inserted cDNA won’t have the regulatory mechanisms to get transcribed, and a duplication is likely to disrupt it as well.

Pseudogenes seem to have a halflife of ~37MYA (halflife of the time it takes till we don’t recognize it as a copy of another gene) compared to a halflife of ~7.5MYA for active duplicated genes.  This seems to indicate that there isn’t much selection working against dead-on-arrival genes compared to duplicating an active gene and thus potentially doubling its product.

Human evolution

The chapter closes with an analysis of the human lineage and our evolution.  (Can we finally find something that makes us special?)

It seems that we have seen an increase in substitution rate on our line, but Lynch argues that what we are seeing is not so much adaptive selection but rather a reduced negative selection.

There can be different explanations for this.  A reduced population size can have reduced the effects of selection. A change in behaviour could have changed our fitness landscape, so traits that would normally be selected strongly against changing are now free to vary.