Mouse genome completed (again?)

When I looked up some genome sequencing statistics a little while back, I was surprised to learn that only the human and the mouse genomes are considered “completed”.  Maybe the mouse genome wasn’t after all, ’cause this month there’s a PLoS Biology paper on the finished genome.

Church DM, Goodstadt L, Hillier LW, Zody MC, Goldstein S, et al. (2009) Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse. PLoS Biol 7(5): e1000112. doi:10.1371/journal.pbio.1000112

Oh well, the distinction between a draft and a complete genome sequence is a bit artificial anyway; it is really only a matter of degree.  The “complete” human genome sequence still has a lot of gaps and is still being updated (UCSC’s genome browser updated hg18 to hg19 in February this year).  And a genome sequence is really only a reference sequence; a lot of the interesting stuff is in genomic variation, so is a genome really “complete” before we have at least a map of the common variants?


Numerical problems with hidden Markov models

Ok, so the problem I was working on a few days ago, where I needed interval arithmetic, concerns hidden Markov models.  We use these for genome analysis because they are pretty fast and thus scale to analysing whole genomes.  We still need to split the genomes into segments to parallelise the analysis, but even so it is pretty fast; at least a lot faster than MCMC methods and the like.

Anyway, when we split the genomes into segments there is a tradeoff.  The larger the segments, the more data we have for each analysis job, but of course each job also takes longer to run.

That is not the only tradeoff, though.  There is also a numerical issue: yes, the longer the segments, the more observations we have, but the HMM algorithms also have to do more arithmetic operations on the data, and the numerical errors tend to accumulate.

I had no idea how fast this happens, so I set out to find out: I implemented the Forward algorithm with a parameterised number type, so I could chuck in Boost’s interval type (with floats, doubles or long doubles as the underlying type).

I use the “scaling” method for dealing with underflow (if you implement the Forward algorithm directly “from the book” you get an underflow from multiplying numbers between 0 and 1 pretty damn fast).  That means that for each column in the forward table, I calculate the column sum, call that sum the “scale” for the column, and then normalise all the values in the column so it sums to 1.
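
Here is a minimal Python sketch of the same computation, with a toy interval class standing in for Boost’s interval type.  It fakes outward rounding by pushing each endpoint one ulp outwards, so the widths it reports are only roughly comparable to what a proper interval library gives you, and all the names are made up for illustration:

```python
import math
import random


class Interval:
    """Toy stand-in for Boost's interval type: a closed interval [lo, hi]
    guaranteed to contain the exact value.  Outward rounding is faked by
    pushing each endpoint one ulp outwards (math.nextafter, Python >= 3.9);
    Boost does this properly with directed rounding modes."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def _outward(lo, hi):
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._outward(self.lo + other.lo, self.hi + other.hi)

    def __radd__(self, other):                 # so that sum() works
        return Interval(other, other) + self

    def __mul__(self, other):
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval._outward(min(ps), max(ps))

    def __truediv__(self, other):              # divisor must not straddle 0
        qs = [self.lo / other.lo, self.lo / other.hi,
              self.hi / other.lo, self.hi / other.hi]
        return Interval._outward(min(qs), max(qs))

    def width(self):
        return self.hi - self.lo


def forward_scaled(init, trans, emit, obs):
    """Forward algorithm with per-column scaling, run on Intervals.

    init[i]     -- initial probability of state i
    trans[i][j] -- transition probability from state i to state j
    emit[i][x]  -- probability of emitting symbol x in state i
    obs         -- the observed sequence, as symbol indices

    Returns the per-column scaling factors, whose widths we can inspect.
    """
    init = [Interval(p) for p in init]
    trans = [[Interval(p) for p in row] for row in trans]
    emit = [[Interval(p) for p in row] for row in emit]

    n = len(init)
    col = [init[i] * emit[i][obs[0]] for i in range(n)]
    scales = []
    for x in obs[1:]:
        scale = sum(col)                   # the "scale" for this column
        scales.append(scale)
        col = [v / scale for v in col]     # normalise the column to sum to 1
        col = [emit[j][x] * sum(col[i] * trans[i][j] for i in range(n))
               for j in range(n)]
    scales.append(sum(col))                # scale for the final column
    return scales


# The experiment described below: a random 5-state HMM, 4-letter alphabet.
def random_stochastic(rows, cols):
    m = [[random.random() for _ in range(cols)] for _ in range(rows)]
    return [[v / sum(row) for v in row] for row in m]


init = random_stochastic(1, 5)[0]
trans = random_stochastic(5, 5)
emit = random_stochastic(5, 4)
obs = [random.randrange(4) for _ in range(200)]

for i, scale in enumerate(forward_scaled(init, trans, emit, obs)):
    print(i, scale.width())
```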

As far as I understand, that approach is more numerically stable than working in log space, where you have to use various tricks to handle the sums.  It is certainly a lot faster than translating numbers to and from log space all the time.
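
The trick for the sums is the usual log-sum-exp manoeuvre; a quick sketch of it, just to show where the extra exp() and log() calls come from:

```python
import math

def log_sum_exp(log_ps):
    """log(sum(exp(x) for x in log_ps)), computed without underflowing.

    Factoring out the largest term keeps the exponentials in a sane range,
    but every sum of probabilities now costs a round of exp() and log() calls.
    """
    m = max(log_ps)
    if m == float("-inf"):          # all the probabilities are zero
        return m
    return m + math.log(sum(math.exp(x - m) for x in log_ps))
```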

Anyway, I ran the forward algorithm on a 5-state HMM with an alphabet size of 4, with random transition and emission matrices.  For each column in the table I can then look at the width of the intervals in the interval representation of my numbers.  Each interval is guaranteed to contain the true value, the value we would get if we had infinite-precision floats to work with.  The wider the intervals, the less certain we are about the true value if we just used plain floating point arithmetic.

First, I looked at the “scale” values.  The results are shown below (notice that the y-axis is in log scale):

Initially the intervals are small, but then they just explode.  Not good for trusting the results; if the interval of possible values spans orders of magnitude, the true number can be just about anything! Depending on the underlying floating point type, the intervals explode earlier or later, but even for long doubles the explosion happens before 100 columns.  So the numerical accuracy can only be trusted for sequences shorter than 100!

Damn!

I also looked at the smallest and largest interval in each column.  The plot is below (smallest intervals in green, largest in red):

We see the explosion again, which makes it pretty hard to distinguish between the smallest and largest intervals, so I have also plotted the interval widths for the prefix of columns before the explosion:

Before the explosion, it seems the numerical error accumulates linearly (as we would probably expect).  There doesn’t seem to be much difference between the widest and the smallest intervals, though.

I must admit I expected something like the last plot for the entire range.  Of course, I know that the accuracy is only good up to a point, because at some point you start adding numbers of very different magnitudes, and then the small (between 0 and 1) contributions are simply swallowed by the large ones, but still…

I didn’t expect the explosion in inaccuracy that early, though.  Being limited to sequences shorter than 100 is just not feasible, and certainly not what anyone is doing.

Of course, the intervals here are worst-case behaviour.  They are inclusive, so they capture every value the true result could possibly be.  They say nothing about how close the values we get from working with finite-precision floating point actually are to the true real numbers.  That is probably what saves us in practice.

Still, I think I need to look into this in more detail.  Who knows, there might even be a paper in this study somewhere…


Last week in the blogs

Another week, another list of favorite blog posts…

Fun

Genetics

Genome analysis

High performance computing

Programming

Reviewing


Building a database of CoalHMM results

In our CoalHMM work, we analyse whole genome alignments, and for each column in the alignment we essentially get the probability of each of the possible gene trees.  We want to look at how various genomic features (like gene density, conserved regions, etc.) correlate with the gene tree distribution.

To do this, we are building a MySQL database, so we can easily pick out the results for the alignment columns with various features.

One problem here is that we easily have 1–2 billion columns, and each column corresponds to different positions in the genomes we consider.  Plus, the columns are not contiguous in the genomes, even if they are in the alignment, because of various structural changes.

Since the genomic features are given in genome coordinates (a chromosome plus a position on that chromosome), we need a mapping from genomic coordinates to alignment columns, with billions of rows in both coordinate systems.

So we have a table with the probabilities for each column in the alignment, plus a table for each species mapping genomic coordinates to alignment columns.  The trick is now to have fast queries on the genomic coordinates to get the alignment columns.
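
Roughly the layout I have in mind, written out as the SQL a setup script would send off; the table and column names (and the MySQLdb driver) are just for illustration, and the last statement is the kind of range query we need to make fast:

```python
import MySQLdb  # any DB-API driver would do; MySQLdb is just for illustration

conn = MySQLdb.connect(db="coalhmm")
cur = conn.cursor()

# One row per alignment column, holding the gene tree probabilities.
cur.execute("""
    CREATE TABLE tree_probs (
        col    INT UNSIGNED NOT NULL PRIMARY KEY,
        tree1  DOUBLE NOT NULL,
        tree2  DOUBLE NOT NULL,
        tree3  DOUBLE NOT NULL
    )""")

# One mapping table per species: genomic coordinate -> alignment column.
cur.execute("""
    CREATE TABLE human_coords (
        chrom  INT UNSIGNED NOT NULL,
        pos    INT UNSIGNED NOT NULL,
        col    INT UNSIGNED NOT NULL
    )""")

# The query we need to be fast: all the probabilities for the alignment
# columns that a 1K genomic chunk maps to.
cur.execute("""
    SELECT p.* FROM human_coords c JOIN tree_probs p ON p.col = c.col
    WHERE c.chrom = %s AND c.pos BETWEEN %s AND %s""",
    (1, 1000000, 1001000))
```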

I’ve been playing around with this this afternoon.

Query times without indexing

With just plain tables, with no indexes on the columns, we get the worst-case behaviour.  Before trying any optimisations, I just wanted to figure out how slow the queries would be.

So I tried out different alignment sizes, varying from 100K to 1M, distributed over chromosomes (of equal size) varying in size from 10K to 100K.

I then timed the queries for random chunks of 1K contiguous positions.  The results are plotted below:

The time doesn’t seem to depend much on the chromosome sizes, but depends significantly on the alignment size.

To get a feeling for the query time if I moved to a 2G alignment, I simply fitted a line to the query times (the dashed line in the top plot) and extrapolated from that.  The predicted time came out as ~1400 seconds, which is clearly too slow for any serious analysis.
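
The extrapolation itself is nothing fancier than a straight-line fit evaluated at the full alignment size; something along these lines (numpy’s polyfit here, but any regression would do):

```python
import numpy as np

def extrapolate_query_time(alignment_sizes, query_times, target_size=2e9):
    """Fit query time as a linear function of alignment size and evaluate
    the fitted line at the target size (2 Gbp in our case)."""
    slope, intercept = np.polyfit(alignment_sizes, query_times, 1)
    return slope * target_size + intercept
```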

Query times with simple indexing

Clearly some indexing of columns is necessary, so I tried out the simple solution of putting an index on the chromosomes and on the chromosomal positions.
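
In SQL terms this step amounts to nothing more than the following (with the made-up table and column names from the sketch above):

```python
import MySQLdb

cur = MySQLdb.connect(db="coalhmm").cursor()

# Ordinary B-tree indexes on the chromosome and position columns.
cur.execute("CREATE INDEX chrom_idx ON human_coords (chrom)")
cur.execute("CREATE INDEX pos_idx   ON human_coords (pos)")
```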

Results below:

This, not so surprisingly, improved the query time a lot, but extrapolating from the regression line still gives me query times of ~12 seconds if the alignment is 2Gbp.

Query times with spatial queries

So I pulled out the old chestnut of spatial queries, and represented the genomic coordinates as two-dimensional points (the chromosome on one axis and the chromosomal position on the other).
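
Concretely, that means storing each coordinate as a POINT with a spatial index on it.  A sketch of the idea, assuming MySQL’s spatial extensions (which, as far as I know, want a MyISAM table and a NOT NULL geometry column) and the same made-up names as before:

```python
import MySQLdb

cur = MySQLdb.connect(db="coalhmm").cursor()

# Each (chromosome, position) pair becomes a 2D point with a spatial index.
cur.execute("""
    CREATE TABLE human_coords_spatial (
        coord  POINT NOT NULL,
        col    INT UNSIGNED NOT NULL,
        SPATIAL INDEX (coord)
    ) ENGINE=MyISAM""")

# Points go in as WKT strings of the form 'POINT(chromosome position)'.
cur.execute(
    "INSERT INTO human_coords_spatial VALUES (GeomFromText(%s), %s)",
    ("POINT(1 1000042)", 4711))
```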

This time, the results came out as:

The regression line this time shows no significant dependency on the alignment size (P=0.6), so if this is true I can extrapolate the query time for the 2Gbp alignment just from the mean query time of the smaller alignments.  Even if there is some dependency, I can still be lucky and get query times of fractions of a second.

Sweet.

Remaining problems

The only annoying thing with this solution is that I have to construct geometric structures as strings in scripts, like this:
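
Something along these lines, anyway; the table names are the made-up ones from the sketches above and the details will differ, but this is the shape of it:

```python
import MySQLdb

cur = MySQLdb.connect(db="coalhmm").cursor()

def columns_in_region(chrom, start, end):
    # The query region has to be turned into a WKT polygon string by the
    # client script before it can go into the query.  The 0.5 padding is just
    # so we need not worry about points landing exactly on the border.
    x0, x1 = chrom - 0.5, chrom + 0.5
    y0, y1 = start - 0.5, end + 0.5
    polygon = ("Polygon((%.1f %.1f, %.1f %.1f, %.1f %.1f, %.1f %.1f, %.1f %.1f))"
               % (x0, y0, x1, y0, x1, y1, x0, y1, x0, y0))
    cur.execute("""
        SELECT p.* FROM human_coords_spatial c
        JOIN tree_probs p ON p.col = c.col
        WHERE MBRContains(GeomFromText(%s), c.coord)""",
        (polygon,))
    return cur.fetchall()
```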

This means that some of the logic of the database is not actually stored in the database, but in the scripts accessing it.  That is especially annoying since I plan to access it from both Python and R (and possibly C++, since I’m thinking about writing a Qt application for interactively viewing the results).

That means a lot of duplicated code, which is never a good idea, and is bound to bite me in the bum at some later time.

Is this something that can be done with MySQL stored procedures?  I have never used them, and I am really a novice when it comes to databases, so I honestly have no idea…
