Archive for April, 2009

Last week in the blogs…

Tuesday, April 21st, 2009

Oh man, these regular summaries are slipping … it used to be Sunday, then it was Monday, now it is Tuesday.  I hope I can catch up with it next week, but the last week was just crazy with work, and it isn’t exactly slowing down this week…

Anyway, her goes.

Association mapping

Computing

Evolution

Genetics

Programming

Research Life

111-130=-19

Crazy busy week

Friday, April 17th, 2009

Oh man, this week has been too busy even for my taste.

I really was supposed to be working on a project that I am going to London to finish in a week, but a genome project got in the way.

We are involved in a few genome sequencing and analysis project at BiRC where we use our CoalHMM methodology to infer speciation times and ancestral effective population sizes.

The method uses changes in genealogy topology to infer this, but essentially ignores the variation in coalescence times along the genomes and just uses the mean sequence divergence.

Back when I was writing about how the coalescence process leads to such variation in divergence, see:

I also developed an approximation to the coalescence with recombination process that lets us estimate the distribution of coalescence times and fragment lengths.  I’ll write about how it works later, I promise, but it takes a bit explaining so I need to find a day where I’m not over worked, so it will be some upcoming vacation or something…

Anyway, because we can compute these distributions we can also construct a hidden Markov model that uses them in its transition matrix, and over Easter, Julien implemented this HMM.  Our initial simulation-based validation is very promising, so now we are analysing a few genomes.

The method is relatively fast, so we can analyse a genome in about a CPU week on our Linux grid.  Still, because we are adding a new method and completely new results very late in a project, we need to move fast to get the results into the paper.  Since I am away in the UK at the end of the month, and Julien is away for a week of well deserved holiday and then job hunting in France, we need some convinsing results this week already.  Thus a very busy week.

That is why the blog has been very quiet lately.  No excuse, of course, but an explanation.

Ok, off I go again.  I need to finish making slides for a talk I’m giving this afternoon (on genomic analysis of abes and the issues mentioned above…)

107-130=-23

Cool diff display

Monday, April 13th, 2009

I’m used to comparing different revisions of my source code through diff, but XCode has a really cool way of displaying the changes between two revisions.  Pick SCM > Compare With and you get something like this:

It is just the same info as in a diff file, of course, but I find it much easier to see the changes this way.

103-129=-26

Last week in the blogs

Monday, April 13th, 2009

Another week has passed, so here’s my list of my favorite blog posts in it:

Genetics and genomics

Research life

…don’t really know how to categorize this one, but it is definitely worth a read!

103-128=-25

Numerical problems with hidden Markov models

Sunday, April 12th, 2009

Ok, so the problem I was working on a few days ago, where I needed interval arithmetic, concerns hidden Markov models.  We use these for genome analysis because they are pretty fast and thus scale to analysing whole genomes.  We still need to split the genomes into segments to parallelise our analysis, but still, it is pretty fast; at least a lot faster than MCMC methods or such.

Anyway, when we split the genomes into segments there is a tradeoff.  The larger segments, the more data we have for each analysis job, but of course each job also takes longer to run.

It is not the only trade off, though.  There is also a numerical issue: yes, the longer the segments the more observations we have, but the HMM algorithms have to do more arithmetic operations on the data, and numerical errors tend to accumulate with those.

I have no idea about how fast this happens, but I set out to find out, so I implemented the Forward algorithm with a parameterised type, so I could chuck in Boost’s interval type (with either floats, doubles or long doubles as the underlying type).

I use the “scaling” method for dealing with underflows (if you implement the Forward algorithm directly “from the book” you get an underflow from multiplying numbers between 0 and 1 pretty damn fast).  That means that in each column in the forward table, I calculate the column sum, call that sum the “scale” for that column, and then normalise all the values in the column so the column sums to 1.

As far as I understand, that approach is more numerical stable than working in log space, where you have to use various tricks to handle sums.  It is certainly a lot faster than translating numbers to and from log space all the time.

Anyway, I ran the forward algorithm on a 5 state HMM with an alphabet size of 4, with a random transition and emission matrix.  For each column in the table I can then look at the width of the intervals in the interval representation of my numbers.  The intervals are the intervals where the true value, had we infinite size floats to work with, would be.  The wider the intervals, the less certain we are about the true type if we just used plain floating point arithmetic.

First, I looked at the “scale” values.  The results are show below (notice that the y-axis is in log scale):

Initially the intervals are small, but then they just explode.  Not good for trusting the results; if the interval of possible values ranges over orders of magnitude, the true number can be just about anything! Depending on the underlying floating point type, the intervals explode early or late, but even for long doubles the explosion happens before 100 columns.  So the numerical accuracy can only be trusted for sequences shorter than 100!

Damn!

I also looked at the smallest and largest interval in each column.  The plot is below (shortest intervals in green, largest in red):

We see the explosion again, so it is pretty hard to distinguish between the smallest and largest intervals, so I have also plotted the interval widths for the prefixes before the explosion:

Before the explosion, it seems the nummerical error accumulates linearly (as we probably would expect).  There doesn’t seem to be much difference between the widest and the smallest intervals, though.

I expected something like the last plot for the entire range, I must admit.  Of course, I know that the accuracy is only good up to a point, because at some point you start adding very large numbers to very small (between 0 and 1) numbers, so after that point you just stick to the large number, but still…

I didn’t expect the explosion in inaccuracy that early.  Dealing with sequences of length less than 100 is just not feasible, and certainly not what anyone is doing.

Of course, the intervals here are worst case behaviour.  They are inclusive, so they capture all possible values that the true value could be in.  They say nothing about how close the values we get from working with finite size floating point values are to the true real value numbers.  That is probably what saves us in practise.

Still, I think I need to look into this in more details.  Who knows, there might even be a paper in this study somewhere…

102-127=-25