Debugging fun

Yesterday I got a simulator from Garrett Hellenthal that I need to simulate some genetics data (combined “panel” and “tagSNP” data, if you want to know).

The program didn’t work for me. On the example data provided with the code, it complained about invalid input. Right! Naturally, I expected that Garrett had given me test data that was in some way outdated. It happens quie often that the test data isn’t updated when new features are added to a program, and that test examples no longer function the way they should. I make that mistake a lot, anyway.

So I got hold of Garrett, and yes the test example was incorrect and he sent me an updated test. It didn’t work either. We emailed back and forth a bit, and this time it was a bit of a puzzle. Everything worked fine at Garrett’s end, but the program just wouldn’t eat the input on my computer. I didn’t have the time to figure out why, yesterday, as I had a plane to catch, but I had another look at it today.

I first grep’ed for the error message I got: “CV frequencies must be between 0 and 0.50″. That line occurred a few places in the code, so I annotated the code with some output before each and that pinpointed this test as the problem:

if ((allelefreq[count] < epsilon[count]) || (allelefreq[count] > (0.50-epsilon[count]))) {
   printf("CV frequencies must be between 0 and 0.50\n");
   exit(1);
 }

Ok, so on Garrett’s machine, the test was false and on mine it was true, but the input files were the same. First I tested if the input was read in correctly (Garrett’s IO code is a bit of a mess, excuse me for saying it). allelefreq[count] was supposed to be 0.4 — and it was — and epsilon[count] should be 0.1 — and it was.

That’s when I spotted the problem. Do you see it?

Clearly 0.4 ≥ 0.1 and 0.4 ≤ 0.5-0.1 so the test should evaluate to false, but it doesn’t.

The thing is, floating point numbers aren’t real numbers (as you probably know). And for one thing, 0.1 cannot be represented in a finite number of bits (as a binary number) so it is approximated in the computer. So 0.5-0.1 is not 0.4 (although they could be represented by the same bit-patter as 0.4 is represented as). Whether 0.4 ≤ 0.5-0.1 or 0.4 > 0.5-0.1 on the computer depends on the way the numbers are represented.

Garrett was running his program on a 64 bit architecture (where the numbers were represented using 128 bits) and I on a 32 bit architecture (with 64 bits per number). That made all the difference.

Fun, eh?

Tags: , ,

8 Responses to “Debugging fun”

  1. Reed A. Cartwright Says:

    I had a very strange bug in one of my programs recently. After a couple of simple code changes GSL started to fail when I tried to generate a multinomial. After a while, I eventually determined that my rand number object was being cleared earlier in the code. I traced it to an unrelated vector::push_back command, which had been fine before. I then went back to commenting out several recent changes and found that a “generate” command was doing “something” so that when I pushed the generated object onto the vector, it erased what I thought was an unrelated random number object.

    As soon as I changed the generate line a bit, everything was fine. It’s the weirdest bug I’ve ever had.

  2. Thomas Mailund Says:

    What was it, Reed, do you think? Some trashing of uninitialized memory or such? I have struggled with such problems countless of times… valgrind usually is a life saver, but you don’t want to run a long MCMC in valgrind, ’cause that will never finish.

    As for GSL, I had a problem with that yesterday, but that was (of course) my own fault. I had a global variable for the random number generator. I usually keep that global, but probably should use a Singleton pattern instead. Anyway, the global variable is one of the last thing I initialise before I run my main code. I was re-factoring HapCluster yesterday to move a lot of the logic into two classes (a Chain and a State class for the MCMC) to make it easier to customise the chain to the different updates needed for different kinds of input data. Now the problem was that I need to initialise these two classes early in the setup code and State objects need random numbers in the initialisation. Of course, before I had initialised the random number generator. It took me a little while to figure out I was using uninitialised memory to generate random numbers…

  3. Mikaelhc Says:

    Hi,

    Just out of curiosity: On which platform (and using which C++ type) did you find 128-bit floats? Isn’t this quite exotic? (I believe most (all?) 64-bit platforms still use 8 bytes for doubles)

  4. Reed A. Cartwright Says:

    I think generate was making a temporary copy of my global random number object and when the temp copy of the object was destroyed (in connection with the push_back) command, the GSL rng that both wrapped was freed. If I’d written a proper copy constructor/operator it probably wouldn’t have happened.

    I solved the issue by passing a functional wrapper object to generate instead of the actual rng.

  5. Thomas Mailund Says:

    Michael: Actually, I was just guessing about the size of doubles. If they are not of different size than on my box, I don’t know why it worked for Garrett and not me

  6. Mikaelhc Says:

    It is probably related to the internal rounding when doing floating point math.

    Since 0.5 may be exact represented, and both 0.1 and 0.4 have simple repeating bit patterns (e.g. 0.4 decimal = 0.0110011001100…. and 0.1 is the same shifted two positions) it should not matter at all how many bits you use to represent them. But the rounding will be important (it is stated in ‘float.h’).

    Actually since Intel FPU’s all use an internal 80-bit float representation, the compiler may choose to keep some numbers in registers at a higher resolution until the result is written back to memory – further complicating matters.

    But anyway, consider adding a threshold when comparing floats :-)

  7. Thomas Mailund Says:

    Michael, I never compare floats without a threshold, but then, I rarely compare them…

    Thanks for the info on Intel, though. Does it explain why it worked for Garrett and not me, though?

  8. Mikaelhc Says:

    Well, it is only a guess.

    There must be something different between yours and Garrett’s machine.

    If both your machines are using Intel CPU’s, it might be the case that the 64-bit machine defaults to using the fast SSE registers, which are 64-bit wide. This could be checked on a 32-bit machine/OS by compiling with “-mfpmath=sse” (for GCC) to see if the behavior changes. He might also have more aggressive optimization applied which turns off rounding (not very likely though).

    All guesses though, but let us know if you find out what causes the difference!

Leave a Reply