Debugging fun
Saturday, May 10th, 2008Yesterday I got a simulator from Garrett Hellenthal that I need to simulate some genetics data (combined "panel" and "tagSNP" data, if you want to know).
The program didn't work for me. On the example data provided with the code, it complained about invalid input. Right! Naturally, I expected that Garrett had given me test data that was in some way outdated. It happens quie often that the test data isn't updated when new features are added to a program, and that test examples no longer function the way they should. I make that mistake a lot, anyway.
So I got hold of Garrett, and yes the test example was incorrect and he sent me an updated test. It didn't work either. We emailed back and forth a bit, and this time it was a bit of a puzzle. Everything worked fine at Garrett's end, but the program just wouldn't eat the input on my computer. I didn't have the time to figure out why, yesterday, as I had a plane to catch, but I had another look at it today.
I first grep'ed for the error message I got: "CV frequencies must be between 0 and 0.50". That line occurred a few places in the code, so I annotated the code with some output before each and that pinpointed this test as the problem:
if ((allelefreq[count] < epsilon[count]) || (allelefreq[count] > (0.50-epsilon[count]))) {
printf("CV frequencies must be between 0 and 0.50\n");
exit(1);
}
Ok, so on Garrett's machine, the test was false and on mine it was true, but the input files were the same. First I tested if the input was read in correctly (Garrett's IO code is a bit of a mess, excuse me for saying it). allelefreq[count] was supposed to be 0.4 -- and it was -- and epsilon[count] should be 0.1 -- and it was.
That's when I spotted the problem. Do you see it?
Clearly 0.4 ≥ 0.1 and 0.4 ≤ 0.5-0.1 so the test should evaluate to false, but it doesn't.
The thing is, floating point numbers aren't real numbers (as you probably know). And for one thing, 0.1 cannot be represented in a finite number of bits (as a binary number) so it is approximated in the computer. So 0.5-0.1 is not 0.4 (although they could be represented by the same bit-patter as 0.4 is represented as). Whether 0.4 ≤ 0.5-0.1 or 0.4 > 0.5-0.1 on the computer depends on the way the numbers are represented.
Garrett was running his program on a 64 bit architecture (where the numbers were represented using 128 bits) and I on a 32 bit architecture (with 64 bits per number). That made all the difference.
Fun, eh?