Be careful with your types!
Sunday, September 21st, 2008I’ve spent the last two hours debugging scripts only to find that the error wasn’t in the scripts but in my analysis of the result…
I’m scanning a genome alignment for informative indels (indels where exactly two of four share a gap that starts and stops at the same position). My scripts find the position of each of those and outputs it together with the two species having the gap: HC for human and chimp sharing the gap, HO for human and orangutan sharing the gap, HM for human and macaque sharing the gap, etc.
Now, in most of my analysis of this I do not want to distingush between which pair has the gap and which does not, I am only interested in the quartet topology (HC|OM vs. HO|CM for example). So I want to re-map the pairs CM to HO, CO to HM and OM to HC.
I do my analysis in R, and this is how I did the re-mapping:
data$pair <- factor(sapply(data$pair,
switch, HC="HC", HO="HO", HM="HM", CM="HO",CO="HM",OM="HC"))
The result is not what you’d expect:
> table(data$pair) CM CO HC HM HO OM 705 336 30377 646 349 13089 > data$pair <- factor(sapply(data$pair, + switch, HC="HC", HO="HO", HM="HM", CM="HO",CO="HM",OM="HC")) > table(data$pair) HC HM HO 13794 30726 982
There is clearly something wrong in the mapping. The total number is correct, but HC now is not the sum of the earlier HC and CM!
This is how I should have done it:
data$pair <- factor(sapply(as.character(data$pair),
switch, HC="HC", HO="HO", HM="HM", CM="HO",CO="HM",OM="HC"))
Do you notice the difference?
The type of data$pair is factor so it is encoded as levels (numbers 1 to 6). The switch function uses these leves as if they were integers, and use them to index into the list HC=”HC”, …, OM=”HC”.
If data$pair contained strings then switch would match them against the names in that list, but when it is integers it doesn’t.
The type really matters here.
> data$pair <- factor(sapply(as.character(data$pair), + switch, HC="HC", HO="HO", HM="HM", CM="HO",CO="HM",OM="HC")) > table(data$pair) HC HM HO 43466 982 1054


