Maybe I’m not crazy after all…

This evening I was reading Pattern Recognition and Machine Learning, the book we use in our machine learning class.  We only use the first half of the book, but we are thinking about extending the class to two terms and covering the entire book (or most of it, anyway), so I figured this was a good excuse to actually read the whole thing.  So far, I’ve only read the chapters we actually use, plus a few pages here and there.

Anyway, I was reading chapter 6, on kernel methods, but I got stuck on the first figure.

It is supposed to illustrate kernel functions k(x,x’) built as sums of products of feature functions: k(x,x’) = Σ_i φ_i(x) φ_i(x’). The top row shows the feature functions, φ_i(x), and the bottom row the kernel function, as a function of x with x’ fixed at 0.

That doesn’t make any sense at all to me.

On the left-most figure, the feature functions are all 0 for x’=0, so the kernel function is a sum of zeroes.  It should be constant zero, not the curvy blue line.

For the other two, the feature functions are all non-negative, so how can the kernel function ever be negative?  A product of non-negative values cannot be negative, and neither can the sum of non-negative numbers.
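The argument above is easy to check numerically. Here is a minimal sketch in R; the feature functions are my own stand-ins (three monomials and a handful of Gaussian bumps), not the exact ones from the book’s figure, and kernel_from_features is a hypothetical helper name.

```r
# Build k(x, x') = sum_i phi_i(x) * phi_i(x') from a list of feature functions.
kernel_from_features <- function(features, x, x_prime) {
  # Rows index the points in x, columns index the feature functions
  phi_x <- matrix(sapply(features, function(phi) phi(x)), nrow = length(x))
  # Feature values at the single fixed point x'
  phi_xp <- sapply(features, function(phi) phi(x_prime))
  as.vector(phi_x %*% phi_xp)
}

# Hypothetical feature sets, chosen for illustration only
monomials <- list(function(x) x, function(x) x^2, function(x) x^3)
gaussians <- lapply(seq(-1, 1, by = 0.5), function(mu) {
  force(mu)  # fix the centre for this closure
  function(x) exp(-(x - mu)^2 / (2 * 0.2^2))
})

x <- seq(-1, 1, length.out = 201)
k_poly  <- kernel_from_features(monomials, x, 0)
k_gauss <- kernel_from_features(gaussians, x, 0)

all(k_poly == 0)   # every monomial is 0 at x' = 0, so the kernel is identically 0
all(k_gauss >= 0)  # non-negative features cannot give a negative kernel
```

Both checks should come out TRUE, which is exactly why the curves in the printed figure cannot be right.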

In short, the figure is all wrong.  There isn’t a single thing right about it.

That was my reasoning, in any case, but I wasn’t completely sure.  I could be missing something.

So I googled for the book, but then I found PowerPoint presentations including the figure, with no mention of any errors.  Clearly someone was using the figure in their teaching, so maybe it wasn’t wrong after all.

It got me nervous.  I feel that I really need to understand something to teach it, so I expect other people to feel the same way, and someone had used this figure.

I am not mentioning names here, ’cause as you have probably guessed the figure is wrong.  There is nothing wrong with my reasoning above.

Well, another minute’s Googling found me the errata list, and sure enough, the figure is fixed there.

I’m happy to find that I hadn’t completely misunderstood the topic and that I was right about the figure.

I am a little disappointed that a teacher would use the figure without at least checking that the figure actually makes sense.  Showing an example that makes no sense at all is doing a lot of harm to the students…

Be careful with your types!

I’ve spent the last two hours debugging scripts only to find that the error wasn’t in the scripts but in my analysis of the result…

I’m scanning a genome alignment for informative indels (indels where exactly two of the four species share a gap that starts and stops at the same position).  My scripts find the position of each of those and output it together with the two species having the gap: HC for human and chimp sharing the gap, HO for human and orangutan sharing the gap, HM for human and macaque sharing the gap, etc.

Now, in most of my analysis of this I do not want to distinguish between which pair has the gap and which does not; I am only interested in the quartet topology (HC|OM vs. HO|CM for example). So I want to re-map the pairs CM to HO, CO to HM and OM to HC.

I do my analysis in R, and this is how I did the re-mapping:

data$pair <- factor(sapply(data$pair,
                           switch, HC="HC", HO="HO", HM="HM", CM="HO",CO="HM",OM="HC"))

The result is not what you’d expect:

> table(data$pair)

   CM    CO    HC    HM    HO    OM
  705   336 30377   646   349 13089
> data$pair <- factor(sapply(data$pair,
+                            switch, HC="HC", HO="HO", HM="HM", CM="HO",CO="HM",OM="HC"))
> table(data$pair)

   HC    HM    HO
13794 30726   982

There is clearly something wrong in the mapping.  The total number is correct, but HC now is not the sum of the earlier HC and OM!

This is how I should have done it:

data$pair <- factor(sapply(as.character(data$pair),
                           switch, HC="HC", HO="HO", HM="HM", CM="HO",CO="HM",OM="HC"))

Do you notice the difference?

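The type of data$pair is factor, so it is stored as integer codes (1 to 6) that index its levels.  The switch function treats these codes as plain integers and uses them to pick an alternative by position in the list HC="HC", …, OM="HC".  The levels are sorted alphabetically (CM, CO, HC, HM, HO, OM), so CM has code 1 and picks the first alternative, "HC", rather than the one named CM.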

If data$pair contained strings, switch would match them against the names in that list, but when it gets integers it ignores the names and selects by position.
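The positional behaviour is easy to reproduce on a toy factor.  This is just a sketch: pair, remap and counts are names I made up for the illustration, and the counts are the ones from the first table above.  Recent versions of R warn when switch is handed a factor, hence the suppressWarnings.

```r
# Same re-mapping as above, wrapped in a helper (hypothetical name)
remap <- function(x) {
  sapply(x, switch,
         HC = "HC", HO = "HO", HM = "HM",
         CM = "HO", CO = "HM", OM = "HC",
         USE.NAMES = FALSE)
}

# One observation per pair; levels sort as CM, CO, HC, HM, HO, OM
pair <- factor(c("CM", "CO", "HC", "HM", "HO", "OM"))

as.integer(pair)                         # the codes switch actually sees
wrong <- suppressWarnings(remap(pair))   # alternatives picked by position
right <- remap(as.character(pair))       # alternatives picked by name

# Counts from the original table, in level order
counts <- c(CM = 705, CO = 336, HC = 30377, HM = 646, HO = 349, OM = 13089)
wrong_totals <- tapply(counts, wrong, sum)
wrong_totals  # HC 13794, HM 30726, HO 982: the wrong table from above
```

So the broken HC count is really CM plus OM, because those are the levels with codes 1 and 6, and alternatives 1 and 6 both happen to be "HC".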

The type really matters here.

> data$pair <- factor(sapply(as.character(data$pair),
+                            switch, HC="HC", HO="HO", HM="HM", CM="HO",CO="HM",OM="HC"))
> table(data$pair)

   HC    HM    HO
43466   982  1054
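One way to sidestep the problem entirely is to rename the factor’s levels instead of pushing every element through switch.  The sketch below assumes the same six pair labels; lookup, pair and counts are names made up for the illustration.  Assigning duplicated names to levels() merges the corresponding levels, and the lookup is done on the level names (strings), so the integer codes never get a chance to be misread.

```r
# Named lookup vector: old pair label -> quartet topology label
lookup <- c(HC = "HC", HO = "HO", HM = "HM",
            CM = "HO", CO = "HM", OM = "HC")

# Toy factor with one observation per pair
pair <- factor(c("CM", "CO", "HC", "HM", "HO", "OM"))

# Rename the levels by name; duplicated new names merge levels
levels(pair) <- unname(lookup[levels(pair)])
as.character(pair)  # "HO" "HM" "HC" "HM" "HO" "HC"

# Sanity check against the counts from the first table in this post
counts <- c(CM = 705, CO = 336, HC = 30377, HM = 646, HO = 349, OM = 13089)
totals <- tapply(counts, lookup[names(counts)], sum)
totals  # HC 43466, HM 982, HO 1054
```

Note that indexing lookup directly with a factor would hit the same trap as switch, since a factor used as an index is also treated as its integer codes; going through levels() or as.character() is what makes this safe.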