Archive for the ‘Teaching’ Category

P-values again again

Friday, February 4th, 2011

For no apparent good reason, I read an old post on p-values and re-read this comment:

John Larkin Says: 

Hi.sorry. I have trouble with the “if you repeat experiment lots of times…p value…uniformly distributed between 0 and 1″.
Is that true? If you do it lots of times do you get as many grouped around 0.0-0.01 as around 0.49-0.50?
It may be because I’m thinking of “experiments” (e.g. height of groups)…vs some statistical scenario whish uses the word stochastic – which clearly puts me in trouble.
I don’t think of pvalue as direct measure of likelihood of nul hypothesis. But if you compared two big samples (huge!) from two big groups twice (say, of height) and each experiment gave you a p-value of 0.99….I just get the feeling that these two groups might be very similar/same population…..
Cheers
JL

My answer was this

Thomas Mailund Says: 

John: Yes, p-values are uniformly distributed (under the null distribution) so you do expect to observe as many in the interval 0.0-0.01 as in 0.49-0.5.

You cannot consider a p-value of 0.99 as any kind of measure of similarity. It just doesn’t work that way.

The reason we are interested in low p-values is because if we sample from a mixture of the null distribution and the alternative distribution, then we expect more of the alternatives in the low end of p-values than we expect from the null.

Hope that helps.

Now that I think about it, this isn’t strictly true.

I still hold that p-values are uniformly distributed under the null model. So under the null model, you cannot conclude that a high p-value indicates strong support for the null model whereas low p-values support the alternative model. It doesn’t work like that.

But of course, the null model can be wrong in more than one way, and not all will show up as low p-values.

If your null model tells you that there should be a certain variance, and you see less, then you will probably see an excess of high p-values. The observations are more similar than they should be (under the null model).

You won’t see the problem as too many low p-values, but as too many high values.

If the p-values are not uniformly distributed, your null model is wrong. It can be wrong in so many ways that it really doesn’t matter why it is wrong. It is just wrong.

Hope that makes sense.

TED: Teaching math

Thursday, February 3rd, 2011

A little while ago, I voiced my desperation about teaching stats to computer scientists. Or rather the problems of framing this so they would find it interesting.

I went with some problem descriptions and trying to reduce stats to “figuring out what is going on” rather than arithmetics. I don’t think I nailed it that well, but at least I tried.

This TED talk is quite relevant to this:

Math is not arithmetic, but ways of reasoning about the world in a quantitative way.

The introduction to Modern Heuristics, that describes that without a context getting you thinking about the right models, most people cannot solve rather elementary math problems. Not even math professors. Math is about formulating the models, identifying the problems, it is not about doing equations.

I wish I was better at teaching this.

The arithmetic (or in general most of the equation manipulation) is something computers do better. Let’s focus on modeling the world.

Call for help: Teaching statistics for Machine Learning

Thursday, January 20th, 2011

On Monday I start teaching my Machine Learning course again. I’m looking at the material for the first week right now, and I want to change it from last year.

Typically, my students will have had classes on mathematical modeling, a bit of probability theory and a bit of statistics, but experience tells me that they only have a very superficial knowledge about it. They don’t need much more for this class, but I still want to get some key points out regarding the statistics that we will be using in the class, and the last few years I don’t think I managed that well.

I don’t want to focus on modeling so much, and I certainly don’t want to discuss experiment design since the data we look at generally is just collected data that we need to make some kind of sense of, not collected to decide one theory against another.

It really is about a few points: Given the data and some generic model, say a neural network, why do we estimate the parameters in the way we do? What can we say about the accuracy of predictions? That kind of stuff.

I usually go a little bit into Bayesian statistics for model selection, but most of what they see in the class are different generic models that they estimate parameters for through maximum likelihood.

The thing is, while they generally remember how they estimate the parameters in different models when we get to the exam, they focus on the details of a particular model and rarely remember that they are essentially doing the same thing for all the models: maximizing a likelihood in a probabilistic model.

The first couple of years I taught this class, I definitely focused too much on the mathematical details in this. Going through derivations of the math, explaining how you got various posteriors from conjugate priors and such. Major fail.

I tried changing that last year, focusing more on examples, but it didn’t help much once we got to the exam.

Do any of you have experience with teaching statistics core concepts, preferably with some good examples? Care to share?

If you don’t teach this stuff, but have had classes like it, what worked for you as a student and what definitely didn’t work?

Search and replace on fonts in Keynote?

Thursday, May 6th, 2010

I converted some old slides from OpenOffice to Keynote for my lecture this morning.  I had to go through a PowerPoint format for this, since OpenOffice cannot export to Keynote, but Keynote can import PowerPoint, but that wasn’t so much of a problem.

However, the conversion messed up one of the fonts.  In particular, it replaced a mono-space font with a proportional font, which messed up my pseudo-code examples completely.

Changing the font to a mono-space one isn’t much of a problem, except that there are tens of pages where I had to do this.  Plus, I couldn’t simply change it on one page and copy it to the other, since I’m changing the colour of various parts of the text between the pages.  You can see the slides here if you are interested.

So I had to click all the text boxes one at a time, go to the font drop down, and select the new font.  Not only is this rather tedious but for someone with RSI like me it is actually physically painful…

Do any of you Mac folks out there know if there is a way to search and replace a font in Keynote?  If not directly, then how about through Automator or AppleScript?

Testing an O(n log n) running time

Tuesday, May 4th, 2010

This morning I noticed that this blog had been hit by this search term:

I’m not sure why it hit my blog since I have never written about that (as far as I recall), but I know that my current students have to worry about an O(n \log n) running time algorithm for their current project.

They have to implement an algorithm for finding tandem repeats, and the essential part of the algorithm is locating “branching” tandem repeats in time O(n \log n).  To validate that they get it right, I ask them to run some profiling and describe how they test this.  Perhaps it is my students googling to figure out how?

Well, depending on how you think about this, it is either very simple or very hard to validate an O(n \log n) running time.

What does the big-O notation really mean?

When we have an algorithm that runs in time O(n \log n), what we mean is that the worst case running time on input of size n is (eventually) dominated by \alpha \cdot n \log n for some constant \alpha. So if we consider the running time for all input of size n and take the maximum of that, that is the worst case running time.  Let t(n) denote this.  Saying t(n)\in O(n\log n) simply means that there exists an \alpha such that \exists N\,:\,\forall n > N\,:\, t(n) \leq \alpha \cdot n \log n.

Sometimes we don’t want to argue about worst case running time. For example sometimes we worry more about the expected running time, then the t(n) function would take the expectation rather than the maximum over all input of size n.  If this is the case, we make it explicit, so whenever we talk about O(n \log n) without any qualifier, we are talking worst case running time.

By this definition, t(n)\in O(n \log n) doesn’t mean that t(n) is some function t(n) = \beta n \log n or anything like that.  The big-O notation only puts an upper bound on t(n), it doesn’t say that this upper bound is tight or anything.  We have other notations for that, but you don’t see it so much out of algorithmics circles.

Anyway, to simplify matters, I am going to assume for the following that n \log n is a tight bound, and by this I mean that t(n) = \beta \cdot n \log n + o(n \log n) for some constant \beta.  Here the little-O notation means that there might be some extra time consumption in t(n) but that \beta \cdot n \log n - t(n) tends to zero as n tends to infinity.

Can you prove a certain running time with experiments?

For an O(n \log n) algorithm, we mathematically prove that the worst case running time is dominated by some \alpha n \log n.  We don’t experiment our way out of it.  But if we wanted to, could we?

Let’s assume for a second that we don’t have to worry about all input of size n to compute t(n) – and that’s a big assumption in many cases, since many algorithms have different running time depending on what the actual input is, so the running time isn’t really a function of just the size of the input.  If we really did have to worry about this we would need to run our program on all possible input of size n for each n we want to examine, and the combinatorial complexity of this would soon make it impossible to even attempt.

So let’s assume that the running time only depends on the size of the input. Could we then distinguish between, say t(n) = \beta \cdot n \log n and t(n) = \beta \cdot n \log n + \delta \cdot n^2?

If \delta \ll \beta it is pretty hard!

ns <- seq(1,1000)
t1s <- 5 * ns * log(ns)
t2s <- 5 * ns * log(ns) + 1e-4 * ns**2

plot(ns, t1s, type='l', col='blue', xlab='n', ylab='t(n)')
lines(ns, t2s, col='red')

The two are pretty close!  And that is ignoring measurement errors that would muddy the picture further.

Of course, eventually, for some large n, the \delta\cdot n^2 term will dominate the \beta\cdot n \log n term, but for the range you measure it might not.

With your experiments you are best just validating that the algorithm, mathematically proved to be O(n \log n) was implemented correct and appears to be running in time \beta \cdot n \log n (but again remember that this is assuming that t(n) only depends on n and not the actual input and that t(n) = \beta \cdot n \log n + o(n \log n) i.e. that the bound is tight.

Validating an n log n running time

So how do you validate that t(n) = \beta \cdot n \log n?

Well, if t(n) = \beta \cdot n \log n then t(n)/\log n = \beta \cdot n, so just plot t(n)/\log n and check if it looks like a line!

ns <- seq(1,1000)
ts <- 5 * ns * log(ns)

plot(ns, ts / log(ns), type='l', col='blue', xlab='n', ylab='t(n)/log(n)')

Simple, eh?

What about the base of the logarithm?

In the algorithm my students are implementing, the \log n in O(n \log n) isn’t the natural logarithm.  It rarely is in computer science.  It isn’t even the base-2 logarithm as it often is.  The base actually depends on the kind of input (but let’s keep ignoring that).

What do you do when you don’t know the base of the logarithm and you want to validate an O(n \log n) algorithm?  Do you need to divide t(n) with the right base log to get a line?

Of course not.  Remember \log_b x = \frac{\log_k x}{\log_k b}, so if the running time is with a log base b: t(n) = \beta \cdot n \log_b n = \frac{\beta}{\log_k b} \cdot n \log_k n, and you plot it with a log base k, then t(n)/\log_k n = \frac{\beta}{\log_k b} \cdot n is still a line.

ns <- seq(1,1000)
ts <- 5 * ns * log10(ns)                # base 10 logarithm

## plotting with base 2 logarithm
plot(ns, ts / log2(ns), type='l', col='blue', xlab='n', ylab=expression(t(n)/log[2](n)))

That is why, when we write the complexity as O(n \log n) we don’t bother with the base.  All logs are the same under a big-O.

In practice, of course, the base of the logarithm matters as it affects the constant in front of the n \log n, but for the big-O notation it doesn’t, and for validating that your program runs in time O(n\log n) it doesn’t either.