# Ok, maybe I’m just in a foul mood

Ok, about my previous post… I was ranting a bit there.  I totally mean what I wrote, it is not that, but maybe I went a bit over the top.

I just got a teaching evaluation back from a class last term, and that was pretty bad.

Only five students actually filled out the evaluation, so I probably shouldn’t read too much into it, but still… I’m pretty pissed off about it.

After taking a teaching course, I tried engaging the students more in the lectures.  I would ask questions during the lectures and actually – and this is what they complain about – would stop the lecture and wait for someone to come up with an answer.

Now they complain that I waste time on this.  That it is too hard, because even if the questions are ones they should know the answer to, they still need to be able to prepare for them, etc.

Yes, you should prepare for it.  If you show up unprepared for the lectures, what do you expect to get out of them?

It seems to me that the more work I put into actually teaching – I mean actually trying to get the students to understand the subject – the more grief I get.  My evaluation when I just showed a few slides with this and that and told a joke was so much better.

I’m just not cut out to be a teacher after all…

182-184=-2

# Well that didn’t take long…

Already we are getting the first complaints about the exams.  The exams we finished today.

I have never received any complaints about exams in computer science, biology or statistics, and I have taught or censored plenty of those.

Molecular medicine is different, it seems.

I got two formal complaints last term, where I taught a class for those students, and we are already getting complaints now about the exams I censored this week.

What is it with these people? Is the only acceptable grade the top grade?

Objectively, if you look at the distribution of grades, something is wrong.  We have given way too many top grades.  But now we are getting complaints from the students that didn’t get the top grade.  They feel their grade is unfair.

Now, of course that could be true.  An oral exam does have the risk of being somewhat subjective. We could be treating the students unfairly.  Still, we try very very hard not to.  We make plenty of notes during the exams.  We compare their presentation against the learning objectives.  We have a list of important and less important points that they need to get right and we base our grade on those.

They clearly don’t trust us to be able to make that evaluation, though.

If they feel the exam went well, it must have gone well.  If we give them a low grade, something must be wrong.

After each exam, we spend 2-3 minutes evaluating the exam with the student.  Telling her what was good and what was bad.  That used to be enough, but with this batch, it clearly isn’t. They are still going to question everything we do, ’cause if they don’t get the grade they want, clearly we must have made some mistake.  They surely didn’t.

I’m fed up with this.

Next time, maybe multiple choice exams are the way to go.  Those are terrible at actually evaluating what the students can do – above pre-structural knowledge about the topics – but if they really want sub-optimal teaching I sure as hell can give it to them.

182-183=-1

# What is a p-value

One thing that shocked me in the last three days of exams was the students’ understanding of p-values.  Not that all of them misunderstood them, not by far, but some had a very flawed understanding, and the mind really boggles at how they can think what they do and still use p-values the way they do…

I’m not really a fan of p-values myself, for the reasons I wrote about in great detail before, but p-values are probably here to stay so people really need to understand this if they want to do any kind of statistics!

### What is a p-value

It can be a bit tricky to get the definition right, so I’ll just quote Matthew Stephens:

> A p value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.

So if you have a stochastic variable $$X$$, distributed as $$P(X)$$ under the null hypothesis, and if you observe the value $$x$$, then $$p=P(X\geq x)$$ (this is a one sided p-value just for convenience).
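To make the definition concrete, here is a small sketch with a made-up example that is not from the exams: the null hypothesis is a fair coin, we flip it 20 times and observe 15 heads. The function names and the numbers are my own assumptions for illustration. The first function computes $$P(X\geq x)$$ exactly, the second does the "hypothetically repeat the experiment" part of the quote by simulation.

```python
import math
import random


def exact_p_value(n, x):
    """One-sided p-value P(X >= x) when X ~ Binomial(n, 0.5) under the null."""
    return sum(math.comb(n, k) for k in range(x, n + 1)) / 2 ** n


def simulated_p_value(n, x, repeats=100_000, seed=1):
    """Fraction of simulated null experiments that are at least as extreme as x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # n random bits are n fair coin flips; count the ones as heads.
        heads = bin(rng.getrandbits(n)).count("1")
        if heads >= x:
            hits += 1
    return hits / repeats


# Hypothetical observation: 15 heads in 20 flips of a supposedly fair coin.
print("exact:    ", exact_p_value(20, 15))
print("simulated:", simulated_p_value(20, 15))
```

The exact value is 21700/1048576, about 0.0207, and the simulated proportion should land close to it. Neither number says anything about how probable it is that the coin is fair.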

One of the reasons it is a bit tricky is that it is not itself a random variable.  Once you have observed $$x$$ it is fixed.  It just tells you how likely it is that you observe something more extreme than what you actually observed.

If you repeat the experiment and observe another $$x$$ then that would have another p-value, so in that sense there is some stochastic behavior, but only in that sense.

Actually, if you repeat the experiment lots of times, and all the observations are actually from the null distribution, then the p-values will be uniformly distributed between 0 and 1.  This is why, if we have a significance threshold of $$\alpha$$, we expect a fraction $$\alpha$$ of the observations to be false positives.  That is, if we sample $$N$$ values from the null distribution, we expect about $$\alpha\cdot N$$ significant observations.  By chance.  They are false positives since we consider them significantly different from what we “would expect” even though they behave exactly as expected.
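If you don't trust the uniformity claim, it is easy to check by simulation. This is only a sketch under an assumption I picked for convenience: the null distribution is a standard normal, so the one-sided p-value is the upper tail probability. The function names are mine.

```python
import math
import random


def one_sided_p(x):
    """P(X >= x) for X ~ N(0, 1), the assumed null distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def null_p_values(n, seed=1):
    """p-values of n observations that really are drawn from the null."""
    rng = random.Random(seed)
    return [one_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n)]


def decile_fractions(p_values):
    """Fraction of p-values falling in [0, 0.1), [0.1, 0.2), ..., [0.9, 1]."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return [c / len(p_values) for c in counts]


p_values = null_p_values(100_000)
alpha = 0.05
false_positive_rate = sum(p < alpha for p in p_values) / len(p_values)

print("fraction per decile:", [round(f, 3) for f in decile_fractions(p_values)])
print("fraction below alpha:", round(false_positive_rate, 4))
```

Each decile should hold roughly a tenth of the p-values, and roughly a fraction $$\alpha$$ of them should fall below the threshold, even though every single observation came from the null.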

### What is a p-value not

A p-value is not the probability that the null hypothesis is true.  It really isn’t.

The p-values are uniformly distributed if we sample values from the null distribution, so how can it be?  All p-values are exactly equally likely under the null hypothesis.  It is not any more likely to get a p-value of 0.99 than it is of getting a p-value of 0.01, if the observations are really from the null distribution.

If you think that this distinction on what a p-value really is, is somewhat technical and not really that important, then let me ask you this: if the p-value is really the probability that the null hypothesis is true then why do we typically only reject the null hypothesis if the p-value is below 5%?

If the p-value was really that, wouldn’t we go for all p-values below 50%?

Those would be the observations where it is more likely that the observation is from the alternative distribution than from the null distribution.  If you had to make a bet on whether the observation is from the null distribution or the alternative, you would be wrong 95% of the time if you only picked those where the p-value is below 0.05!

This is where the mind boggles.

If you really believe that the p-value is the probability that the null is true, then how can you go ahead and only bet on that when the p-value is below 0.05?

It is not some subtle definition here, what you are doing if you really believe that the p-value is the probability that the null model is true is just plain stupid.  You are making bets against common sense.

True, if somehow false positives are much more expensive than false negatives you might have a point here, but if not you are just making a lot more mistakes than you could be making.

If you pick the significant values from the 5% significance threshold, and p-values really are the probability that the null is true, then you would end up with at least 95% true positives among those you select.

If that is what you are aiming for, then what you are doing makes sense. It is not what you will get – because p-values are just not what you think they are – but at least it makes sense. It does mean that only 5% of those values you pick would be false positives, though, so in the downstream analysis you shouldn’t explain away more than 5% as “probably false positives”.  That would be inconsistent.

By now, if a little late in the post, I should probably say that this confusion came up in the context of genome wide association studies.  Here you expect that the vast majority of genetic variation has no relation to any given phenotype, so if you look at genotypic variation and its association with any given phenotype, you would expect very few variations to be true positives.

If, in this context, you use a 5% significance threshold, you should – again, assuming your understanding of p-values is correct – find 95% true positives and 5% false positives.

If that is true, would you do a multiple test correction?

Wouldn’t it be just fine if you had 95% true positives and 5% false positives?

Would you really try to explain away the vast majority of your “hits” as false positives if you only expect 5% of them really to be false positives?

It seems a bit inconsistent to me…
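A quick simulation shows what you actually get at a 5% threshold when almost everything is null. All numbers here are made up for illustration: 100,000 variants of which 100 are truly associated, test statistics that are standard normal under the null, and a shift of 5 standard deviations for the true associations. Bonferroni is used as just one example of a multiple test correction.

```python
import math
import random


def one_sided_p(z):
    """P(Z >= z) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def count_hits(n_null, n_alt, effect, alpha, seed=1):
    """Return (false positives, true positives) at significance threshold alpha."""
    rng = random.Random(seed)
    false_pos = sum(one_sided_p(rng.gauss(0.0, 1.0)) < alpha for _ in range(n_null))
    true_pos = sum(one_sided_p(rng.gauss(effect, 1.0)) < alpha for _ in range(n_alt))
    return false_pos, true_pos


# Hypothetical study: 99,900 unrelated variants and 100 truly associated ones.
n_null, n_alt, effect = 99_900, 100, 5.0

fp, tp = count_hits(n_null, n_alt, effect, alpha=0.05)
print("uncorrected:", fp, "false and", tp, "true positives,",
      "false share", round(fp / (fp + tp), 3))

# Bonferroni correction: divide the threshold by the number of tests.
fp_b, tp_b = count_hits(n_null, n_alt, effect, alpha=0.05 / (n_null + n_alt))
print("Bonferroni: ", fp_b, "false and", tp_b, "true positives")
```

Under these assumptions the uncorrected hits are overwhelmingly false positives, nowhere near the 5% you would expect if the p-value were the probability that the null is true.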

### p-values and evidence

If you ever hear yourself saying that low p-values are “more significant” than high p-values, stop yourself!

Not that it is wrong, as such, but here the argument really is more subtle.

Under the null hypothesis, any p-value is exactly equally likely.  They are uniformly distributed.

Under the null hypothesis, it is exactly as likely to observe a p-value of 0.99 as a p-value of $$10^{-99}$$.

If you say “more significant” you are just wrong.  A p-value is either significant or not, since it boils down to whether it is under the significance threshold or not.

If you say that it is more likely to be a true positive than a false positive, you might be right, but that has to do with the distribution of p-values under the alternative hypothesis.

The reason we look at p-values in the first place is because we think that extreme values of $$x$$ are more likely under the alternative distribution than under the null distribution.

If something is unlikely under the null distribution, it might be less unlikely under the alternative distribution.  That is why we are interested in small p-values.  It has nothing to do with how likely or unlikely they are under the null distribution.

### A quick comment on p-values and sample size

Your power to detect true positives over false positives is related to sample size, in the sense that the more data you have the more likely it is that you can detect differences between the null and the alternative distribution.

But in what way?

There seems to be some confusion here as well.

If we use the right definition of p-values, then you would expect 5% false positives if you use a threshold of 5%.  That is what a threshold of 5% means.

If you increase your sample size, would that reduce the number of false positives?  No, it wouldn’t!  Regardless of the sample size, the threshold is chosen such that you expect 5% false positives.  The actual threshold values will change, but not the fact that you expect 5% false positives.  That is what the 5% significance threshold is.  It is its being; its purpose in life; what it really is.

By increasing the sample size, the best you can hope for is that more true positives make it below the threshold.  So among the positives you will get a larger fraction of true positives over false positives.  The expected number of false positives cannot change unless you change the threshold.

If you sample values from the null distribution, and you use a 5% significance threshold, you get 5% false positives.

The sample size really only matters if you consider a mixture of the null distribution and the alternative distribution.

How values are distributed under such a mixture depends on the sample size, and the two distributions are easier to distinguish between the larger the sample size (hopefully, at least).  You will always get 5% of the null distribution values with a 5% threshold, but you might get more and more of the alternative distribution values if you increase the sample size.
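Here is a sketch of that point, under assumptions I chose for illustration: a one-sided z-test of a mean with known standard deviation 1, a null mean of 0, and an alternative mean of 0.2. To keep it fast, the sample mean is drawn directly from its sampling distribution $$N(\mu, 1/n)$$ instead of averaging $$n$$ individual draws.

```python
import math
import random
from statistics import NormalDist


def rejection_rate(true_mean, n, alpha=0.05, repeats=20_000, seed=1):
    """Fraction of experiments with sample size n that reject the null mean of 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # critical value for the z statistic
    se = 1 / math.sqrt(n)                     # standard error of the sample mean
    rejections = 0
    for _ in range(repeats):
        sample_mean = rng.gauss(true_mean, se)
        if sample_mean / se > z_crit:
            rejections += 1
    return rejections / repeats


for n in (10, 100, 1000):
    false_positive_rate = rejection_rate(0.0, n, seed=n)  # data from the null
    power = rejection_rate(0.2, n, seed=n + 1)            # data from the alternative
    print(n, round(false_positive_rate, 3), round(power, 3))
```

The false positive rate should stay around 5% for every sample size, while the fraction of alternative-distribution values that make it below the threshold grows with the sample size.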

### A quicker comment on p-values and sample size

This is a bit of a trickier point.

In real life, you would not actually expect the same number of false positives if you increase the sample size.

Yes, mathematically, if you have a 5% threshold you would get 5% false positives regardless of the sample size, because 5% of the samples from the null distribution would fall below the threshold.

In real life, the number of false positives would increase if you increase the sample size.

What???

What happens is this: In the real world, no simple null hypothesis is true.  The world is a lot more complicated than simple statistical models.

If, in an association study, your null hypothesis is that there is no genetic difference between cases and controls, you are really saying that there is no genetic difference between cases and controls.  But there will be!  If you sample at random, that is true, but you probably don’t.  There will be subtle differences from sources unrelated to the disease.

Let me give you a simpler example.  Consider tossing a die.  The null hypothesis would be that all six sides are equally likely.

If we have two dice, one that is loaded and one that is not, we could be looking for values from the one that is loaded.

If we throw each die once and record the result, we wouldn’t be able to distinguish between them.  I mean, if one is a five and the other is three, which do you think is loaded?

If, on the other hand, we throw them 100 times and notice that one of them hits 6 half of the time, we would be pretty sure that that is the loaded die.

Now, if we throw the dice a million times, and test if each side is equally likely, chances are that we would conclude that both the dice are loaded.  Because the “fair” die is actually loaded, just less so than the “loaded” die.

The “one” side is heavier than the “six” side, because of how the eyes are drilled.  That will show up in a million throws of the die.  Probably not in 100 throws, but with enough throws it will.

The mathematical model of a die does not match reality.  The null model is not true.

The null model is what we test for, and unless it is exactly true, we will reject it with enough samples.
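You can see this with a simulated "fair" die. The size of the bias is something I made up (the six comes up with probability 1/6 + 0.005 and the one with 1/6 - 0.005), and the test is a plain chi-square goodness-of-fit test against the uniform null, using the standard 5% critical value of about 11.07 for 5 degrees of freedom.

```python
import random

CHI2_CRIT_DF5 = 11.07  # 95th percentile of the chi-square distribution, 5 d.o.f.


def chi_square_stat(counts):
    """Goodness-of-fit statistic against the null that all six sides are equally likely."""
    expected = sum(counts) / 6
    return sum((c - expected) ** 2 / expected for c in counts)


def throw(weights, n, rng):
    """Throw a die with the given side probabilities n times; return counts per side."""
    counts = [0] * 6
    for side in rng.choices(range(6), weights=weights, k=n):
        counts[side] += 1
    return counts


# Hypothetical "fair" die: the one is slightly less likely, the six slightly more.
eps = 0.005
almost_fair = [1 / 6 - eps, 1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 6 + eps]

rng = random.Random(1)
for n in (100, 1_000_000):
    stat = chi_square_stat(throw(almost_fair, n, rng))
    print(n, "throws: statistic", round(stat, 1), "reject:", stat > CHI2_CRIT_DF5)
```

With 100 throws the statistic behaves essentially as it would for a truly fair die, so it only rejects about as often as the 5% threshold allows. With a million throws the tiny bias dominates and the "fair" die gets rejected.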

As we increase the sample size, we will get more and more false positives.

The only reason that increasing the sample size is sensible at all is that values from the alternative distribution move away from the theoretical null distribution faster than values from the real, almost-null distribution do.

With increased sample size, the fraction of true positives over false positives will improve – up to a point – but the absolute number of false positives will actually increase.

In most cases not something to worry about, but with very large sample sizes and simple models you probably should.

I have rejected a few papers based on this, actually, where the result is rejecting a simple linear model based on an enormous sample size.  Such results are really to be expected, just by chance, because the simple models are never true…

182-182=0

# Off to exams again…

In half an hour I’m off to the last day of exams in medical genome analysis, and my last day of exams in this period.

These are twenty-minute oral exams where each student presents a project she did during the class, after which we shoot some questions about the project.  This is pretty typical for how we run exams here at AU.

I am having some problems with this particular class, though.  I am only censoring, and the two teachers on the class are running the actual examination, so I am not the only one with the problem.

The problem is, it is very hard to grade the students.  The distribution of grades right now, two thirds into the exams, is this:

I keep track of the grades, the time each student actually gets, the time of day, and lots of other info in a spreadsheet – plus of course have comments on the exam in text files – just because I’m a data junkie (and a little bit in case anyone complains about the exam later on).

The distribution looks a bit like a mixture, where you either get 12 – the top grade – or a normal distribution around seven.  The latter is how it is supposed to look, really. Like this one, from string algorithms I taught last term:

UB means that the student didn’t show up, and 0 that they failed, so those should not be normally distributed… the rest look okay.

I’m not sure that such a mixture is really what is going on, though.  I think it is more that the 10s end up as 7 or 12 – most likely the latter – and what we really have is a distribution with a mode at 12 that then drops off as the grades get lower.

Now, there is not an inherent problem in giving too many top grades.  I don’t think any of the grades are incorrect when comparing the exam performance against the learning goals of the class, and that is of course what we should grade against.

It just looks like the requirements for the exams are such that it is too hard to separate out the best of the class.  There is essentially no differentiation between the good and the very best.  We should be able to differentiate between them.  We need that differentiation to pick the post grad students out of the pack.  Here, we just cannot, ’cause as long as they meet all the course requirements they should get the top grade.

We’ve talked a lot about this the last two days.  We cannot change the requirements now, of course, but something must be done before the next class.

It is just not obvious what.  The learning goals really do match what we want them to learn.  Maybe it is just the examination that must be changed, so we have a better way of testing how deep an understanding they actually have of the subject.

I have no idea how to do that, though.  I find doing that a lot easier in mathematics or computer science classes, but this is really a class about the practical problems in medical genomics, and I do not have enough experience in examinations in something like that.