One thing that shocked me in the last three days’ exams was the students’ understanding of p-values. Not that all of them misunderstood them, not by far, but some had a *very* flawed understanding, and the mind really boggles at how they can think what they do and still use p-values the way they do…

I’m not really a fan of p-values myself, for the reasons I wrote about in great detail before, but p-values are probably here to stay, so people really *need* to understand them if they want to do any kind of statistics!

### What is a p-value

It can be a bit tricky to get the definition right, so I’ll just quote Mathew Stephens:

> A p value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.

So if you have a stochastic variable $$X$$, distributed as $$P(X)$$ under the null hypothesis, and if you observe the value $$x$$, then $$p=P(X\geq x)$$ (this is a one sided p-value just for convenience).
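To make the definition concrete, here is a minimal sketch that assumes the null distribution is a standard normal. That choice, and the function name `one_sided_p_value`, are mine for illustration only; nothing here depends on the null being normal.

```python
import math

def one_sided_p_value(x: float) -> float:
    """P(X >= x) for X ~ N(0, 1), computed with the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# An observation right at the centre of the null distribution.
print(one_sided_p_value(0.0))    # 0.5
# An observation close to the usual one-sided 5% cutoff.
print(one_sided_p_value(1.645))  # roughly 0.05
```

The larger the observed $$x$$, the smaller the tail area to its right, and so the smaller the p-value.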

One of the reasons it is a bit tricky is that it is not itself a random variable. Once you have observed $$x$$ it is fixed. It just tells you how likely it is that you observe something more extreme than what you actually observed.

If you repeat the experiment and observe another $$x$$ then that would have another p-value, so in that sense there is some stochastic behavior, but only in that sense.

Actually, if you repeat the experiment lots of times, and all the observations are actually from the null distribution, then the p-values will be uniformly distributed between 0 and 1. This is why, if we have a significance threshold of $$\alpha$$, we expect a fraction $$\alpha$$ of the observations to come out as false positives. That is, if we sample $$N$$ values from the null distribution, we expect about $$\alpha\cdot N$$ significant observations. By chance. They are false positives since we consider them significantly different from what we “would expect” even if they behave exactly as expected.
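If you don’t trust the uniformity claim, it is easy to check by simulation. The sketch below again assumes a standard normal null and a one-sided test; the sample size, the seed and the helper names are arbitrary choices of mine.

```python
import math
import random

def one_sided_p_value(x):
    """P(X >= x) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

rng = random.Random(1)   # fixed seed so the run is reproducible
alpha = 0.05
n = 100_000

# Every observation is drawn from the null distribution itself.
p_values = [one_sided_p_value(rng.gauss(0.0, 1.0)) for _ in range(n)]

# Fraction declared significant: all of these are false positives.
false_positive_fraction = sum(p < alpha for p in p_values) / n

# Count p-values in ten equal-width bins; each should hold about 10%.
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1

print(false_positive_fraction)
print([round(b / n, 3) for b in bins])
```

The first number should land close to 0.05 and every bin close to 0.1, which is the uniform distribution showing up.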

### What is a p-value not

A p-value is *not* the probability that the null hypothesis is true. It *really* isn’t.

The p-values are uniformly distributed if we sample values from the null distribution, so how can it be? All p-values are *exactly* equally likely under the null hypothesis. It is not any more likely to get a p-value of 0.99 than it is of getting a p-value of 0.01, if the observations are really from the null distribution.

If you think that this distinction on what a p-value really is, is somewhat technical and not really that important, then let me ask you this: if the p-value is really the probability that the null hypothesis is true then why do we typically only reject the null hypothesis if the p-value is below 5%?

If the p-value was really that, wouldn’t we go for all p-values below 50%?

Those would be the observations where it is more likely that the observation is from the alternative distribution than from the null distribution. If you had to make a bet on whether the observation is from the null distribution or the alternative, you would be wrong 95% of the time if you only picked those where the p-value is below 0.05!

This is where the mind boggles.

If you really believe that the p-value is the probability that the null is true, then how can you go ahead and only bet on that when the p-value is below 0.05?

It is not some subtle definition here, what you are doing if you really believe that the p-value is the probability that the null model is true is just plain stupid. You are making bets against common sense.

True, if somehow false positives are much more expensive than false negatives you might have a point here, but if not you are just making a lot more mistakes than you could be making.

If you pick the significant values from the 5% significance threshold, and p-values really are the probability that the null is true, then you would end up with at least 95% true positives among those you select.

If that is what you are aiming for, then what you are doing makes sense. It is not what you will get – because p-values are just not what you think they are – but at least it makes sense. It does mean that only 5% of those values you pick would be false positives, though, so in the downstream analysis you shouldn’t explain away more than 5% as “probably false positives”. That would be inconsistent.

By now, if a little late in the post, I should probably say that this confusion came up in the context of genome wide association studies. Here you expect that the vast majority of genetic variation has no relation to any given phenotype, so if you look at genotypic variation and its association with any given phenotype, you would expect very few variants to be true positives.

If, in this context, you use a 5% significance threshold, you should – again, assuming your understanding of p-values is correct – find 95% true positives and 5% false positives.

If that is true, would you do a multiple test correction?

Wouldn’t it be just fine if you had 95% true positives and 5% false positives?

Would you really try to explain away the vast majority of your “hits” as false positives if you only expect 5% of them really to *be* false positives?

It seems a bit inconsistent to me…
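What you actually get in such a setting is easy to simulate. This is a toy sketch, not a real association test: the number of variants, the number of truly associated ones and the effect size are all made-up values, and the test statistic is assumed to be normal.

```python
import math
import random

def one_sided_p_value(x):
    """P(X >= x) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

rng = random.Random(7)
n_variants = 100_000   # hypothetical number of tested variants
n_causal = 100         # hypothetical number with a real effect
effect = 3.0           # hypothetical shift of the statistic under the alternative
alpha = 0.05

# Variants with no effect: statistic drawn from the null distribution.
false_hits = sum(
    one_sided_p_value(rng.gauss(0.0, 1.0)) < alpha
    for _ in range(n_variants - n_causal)
)
# Variants with a real effect: statistic drawn from the shifted distribution.
true_hits = sum(
    one_sided_p_value(rng.gauss(effect, 1.0)) < alpha
    for _ in range(n_causal)
)

false_fraction = false_hits / (false_hits + true_hits)
print(false_hits, true_hits, round(false_fraction, 3))
```

With these numbers almost all the hits are false positives, even though every one of them has a p-value below 0.05. That is why you correct for multiple tests, and why the “95% true positives” reading cannot be right.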

### p-values and evidence

If you ever hear yourself saying that low p-values are “more significant” than high p-values, stop yourself!

Not that it is wrong, as such, but here the argument really *is* more subtle.

Under the null hypothesis, *any* p-value is *exactly* equally likely. They are uniformly distributed.

Under the null hypothesis, it is *exactly* as likely to observe a p-value of 0.99 as a p-value of $$10^{-99}$$.

If you say “more significant” you are just wrong. A p-value is either significant or not, since it boils down to whether it is under the significance threshold or not.

If you say that it is more likely to be a true positive than a false positive, you might be right, but that has to do with the distribution of p-values under the *alternative* hypothesis.

The reason we look at p-values in the first place is because we think that extreme values of $$x$$ are more likely under the alternative distribution than under the null distribution.

If something is unlikely under the null distribution, it might be less unlikely under the alternative distribution. That is why we are interested in small p-values. It has nothing to do with how likely or unlikely they are under the null distribution.
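Here is a small sketch of that argument in numbers. It assumes a standard normal null and an alternative that is shifted by two standard deviations; the shift is a made-up value, and `prob_p_below` is just a helper name I picked.

```python
from statistics import NormalDist

null = NormalDist(0.0, 1.0)
alt = NormalDist(2.0, 1.0)   # hypothetical alternative, shifted by 2

def prob_p_below(threshold, dist):
    """P(p-value < threshold) when the statistic X is drawn from dist."""
    # x such that P(X >= x) equals the threshold under the null.
    cutoff = null.inv_cdf(1.0 - threshold)
    return 1.0 - dist.cdf(cutoff)

for t in (0.5, 0.05, 0.001):
    under_null = prob_p_below(t, null)   # equals t, by uniformity
    under_alt = prob_p_below(t, alt)
    print(t, under_null, under_alt, under_alt / under_null)
```

Under the null the probability of landing below a threshold is just the threshold itself. Under the alternative it is larger, and the ratio between the two grows as the threshold shrinks. That ratio, not the p-value alone, is what carries the evidence.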

### A quick comment on p-values and sample size

Your power to detect true positives over false positives is related to sample size, in the sense that the more data you have the more likely it is that you can detect differences between the null and the alternative distribution.

But in what way?

There seems to be some confusion here as well.

If we use the right definition of p-values, then you would expect 5% false positives if you use a threshold of 5%. That is what a threshold of 5% *means*.

If you increase your sample size, would that reduce the number of false positives? No, it wouldn’t! Regardless of the sample size, the threshold is chosen such that you expect 5% false positives. The actual threshold values will change, but not the fact that you expect 5% false positives. That is what the 5% significance threshold *is*. It is its *being*; its purpose in life; what it really *is*.

By increasing the sample size, the best you can hope for is that more *true* positives make it below the threshold. So among the positives you will get a larger fraction of *true* positives over false positives. The expected number of false positives cannot change unless you change the threshold.

If you sample values from the null distribution, and you use a 5% significance threshold, you get 5% false positives.

The sample size really only matters if you consider a mixture of the null distribution and the alternative distribution.

How values are distributed under such a mixture depends on the sample size, and the two distributions are easier to distinguish between the larger the sample size (hopefully, at least). You will always get 5% of the null distribution values with a 5% threshold, but you might get more and more of the alternative distribution values if you increase the sample size.
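A minimal sketch of this, assuming a one-sided z-test on the mean of $$n$$ observations with unit variance, so that the test statistic is normal with mean equal to the effect times $$\sqrt{n}$$. The effect size of 0.2 is a made-up value.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05
cutoff = std_normal.inv_cdf(1.0 - alpha)   # reject when the z-statistic exceeds this
effect = 0.2   # hypothetical per-observation effect, in standard deviations

def rejection_rate(true_effect, n):
    """Chance of rejecting when the z-statistic is N(true_effect * sqrt(n), 1)."""
    return 1.0 - std_normal.cdf(cutoff - true_effect * math.sqrt(n))

for n in (10, 100, 1000):
    # First rate: values from the null. Second rate: values from the alternative.
    print(n, round(rejection_rate(0.0, n), 3), round(rejection_rate(effect, n), 3))
```

The rejection rate for null values stays at 5% no matter what $$n$$ is, while the rate for alternative values climbs towards 1.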

### A quicker comment on p-values and sample size

This is a bit of a trickier point.

In real life, you would not actually expect the same number of false positives if you increase the sample size.

Yes, mathematically, if you have a 5% threshold you would get 5% false positives regardless of the sample size, because 5% of the samples from the null distribution would fall below the threshold.

In real life, the number of false positives would *increase* if you increase the sample size.

What???

What happens is this: In the real world, no simple null hypothesis is true. The world is a lot more complicated than simple statistical models.

If, in an association study, your null hypothesis is that there is no genetic difference between cases and controls, you are really saying that *there is no genetic difference between cases and controls*. But there will be! If you sample at random, that is true, but you probably don’t. There will be subtle differences from sources unrelated to the disease.

Let me give you a simpler example. Consider tossing a die. The null hypothesis would be that all six sides are equally likely.

If we have two dice, one that is loaded and one that is not, we could be looking for values from the one that is loaded.

If we throw each die once and record the result, we wouldn’t be able to distinguish between them. I mean, if one is a five and the other is three, which do you think is loaded?

If, on the other hand, we throw them 100 times and notice that one of them hits 6 half of the time, we would be pretty sure that *that* is the loaded die.

Now, if we throw the dice a million times, and test if each side is equally likely, chances are that we would conclude that *both* the dice are loaded. Because the “fair” die is actually loaded, just less so than the “loaded” die.

The “one” side is heavier than the “six” side, because of how the eyes are drilled. That will show up in a million throws of the die. Probably not in 100 throws, but with enough throws it will.

The mathematical model of a die does not match reality. The null model is not true.

The null model is what we test for, and unless it is *exactly* true, we will reject it with enough samples.

As we increase the sample size, we will get more and more false positives.
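To put rough numbers on the die example, the sketch below uses a normal approximation to a two-sided test of “the probability of a six is 1/6”. The bias of 0.002 for the supposedly fair die is a number I made up; I have no idea how biased a real die is.

```python
import math
from statistics import NormalDist

def rejection_probability(true_p, n, null_p=1 / 6, alpha=0.05):
    """Approximate chance that a two-sided z-test rejects 'P(six) = null_p'."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    null_sd = math.sqrt(null_p * (1.0 - null_p) / n)
    # Normal approximation to the observed fraction of sixes in n throws.
    observed = NormalDist(true_p, math.sqrt(true_p * (1.0 - true_p) / n))
    lower = null_p - z_crit * null_sd
    upper = null_p + z_crit * null_sd
    # Reject when the observed fraction falls outside [lower, upper].
    return observed.cdf(lower) + (1.0 - observed.cdf(upper))

almost_fair = 1 / 6 + 0.002   # hypothetical tiny bias of a "fair" die

for n in (100, 10_000, 1_000_000):
    print(n, round(rejection_probability(almost_fair, n), 4))
```

With 100 throws the almost fair die is rejected about as often as a perfectly fair one would be, around 5% of the time. With a million throws it is rejected nearly every time.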

The only reason that increasing sample size is sensible at all is that, as the sample grows, values from the alternative hypothesis move away from the theoretical null distribution faster than values from the real-world null, which is only *close* to the mathematical null distribution.

With increased sample size, the fraction of true positives over false positives will improve, up to a point, but the absolute number of false positives will actually increase.

In most cases not something to worry about, but with very large sample sizes and simple models you probably should.

I *have* rejected a few papers based on this, actually, where the result is rejecting a simple linear model based on an enormous sample size. Such results are really to be expected, just by chance, because the simple models are *never* true…
