The problem with p-values
In Matthew Stephens’ tutorial at APBC this January, he spent a few slides arguing for Bayes factors and against p-values. I’ve had discussions about this with statisticians in the past, but never really had enough strong arguments. For me it is more of a gut feeling that BFs give you a quantitative measure of the support for two alternative hypotheses, while p-values 1) a priori favour one hypothesis over the other and 2) (similarly but worse) completely ignore the alternative hypothesis.
After the tutorial, I now have some stronger arguments, and I’ve been thinking about it a bit since I got back and decided to write them down here.
If you are interested in the tutorial, you can get the outline and slides here:
Disclaimer: I’m not a statistician (or even mathematician); I’m a computer scientist with very little schooling in statistics. I use statistics a lot in my work, but I am mainly self-taught, so take this for what it’s worth, and feel free to comment.
What is a p-value?
Matthew defines p-values as:
A p value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.
which is roughly how it is usually defined, so no controversy here.
In “math”, if you have a stochastic variable $X$ and a value $x$, then the (one-sided) p-value for $x$ is $P(X \geq x)$.
Again, there is nothing controversial here. The math is what it is, nothing more and nothing less. The problem comes when we start to interpret it.
When we do statistics, we do not have a single distribution for $X$. It is only in math that stochastic variables have nice, known, distributions. In real life, $X$ can come out as just about anything.
Saying that $X$ can be anything of course doesn’t help us when we need to interpret data, so we do some mathematical modelling and assume that it has a certain distribution and check that this distribution looks “good enough for jazz” to describe $X$. We assume as little about $X$ as is reasonable, and call it the null model (or null distribution) $H_0$ for $X$.
Then the p-value for an outcome $x$ is really $P(X \geq x \mid H_0)$.
When we do statistics, we want to test if the null model is true, that is if $X$ really follows the $H_0$ distribution well enough for our purpose (it never really will have a nice distribution, but it might be good enough). So we compare the outcome of $X$ assuming it could be one of two distributions, our null distribution $H_0$ or an alternative distribution $H_1$.
Hypothesis testing
What we actually do is to perform an experiment to get an outcome of $X$, call it $x$, then compute $P(X \geq x \mid H_0)$ and if this value is small enough, usually below 0.05 or 0.01, then we reject $H_0$ and conclude that $X$ is probably distributed as $H_1$.
As an example, assume $H_0$ is a normal distribution with mean 0 and standard deviation 1. Then, for a one-sided test at the 5% level, the threshold for which $x$ we would accept as being under $H_0$ is 1.64. Below that, we accept the null hypothesis and above that we reject it.
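The threshold and the p-value in this example can be checked numerically. What follows is a minimal sketch using Python's standard library, assuming the one-sided test and the N(0,1) null model described above; the names `null`, `p_value` and `critical` are only illustrative.

```python
from statistics import NormalDist

# Null model H0 from the text: normal with mean 0 and standard deviation 1.
null = NormalDist(mu=0.0, sigma=1.0)

def p_value(x, dist=null):
    """One-sided p-value P(X >= x | H0) for an observed outcome x."""
    return 1.0 - dist.cdf(x)

# Critical value for a 5% one-sided test: the 95% quantile of H0.
critical = null.inv_cdf(0.95)

print("critical value:", round(critical, 3))
print("p-value at x = 1.64:", round(p_value(1.64), 4))
print("p-value at x = 0:", p_value(0.0))
```

The quantile comes out close to the 1.64 used in the text, and an outcome right at that threshold has a p-value of about 5%.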
In the plot below, if $X$ truly is distributed as $H_0$ we would accept an outcome with a probability that corresponds to the blue area, 95% of the total probability, and we would reject an outcome with a probability that corresponds to the orange area, 5% of the total probability.

This brings me to my two points at the top.
Notice that $H_1$ isn’t used when testing if $X$ is distributed as $H_0$ or $H_1$. We only use $H_0$ to make that decision. We just prefer $H_0$ as long as it is likely that $X$ follows that distribution (meaning that if it does we want $x$ in the low 95% or 99% probability range under $H_0$) but we don’t consider the probability of $x$ under the alternative distribution $H_1$.
Let us add an alternative distribution $H_1$, say a N(2,1) distribution, to the plot:

If the null distribution is the true distribution for $X$ we would still accept $H_0$ 95% of the time and reject it 5% of the time, but if $H_1$ is the true distribution, then we would reject $H_0$ for what amounts to the orange area and accept $H_0$ for what amounts to the blue area.
Is it reasonable to reject all the outcomes in the orange area? It is not even the case that $H_0$ is the most likely in all of that area. In the range from 1 to 1.64, we expect more outcomes from $H_1$ (the blue plus the orange area in the plot below) than from $H_0$ (the orange area below).

If I were a betting man, I would probably bet on $H_0$ to the left of 1 and on $H_1$ to the right of 1.
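The crossover at 1 can be verified by comparing the two densities directly. This is a small sketch, assuming the N(0,1) null and the N(2,1) alternative from the example; `h0`, `h1` and `more_likely` are illustrative names.

```python
from statistics import NormalDist

h0 = NormalDist(0.0, 1.0)  # null distribution from the example
h1 = NormalDist(2.0, 1.0)  # alternative distribution from the example

def more_likely(x):
    """Name the hypothesis under which the outcome x has the higher density."""
    return "H0" if h0.pdf(x) > h1.pdf(x) else "H1"

for x in (0.5, 1.3, 1.64, 2.5):
    print(x, round(h0.pdf(x), 3), round(h1.pdf(x), 3), more_likely(x))
```

For outcomes between 1 and 1.64 the alternative has the higher density, even though a 5% test would still accept the null there.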

Mind you, it is not completely unreasonable to a priori prefer $H_0$. We usually pick $H_0$ to be the most parsimonious hypothesis, so we do want to prefer it unless evidence goes against it. So in a sense, there is nothing wrong with a priori preferring one hypothesis over another, but does it make sense to completely ignore the alternative hypothesis when deciding if it is less or more likely to be true than the null hypothesis?
I’ll get back to this in “Bayesian hypothesis testing” below, but there are a few more points I want to make first…
P-values are uniformly distributed; they don’t tell you all that much…
The title really says it all. P-values are uniformly distributed (under the null hypothesis).
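The outcomes are not. It is not that each outcome $x$ is equally likely under $H_0$, but the distribution of p-values of the outcomes is uniform.
To see this, consider the probability of having a p-value in a small range $[p, p + \delta]$. That p-values are uniform means that the probability of hitting this interval is $\delta$ (since the full range of p-values is 0 to 1).
The p-value interval corresponds to an x interval $[x_1, x_2]$ so $P(X \geq x_1 \mid H_0) = p + \delta$ and $P(X \geq x_2 \mid H_0) = p$. So to hit the p-value interval we need $x_1 \leq X \leq x_2$ which happens with probability
$$P(x_1 \leq X \leq x_2 \mid H_0) = P(X \geq x_1 \mid H_0) - P(X \geq x_2 \mid H_0) = (p + \delta) - p = \delta$$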
What does that mean for our hypothesis testing?
We often think of small p-values as stronger evidence against the null hypothesis, but the math doesn’t really support that. Under the null distribution, a p-value of $10^{-8}$ is exactly as likely as a p-value of 0.99.
A p-value doesn’t tell you anything about the probability of the null hypothesis being true! Small or large, it doesn’t matter!
The only reason that p-values are not completely worthless is that they are not uniformly distributed under the alternative distribution. If you consider the plots above, you’ll see that we expect more high x values under $H_1$ than $H_0$ which means that if $X$ is really distributed as $H_1$ it is more likely to get a small p-value than it is under $H_0$.
Not that we consider that in any quantitative sense when deciding whether to believe in the null or the alternative hypothesis in a hypothesis test. There we just go for $H_1$ for small p-values and $H_0$ for large p-values, regardless of the distribution of p-values under $H_1$.
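A quick simulation makes the contrast concrete. The sketch below assumes the N(0,1) null and N(2,1) alternative from the earlier example; the sample size, the seed, and the helper names are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist

null = NormalDist(0.0, 1.0)
rng = random.Random(1)  # fixed seed, only so the sketch is repeatable

def simulated_p_values(true_mean, n=20000):
    """Draw n outcomes from N(true_mean, 1) and return their p-values under H0."""
    return [1.0 - null.cdf(rng.gauss(true_mean, 1.0)) for _ in range(n)]

def fraction_below(p_values, threshold):
    """Fraction of the simulated p-values that fall below the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)

under_h0 = simulated_p_values(0.0)  # outcomes really come from H0
under_h1 = simulated_p_values(2.0)  # outcomes really come from H1

for t in (0.05, 0.5, 0.95):
    print(t, round(fraction_below(under_h0, t), 3), round(fraction_below(under_h1, t), 3))
```

Under the null, the fraction of p-values below each threshold should land close to the threshold itself, which is what uniformity means. Under the alternative, the p-values pile up near zero.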
Do we really never care about the alternative hypothesis?
By now I might have given you the impression that we never, ever, care about $H_1$. In the spirit of absolute honesty, I should say that, while we completely ignore it when testing it against $H_0$, we do care about it when setting up an experiment.
When we set up an experiment, we do care about the alternative hypothesis. At least we should, if we want to avoid wasting our time on the experiment.
We do what is called a power study, to figure out our chance of rejecting $H_0$ assuming this time that the alternative hypothesis $H_1$ is true. Remember that the p-values are not uniformly distributed if $H_1$ is the true distribution for $X$, so we can consider the probability of getting a significant p-value when $H_1$ is true, that is we can figure out what the probability is of choosing $H_1$ when it is in fact the true distribution.
We use this to design our experiment. Not that we can do much about true underlying distributions (assuming such exists), but we can tweak our distributions $H_0$ and $H_1$ to give us a reasonable chance of choosing the right one after the experiment. If we really do several experiments, but average the outcomes, we decrease the variance in the outcomes and thus reduce the overlap of the two hypotheses. This way we can pick the number of samples we need to obtain any given success probability of choosing $H_1$ assuming it is true.
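As a rough sketch of such a power calculation, assume a one-sided test on the average of $n$ independent outcomes, each with standard deviation 1, and a hypothetical difference `effect` between the means of the two hypotheses. The function names and default values below are illustrative and not taken from the tutorial.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def power(n, effect=2.0, alpha=0.05):
    """Chance of rejecting H0 when H1 is true and we average n outcomes."""
    critical_z = std_normal.inv_cdf(1.0 - alpha)
    # Under H1 the standardised average is N(effect * sqrt(n), 1).
    return 1.0 - std_normal.cdf(critical_z - effect * sqrt(n))

def samples_needed(target, effect=2.0, alpha=0.05):
    """Smallest n whose power reaches the target."""
    n = 1
    while power(n, effect, alpha) < target:
        n += 1
    return n

print("power with one outcome:", round(power(1), 3))
print("outcomes needed for 80% power:", samples_needed(0.8))
print("same, with a smaller effect of 0.5:", samples_needed(0.8, effect=0.5))
```

Averaging shrinks the variance of the test statistic by a factor of $n$, which is why the overlap between the hypotheses, and with it the required sample size, depends so strongly on the assumed effect.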

So we can use the alternative hypothesis to design our study. After we design our study, however, we completely forget about $H_1$ and test the hypothesis based only on p-values; p-values that are only based on $H_0$.
Prior probabilities
Getting back to hypothesis testing, let’s say we have conducted our experiment and obtained the value $x$. Our two alternatives, from which we have to choose, are whether the value was obtained from an $H_0$ or an $H_1$ distribution.
If we use our p-value approach, we have a threshold, say 5%, so if $H_0$ is true we know we will get it right 95% of the time and wrong 5% of the time. If, on the other hand, $H_1$ is true, we have done our power analysis and found that with some probability $\beta$ we choose correctly and with $1 - \beta$ we choose incorrectly.
What is the probability that we choose correctly?
You won’t be able to answer that, I’m afraid. Essential information is missing. Information that we intentionally ignore, because that is what you do if you take this approach to hypothesis testing.
We completely ignore the probability of $H_0$ or $H_1$ being true a priori. That is, what is the probability that $x$ was the outcome of an $H_0$ or $H_1$ process in the first place.
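Let us denote the “outcome” that $x$ is drawn from $H_0$ simply as $H_0$ and similarly let $H_1$ denote that $x$ was drawn from $H_1$. Let $A_0$ denote the outcome that we accept $H_0$, and $R_0$ the outcome that we reject it. Now, to get the probability of correctly identifying the distribution after observing an outcome, we can use Bayes’ formula:
$$P(H_0 \mid A_0) = \frac{P(A_0 \mid H_0)\,P(H_0)}{P(A_0)}$$
where
$$P(A_0) = P(A_0 \mid H_0)\,P(H_0) + P(A_0 \mid H_1)\,P(H_1)$$
If we use a 5% critical value we have $P(A_0 \mid H_0) = 0.95$, and let’s assume that our power analysis gave us $P(A_0 \mid H_1) = 0.2$ (so we have 80% chance of choosing $H_1$ when it is true and 20% of getting it wrong when it is true).
This gives us
$$P(H_0 \mid A_0) = \frac{0.95\,P(H_0)}{0.95\,P(H_0) + 0.2\,(1 - P(H_0))}$$
using the assumption that either $H_0$ or $H_1$ is true, so $P(H_1) = 1 - P(H_0)$.
Similarly, we can obtain the probability that $H_0$ is in fact false when we reject it:
$$P(H_1 \mid R_0) = \frac{0.8\,(1 - P(H_0))}{0.8\,(1 - P(H_0)) + 0.05\,P(H_0)}$$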

These two probabilities vary a lot as a function of $P(H_0)$, so it is not surprising that you cannot answer the question without knowing the prior probability.
The overall success rate doesn’t vary quite as much because it is limited by construction. If no observations are ever drawn from $H_1$ we will reject 5%, so our success rate is 95%. If no observations are ever drawn from $H_0$ we will choose $H_1$ for 80% of them (the success rate from our power analysis).
For values in between the two extremes, we have the formula, using that we choose correctly whenever we combine accepting $H_0$ with $H_0$ being true and rejecting $H_0$ with $H_1$ being true:
$$P(\text{success}) = 0.95\,P(H_0) + 0.8\,(1 - P(H_0))$$
Notice that our success rate is always between 80% and 95%. If $P(H_0) > 0.95$ we would be better off always choosing $H_0$ than using the p-value strategy.
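To see how strongly these numbers depend on the prior, here is a minimal sketch that tabulates them for a few values of the prior probability of the null. It assumes the 95% acceptance rate under the null and the 80% power used above; the constant and function names are illustrative.

```python
ACCEPT_GIVEN_H0 = 0.95  # from the 5% critical value in the text
REJECT_GIVEN_H1 = 0.80  # power assumed in the text

def summary(prior_h0):
    """Return (P(H0 | accept), P(H1 | reject), overall success rate)."""
    prior_h1 = 1.0 - prior_h0
    accept = ACCEPT_GIVEN_H0 * prior_h0 + (1.0 - REJECT_GIVEN_H1) * prior_h1
    reject = 1.0 - accept
    h0_given_accept = ACCEPT_GIVEN_H0 * prior_h0 / accept
    h1_given_reject = REJECT_GIVEN_H1 * prior_h1 / reject
    success = ACCEPT_GIVEN_H0 * prior_h0 + REJECT_GIVEN_H1 * prior_h1
    return h0_given_accept, h1_given_reject, success

for prior in (0.1, 0.5, 0.9, 0.99):
    print(prior, [round(value, 3) for value in summary(prior)])
```

With a prior of 0.99 on the null, most rejections are false alarms, and the overall success rate stays below the 99% we would get by always choosing the null.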
The reason we ignore the prior probabilities is philosophical rather than mathematical. It is the old argument between Frequentist and Bayesian statistics. No one disagrees about Bayes’ formula, that is pure math, but some strongly disagree on whether you can put probabilities, especially prior probabilities, on our hypotheses.
Not that I think anyone would object to the analysis of successes above; the disagreement is whether we can use probabilities of hypotheses in the actual hypothesis test.
The prior probabilities are not based on observed data, and this is why some people find them dodgy to use. After all, it is little more than gut feeling that lets us choose them. Well, gut feeling and experience.
In my view, avoiding them is just weaseling out of an important problem of hypothesis testing. Prior probabilities are already implicitly there in the hypothesis test, just with default (and probably very wrong) values.
The critical value we use to determine the threshold for p-values can be thought of as an implicit prior weight on the hypothesis. Remember that I wrote above that we implicitly prefer the null hypothesis in the hypothesis test — to the point that we completely ignore the alternative hypothesis — by accepting values that fall in the 95% probability mass of this distribution and rejecting only 5% of the probability mass?
Well, it is not always that we prefer the null hypothesis this way. We really only “prefer” it if we choose the null hypothesis more often than we should. Whether this is the case depends on the prior probability. If $P(H_0)$ is close to 1, pretty much all observations will be from the null distribution, but we are still going to reject 5% of these.
Even if $P(H_0)$ is close to one half, so the two hypotheses are equally likely, whether we accept too many or too few observations depends on the overlap of the two hypotheses. If they overlap significantly, we are going to accept many $H_1$ observations as $H_0$, while if the overlap is very small, we are going to reject too many $H_0$ observations (since observations close to the critical value will all be $H_0$ and practically never $H_1$).

Ignoring the distributions and just using a standard 5% or 1% p-value is almost guaranteed to be a sub-optimal choice.
Choosing optimal thresholds
Since the default thresholds are unlikely to be optimal, couldn’t we use the power analysis to pick an optimal threshold?
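Indeed we can. It just requires that we know $P(H_0)$ and $P(H_1)$. For any critical value, $c$, we have $P(X < c \mid H_0)$ and $P(X \geq c \mid H_1)$, so we can compute our success rate as
$$P(\text{success}) = P(X < c \mid H_0)\,P(H_0) + P(X \geq c \mid H_1)\,P(H_1)$$
and optimise that with respect to $c$.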
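The optimisation can be sketched with a simple grid search. This again assumes the N(0,1) null and N(2,1) alternative from the earlier example, with the prior probability of the null passed in as a parameter; the grid bounds and step are arbitrary illustrative choices.

```python
from statistics import NormalDist

h0 = NormalDist(0.0, 1.0)  # null distribution from the example
h1 = NormalDist(2.0, 1.0)  # alternative distribution from the example

def success_rate(c, prior_h0):
    """Chance of choosing correctly when we accept H0 below c and reject above."""
    return h0.cdf(c) * prior_h0 + (1.0 - h1.cdf(c)) * (1.0 - prior_h0)

def best_threshold(prior_h0, lo=-5.0, hi=7.0, steps=12000):
    """Grid search for the critical value with the highest success rate."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(grid, key=lambda c: success_rate(c, prior_h0))

for prior in (0.5, 0.9):
    c = best_threshold(prior)
    print(prior, round(c, 2), round(success_rate(c, prior), 3),
          round(success_rate(1.645, prior), 3))
```

For equally likely hypotheses the best threshold lands near 1, where the two densities cross, rather than at the conventional 1.64. As the prior on the null grows, the best threshold moves to the right.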
There are still two problems with this approach.
First, we are picking a single threshold to choose between the two hypotheses, but that might not be the optimal approach. It is not if, for example, $H_1$ is most likely to the left of one critical value, and then again to the right of another critical value.

The second problem concerns the evidence for or against the two hypotheses, after we conduct our experiment.
If we conduct our experiment and get the outcome $x$, with a p-value, how do we interpret the p-value as strong or weak evidence for either of the hypotheses? Choosing the optimal critical value means that we have optimised our chance of making the right decision, but after our experiment we cannot directly interpret the evidence for or against.
Ideally, we want to know the probability of the hypotheses, taking the evidence into account, that is we want to know $P(H_0 \mid x)$ and $P(H_1 \mid x)$, but the p-value is not either of those two. It is something else; something that we cannot directly interpret.
Bayesian hypothesis testing
If we are happy to use $P(H_0)$ and $P(H_1)$ to optimise our success rate, we have already broken the sacret rule of the frequentists, so we might as well go for the full Monty.
With a fully Bayesian approach, we can get the probabilities of the hypotheses a posteriori, that is, after our observations.
Since $P(H_0 \mid x) + P(H_1 \mid x) = 1$ we just need to work out $P(H_0 \mid x)$.
We get that from Bayes’ rule once more:
$$P(H_0 \mid x) = \frac{P(x \mid H_0)\,P(H_0)}{P(x)}$$
where
$$P(x) = P(x \mid H_0)\,P(H_0) + P(x \mid H_1)\,P(H_1)$$
Ok, just for the nitpickers: here I’m using $P(\cdot)$ both as probabilities and densities. I know that, I just couldn’t be bothered to introduce separate notation for densities. Just think of a small interval $[x, x + \delta]$ (which is more sensible since we never measure $x$ with absolute accuracy anyway) and you should be fine…
In a Bayesian hypothesis test we will typically not do exactly this, but work with odds instead. We talk about the posterior odds $P(H_1 \mid x) / P(H_0 \mid x)$ that we can get as
$$\frac{P(H_1 \mid x)}{P(H_0 \mid x)} = \frac{P(x \mid H_1)}{P(x \mid H_0)} \cdot \frac{P(H_1)}{P(H_0)}$$
where $P(x \mid H_1) / P(x \mid H_0)$ is called the Bayes’ factor and $P(H_1) / P(H_0)$ the prior odds.
Whenever the posterior odds is greater than one, we should favour $H_1$ and whenever it is smaller than one, we should favour $H_0$. Since it can directly be interpreted as odds, we even have a quantitative measure of how strong the evidence is, for or against.
You can interpret the Bayes’ factor as the evidence the observed data brings to the table, for or against $H_1$. The prior odds capture how likely we think it is that one or the other of the hypotheses is true in general (before we see the data).
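As a small worked sketch, the Bayes factor and posterior odds can be computed for the running N(0,1) versus N(2,1) example. The orientation used here, odds of the alternative over the null, is an assumption of this sketch, as are the function names.

```python
from statistics import NormalDist

h0 = NormalDist(0.0, 1.0)  # null distribution from the example
h1 = NormalDist(2.0, 1.0)  # alternative distribution from the example

def bayes_factor(x):
    """Density of x under H1 divided by its density under H0."""
    return h1.pdf(x) / h0.pdf(x)

def posterior_odds(x, prior_h0):
    """Posterior odds of H1 over H0: Bayes factor times prior odds."""
    prior_odds = (1.0 - prior_h0) / prior_h0
    return bayes_factor(x) * prior_odds

def posterior_h1(x, prior_h0):
    """Posterior probability of H1, recovered from the posterior odds."""
    odds = posterior_odds(x, prior_h0)
    return odds / (1.0 + odds)

for prior in (0.5, 0.99):
    print(prior, round(bayes_factor(1.64), 3),
          round(posterior_odds(1.64, prior), 4),
          round(posterior_h1(1.64, prior), 4))
```

An outcome of 1.64 sits right at the conventional 5% threshold, yet its Bayes factor is only about 3.6 in favour of the alternative. With a prior of 0.99 on the null, the posterior odds still favour the null.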
You might not feel comfortable with the prior odds. You have no data to estimate these odds from. It is a subjective measure of how likely we think the hypotheses are, and different people might have different views on this.
Of course, once you have the Bayes’ factor from the data, people are free to use different prior odds to get the posterior odds. You still have a quantitative measure of the evidence for or against the hypotheses. It just depends on the prior belief in the two.
As I have argued above, the prior odds are important when deciding whether you believe in $H_0$ or $H_1$. If you ignore them, and use a default p-value, you just implicitly make an assumption about this.
Our p-value might give us too many false positives or too many false negatives, but we don’t know unless we consider the prior odds.
This is especially important when $P(H_0) \gg P(H_1)$. Here a traditional p-value of 5% will never be a sensible choice, and you do need to consider how strong evidence you need from the data before you believe in the alternative hypothesis.
February 13th, 2009 at 9:55 pm
[...] just saw a great quote that reminds me of the post on p-values I wrote a few days [...]
February 23rd, 2009 at 1:04 pm
Interesting and well written! I come here late from Panda’s Thumb for another post, but saw this. As regards statistics I’m rather self-taught myself, but I do enjoy this topic.
As far as I understand this, bayesian methods are great tools to contingent learning about systems and to model them (say phylogenies), but as regards theory testing to get to firm knowledge not helping understanding. It naively seems to me that bayesian statistics lives in a world where everything is variables subject to likely change, perhaps even under an experiment, or in the next minute or next room; and models subject to likely existence. While frequentist statistics acknowledge different time scales for change (parameters vs variables) and model rejection.
This is IMHO why one must use previously standardized limits for rejection and don’t use test values for something they aren’t constructed for. (Say care about failed hypotheses and optimal thresholds.) Indeed, as opposed to the post I would be troubled if p-values were more informative than what is required for the actual test. Then I suspect the method would be wrongly constructed, or at least sub-optimal for its purpose. (Isn’t that what degrees of freedom are used for?)
Btw, I’m not entirely sure the rejection of prior probabilities is entirely philosophical in science. I seem to remember the physicist Sean Carroll having a long discussion concerning this on bloggingheads.
IIRC Carroll claims that a problem with, I think foremost, the classical Copenhagen interpretation is when it is combined with the idea that a quantum observable exists prior to the observation. (Heh, I even found it on Wikipedia: Counterfactual definiteness!) If one rejects this one doesn’t run into the conceptual problems with preserving local realism that Einstein did which made him propose the Einstein-Podolsky-Rosen paradox.
So perhaps one can say that as far as physics is concerned the question is resolved, statistics must in some cases be based on measures of observations, and specifically you can’t always use a prioris. (Or you have to reject realism. All this subject to being a theoretical concern, of course. If you can wrap your head around conflicting and/or non-axiomatic methods, they can work too.)
Whether you call this type of statistics philosophically “true”, factually correct, or simply more general, seems to me OTOH more of a philosophical concern.
Secret or sacred? [A priori I could probably say: both?!] LOL!
February 23rd, 2009 at 4:52 pm
Hi Torbjörn, thanks for your comments.
Two things, though:
1) the degrees of freedom are really just a way of parameterising a distribution. It is a way of fitting the null distribution of, say, a likelihood ratio test to the difference in degrees of freedom. It doesn’t change how informative a p-value is, in any way.
2) It is absolutely true that a threshold should be chosen a priori, not a posteriori, for a proper test. But this is also the case for a Bayesian test, where the prior odds should be chosen a priori before you test.
Both frequentist and Bayesian tests essentially work by picking a threshold for a test measure and then choosing one model over another based on which side of the threshold the value falls on.
The difference, however, is that for Bayes factors you can also interpret the test measure as degrees of evidence, something you cannot do with the p-value.