The problem with p-values

In Matthew Stephens’ tutorial at APBC this January, he spent a few slides arguing for Bayes factors and against p-values.  I’ve had discussions about this with statisticians in the past, but never really had strong enough arguments.  For me it is more of a gut feeling that BFs give you a quantitative measure of the support for two alternative hypotheses, while p-values 1) a priori favour one hypothesis over the other and 2) (similarly but worse) completely ignore the alternative hypothesis.

After the tutorial, I now have some stronger arguments, and I’ve been thinking about it a bit since I got back and decided to write them down here.

If you are interested in the tutorial, you can get the outline and slides here:

Disclaimer: I’m not a statistician (or even mathematician); I’m a computer scientist with very little schooling in statistics.  I use statistics a lot in my work, but I am mainly self-taught, so take this for what it’s worth, and feel free to comment.

What is a p-value?

Matthew defines p-values as:

A p value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.

which is roughly how it is usually defined, so no controversy here.

In “math”, if you have a stochastic variable \(X\) and a value \(x\), then the (one-sided) p-value for \(x\) is \(P(X\geq x)\).

Again, there is nothing controversial here.  The math is what it is, nothing more and nothing less.  The problem comes when we start to interpret it.

When we do statistics, we do not have a single distribution for \(X\).  It is only in math that stochastic variables have nice, known, distributions.  In real life, \(X\) can come out as just about anything.

Saying that \(X\) can be anything of course doesn’t help us when we need to interpret data, so we do some mathematical modelling and assume that it has a certain distribution and check that this distribution looks “good enough for jazz” to describe \(X\).  We assume as little about \(X\) as is reasonable, and call it the null model (or null distribution) \(P_0\) for \(X\).

Then the p-value for an outcome \(x\) is really \(P_0(X\geq x)\).
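As a minimal sketch of this definition, here is how \(P_0(X\geq x)\) could be computed in Python.  The standard normal null model and the function name `one_sided_p_value` are assumptions made purely for illustration; they are not from the tutorial.

```python
from statistics import NormalDist  # standard library, Python 3.8+


def one_sided_p_value(x, null=NormalDist(mu=0.0, sigma=1.0)):
    """Return P_0(X >= x) under the assumed null distribution."""
    return 1.0 - null.cdf(x)


print(one_sided_p_value(0.0))  # 0.5: half the null mass lies above the mean
print(one_sided_p_value(2.0))  # roughly 0.023
```

Note that only the null model enters the computation; nothing about an alternative is needed.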

When we do statistics, we want to test if the null model is true, that is, if \(X\) really follows the \(P_0\) distribution well enough for our purpose (it never really will have a nice distribution, but it might be good enough).  So we compare the outcome of \(X\) assuming it could be one of two distributions, our null distribution \(P_0\) or an alternative distribution \(P_1\).

Hypothesis testing

What we actually do is to perform an experiment to get an outcome of \(X\), call it \(\hat{x}\), then compute \(P_0(X\geq\hat{x})\) and if this value is small enough, usually below 0.05 or 0.01, then we reject \(P_0\) and conclude that \(X\) is probably distributed as \(P_1\).

As an example, assume \(P_0\) is a normal distribution with mean 0 and standard deviation 1.  With a one-sided 5% threshold, the critical value for \(\hat{x}\) is about 1.64.  Below that, we accept the null hypothesis and above that we reject it.
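The critical value can be recovered from the null distribution's quantile function.  This is a small sketch assuming a one-sided test; the variable names are mine.

```python
from statistics import NormalDist

null = NormalDist(mu=0.0, sigma=1.0)

# The one-sided critical value is the point with 95% of the null mass below it.
critical_5 = null.inv_cdf(1.0 - 0.05)
critical_1 = null.inv_cdf(1.0 - 0.01)

print(round(critical_5, 3))  # 1.645
print(round(critical_1, 3))  # 2.326
```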

In the plot below, if \(X\) truly is distributed as \(P_0\) we would accept an outcome with a probability that corresponds to the blue area, 95% of the total probability, and we would reject an outcome with a probability that corresponds to the orange area, 5% of the total probability.

This brings me to my two points at the top.

Notice that \(P_1\) isn’t used when testing if \(X\) is distributed as \(P_0\) or \(P_1\).  We only use \(P_0\) to make that decision.  We just prefer \(P_0\) as long as it is likely that \(X\) follows that distribution — meaning that if it does we want \(\hat{x}\) in the low 95% or 99% probability range under \(P_0\) — but we don’t consider the probability of \(\hat{x}\) under the alternative distribution \(P_1\).

Let us add an alternative distribution \(P_1\), say a N(2,1) distribution, to the plot:

If the null distribution is the true distribution for \(X\) we would still accept \(P_0\) 95% of the time and reject it 5% of the time, but if \(P_1\) is the true distribution, then we would reject \(P_1\) for what amounts to the orange area and accept \(P_1\) for what amounts to the blue area.

Is it reasonable to reject all the outcomes in the orange area?  It is not even the case that \(P_0\) is the most likely in all of that area.  In the range from 1 to 1.64, we expect more outcomes from \(P_1\) (the blue plus the orange area in the plot below) than from \(P_0\) (the orange area below).

If I were a betting man, I probably would bet on \(P_0\) to the left of 1 and on \(P_1\) to the right of 1.
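The claim about the range from 1 to 1.64 is easy to check numerically.  The sketch below assumes the two example distributions, N(0,1) and N(2,1), and compares both the densities at 1 and the probability mass each distribution puts on that range.

```python
from statistics import NormalDist

p0 = NormalDist(mu=0.0, sigma=1.0)  # null distribution
p1 = NormalDist(mu=2.0, sigma=1.0)  # alternative distribution
c = p0.inv_cdf(0.95)                # one-sided 5% critical value

# The two densities cross at x = 1, halfway between the means.
print(p0.pdf(1.0), p1.pdf(1.0))

# Probability mass each distribution puts on the interval [1, c].
mass_p0 = p0.cdf(c) - p0.cdf(1.0)
mass_p1 = p1.cdf(c) - p1.cdf(1.0)
print(round(mass_p0, 3), round(mass_p1, 3))  # about 0.109 and 0.203
```

So in that range an outcome is roughly twice as probable under \(P_1\) as under \(P_0\), and yet the test accepts \(P_0\) there.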

Mind you, it is not completely unreasonable to a priori prefer \(P_0\).  We usually pick \(P_0\) to be the most parsimonious hypothesis, so we do want to prefer it unless evidence goes against it.  So in a sense, there is nothing wrong with a priori preferring one hypothesis over another, but does it make sense to completely ignore the alternative hypothesis when deciding if it is less or more likely to be true than the null hypothesis?

I’ll get back to this in “Bayesian hypothesis testing” below, but there are a few more points I want to make first…

P-values are uniformly distributed; they don’t tell you all that much…

The title really says it all.  P-values are uniformly distributed (under the null hypothesis).

The outcomes are not.  It is not that each outcome \(\hat{x}\) is equally likely under \(P_0\), but the distribution of the p-values of the outcomes is uniform.

To see this, consider the probability of having a p-value in a small range \(\left[p,p+\Delta p\right]\).  That p-values are uniform means that the probability of hitting this interval is \(\Delta p\) (since the full range of p-values is 0 to 1).

The p-value interval corresponds to an x interval \(\left[x-\Delta x,x\right]\) so \(p=P(X\geq x)\) and \(p+\Delta p=P(X\geq x-\Delta x)\).  So to hit the p-value interval we need \(\hat{x}\in[x-\Delta x,x]\) which happens with probability

$$\begin{aligned}
P(x-\Delta x \leq X \leq x) &= P(X \geq x-\Delta x)-P(X \geq x)\\
&= \left(p+\Delta p\right) - p = \Delta p
\end{aligned}$$
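The same result can be seen in a quick simulation.  This is only an illustrative sketch: it assumes a standard normal null model, draws outcomes from it, and counts how many of the resulting p-values land in each of ten equally wide bins.  The seed and sample size are arbitrary choices.

```python
import random
from statistics import NormalDist

rng = random.Random(1)               # fixed seed so the run is repeatable
null = NormalDist(mu=0.0, sigma=1.0)
n = 100_000

# Draw outcomes from the null model and convert each to a one-sided p-value.
p_values = [1.0 - null.cdf(rng.gauss(0.0, 1.0)) for _ in range(n)]

# Fraction of p-values falling in each of the bins [0, 0.1), ..., [0.9, 1].
counts = [0] * 10
for p in p_values:
    counts[min(int(p * 10), 9)] += 1
fractions = [count / n for count in counts]

print([round(f, 3) for f in fractions])  # every entry should be close to 0.1
```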

What does that mean for our hypothesis testing?

We often think of small p-values as stronger evidence against the null hypothesis, but the math doesn’t really support that.  Under the null distribution, a p-value in a small interval around \(10^{-8}\) is exactly as likely as one in an equally wide interval around 0.99.

A p-value doesn’t tell you anything about the probability of the null hypothesis being true!  Small or large, it doesn’t matter!

The only reason that p-values are not completely worthless is that they are not uniformly distributed under the alternative distribution.  If you consider the plots above, you’ll see that we expect more high x values under \(P_1\) than \(P_0\) which means that if \(X\) is really distributed as \(P_1\) it is more likely to get a small p-value than it is under \(P_0\).

Not that we consider that in any quantitative sense when deciding whether to believe in the null or the alternative hypothesis in a hypothesis test.  There we just go for \(P_1\) for small p-values and \(P_0\) for large p-values, regardless of the distribution of p-values under \(P_1\).
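To make the contrast concrete, the probability of seeing a p-value at or below some level can be computed in closed form under either distribution.  The sketch assumes the N(0,1) null and N(2,1) alternative from the example; `prob_p_value_below` is a name I made up.

```python
from statistics import NormalDist

p0 = NormalDist(mu=0.0, sigma=1.0)  # null distribution
p1 = NormalDist(mu=2.0, sigma=1.0)  # alternative distribution


def prob_p_value_below(level, true_dist, null=p0):
    """Probability that the one-sided p-value (computed under `null`)
    is at most `level` when outcomes really come from `true_dist`."""
    c = null.inv_cdf(1.0 - level)   # outcomes above c give p-values below level
    return 1.0 - true_dist.cdf(c)


for level in (0.01, 0.05, 0.5):
    print(level,
          round(prob_p_value_below(level, p0), 3),   # equals level: uniform
          round(prob_p_value_below(level, p1), 3))   # larger: skewed to small p
```

Under \(P_0\) the probability is just the level itself, while under \(P_1\) about 64% of outcomes give a p-value below 0.05.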

Do we really never care about the alternative hypothesis?

By now I might have given you the impression that we never, ever, care about \(P_1\).  In the spirit of absolute honesty, I should say that, while we completely ignore it when testing it against \(P_0\), we do care about it when setting up an experiment.

When we set up an experiment, we do care about the alternative hypothesis.  At least we should, if we want to avoid wasting our time on the experiment.

We do what is called a power study, to figure out our chance of rejecting \(P_0\) assuming this time that the alternative hypothesis \(P_1\) is true.  Remember that the p-values are not uniformly distributed if \(P_1\) is the true distribution for \(X\), so we can consider the probability of getting a significant p-value when \(P_1\) is true, that is, we can figure out what the probability is of choosing \(P_1\) when it is in fact the true distribution.

We use this to design our experiment.  Not that we can do much about true underlying distributions (assuming such exists), but we can tweak our distributions \(P_0\) and \(P_1\) to give us a reasonable chance of choosing the right one after the experiment.  If we really do several experiments, but average the outcomes, we decrease the variance in the outcomes and thus reduce the overlap of the two hypotheses.  This way we can pick the number of samples we need to obtain any given success probability of choosing \(P_1\) assuming it is true.
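A power study along these lines might look as follows.  This is a sketch under stated assumptions: both hypotheses are normal with a known common standard deviation, the test statistic is the mean of \(n\) independent observations, and the test is one-sided at the 5% level.  The function names and default parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_for_sample_size(n, mu0=0.0, mu1=2.0, sigma=1.0, alpha=0.05):
    """Chance of rejecting P_0 when P_1 is true, testing the mean of n draws."""
    se = sigma / sqrt(n)                  # averaging shrinks the spread
    null_mean = NormalDist(mu0, se)
    alt_mean = NormalDist(mu1, se)
    c = null_mean.inv_cdf(1.0 - alpha)    # critical value for the sample mean
    return 1.0 - alt_mean.cdf(c)


def smallest_sample_size(target_power, **kwargs):
    """Smallest n whose power reaches the target (simple linear search)."""
    n = 1
    while power_for_sample_size(n, **kwargs) < target_power:
        n += 1
    return n


print(round(power_for_sample_size(1), 3))      # about 0.639 for a single draw
print(smallest_sample_size(0.80))              # 2 when the means are 2 apart
print(smallest_sample_size(0.80, mu1=0.5))     # 25 when they are 0.5 apart
```

The closer the two hypotheses are, the more samples we need before the averaged distributions stop overlapping.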

So we can use the alternative hypothesis to design our study. After we design our study, however, we completely forget about \(P_1\) and test the hypothesis based only on p-values; p-values that are only based on \(P_0\).

Prior probabilities

Getting back to hypothesis testing, let’s say we have conducted our experiment and obtained the value \(\hat{x}\).  Our two alternatives, from which we have to choose, are whether the value was obtained from a \(P_0\) or a \(P_1\) distribution.

If we use our p-value approach, we have a threshold, say 5%, so if \(P_0\) is true we know we will get it right 95% of the time and wrong 5% of the time.  If, on the other hand, \(P_1\) is true, we have done our power analysis and found that with some probability \(\beta\) we choose correctly and with \(1-\beta\) we choose incorrectly.

What is the probability that we choose correctly?

You won’t be able to answer that, I’m afraid.  Essential information is missing.  Information that we intentionally ignore, because that is what you do if you take this approach to hypothesis testing.

We completely ignore the probability of \(P_0\) or \(P_1\) being true a priori.  That is, what is the probability that \(\hat{x}\) was the outcome of a \(P_0\) or a \(P_1\) process in the first place?

Let us denote the “outcome” that \(X\) is drawn from \(P_0\) as \(H_0\) and similarly let \(H_1\) denote that \(X\) was drawn from \(P_1\).  Let \(A_0\) denote the outcome that we accept \(P_0\). Now, to get the probability of correctly identifying the distribution after observing an outcome, we can use Bayes’ formula:

$$P(H_0\,|\,A_0) = \frac{P(A_0\,|\,H_0)\cdot{}P(H_0)}{P(A_0)}$$


$$P(A_0) = P(A_0\,|\,H_0)\cdot{}P(H_0)+P(A_0\,|\,H_1)\cdot{}P(H_1)$$

If we use a 5% critical value we have \(P(A_0\,|\,H_0)=0.95\), and let’s assume that our power analysis gave us \(P(A_0\,|\,H_1)=0.20\) (so we have 80% chance of choosing \(P_1\) when it is true and 20% of getting it wrong when it is true).

This gives us

$$P(H_0\,|\,A_0) = \frac{0.95\cdot{}P(H_0)}{0.95\cdot{}P(H_0)+0.20\cdot{}(1-P(H_0))}$$

using the assumption that either \(P_0\) or \(P_1\) is true, so \(P(H_1)=1-P(H_0)\).

Similarly, we can obtain the probability of rejecting \(H_0\) when it is in fact false:

$$P(H_1\,|\,A_1) = \frac{P(A_1\,|\,H_1)\cdot{}P(H_1)}{P(A_1\,|\,H_0)\cdot{}P(H_0)+P(A_1\,|\,H_1)\cdot{}P(H_1)} = \frac{0.80\cdot{}(1-P(H_0))}{0.05\cdot{}P(H_0)+0.80\cdot{}(1-P(H_0))}$$

These two probabilities vary a lot as a function of \(P(H_0)\), so it is not surprising that you cannot answer the question without knowing the prior probability.
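To get a feel for how much they vary, the two formulas can be evaluated for a few priors.  The sketch plugs in the 5% critical value and the 80% power assumed above; the function names are my own.

```python
def prob_h0_given_accept(prior_h0, accept_given_h0=0.95, accept_given_h1=0.20):
    """P(H_0 | A_0) from Bayes' formula."""
    numerator = accept_given_h0 * prior_h0
    return numerator / (numerator + accept_given_h1 * (1.0 - prior_h0))


def prob_h1_given_reject(prior_h0, reject_given_h0=0.05, reject_given_h1=0.80):
    """P(H_1 | A_1) from Bayes' formula."""
    numerator = reject_given_h1 * (1.0 - prior_h0)
    return numerator / (reject_given_h0 * prior_h0 + numerator)


for prior in (0.10, 0.50, 0.90, 0.99):
    print(prior,
          round(prob_h0_given_accept(prior), 3),
          round(prob_h1_given_reject(prior), 3))
```

With \(P(H_0)=0.99\), for instance, only about 14% of the rejections are correct, even though the test has 80% power.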

The overall success rate doesn’t vary quite as much because it is limited by construction.  If all observations are drawn from \(P_0\) we will reject 5% of them, so our success rate is 95%.  If all observations are drawn from \(P_1\) we will correctly reject \(P_0\) for 80% of them (the success rate from our power analysis).

For values in between the two extremes, we have the formula, using that we choose correctly whenever we combine \(H_0\) with \(A_0\) and \(H_1\) with \(A_1\):

$$\begin{aligned}
P(H_0,A_0)+P(H_1,A_1) &= P(A_0\,|\,H_0)\cdot{}P(H_0)+P(A_1\,|\,H_1)\cdot{}P(H_1)\\
&= 0.95\cdot{}P(H_0)+0.80\cdot{}(1-P(H_0))
\end{aligned}$$

Notice that our success rate is always between 80% and 95%.  If \(P(H_0)=0.99\) we would be better off always choosing \(H_0\) than using the p-value strategy.
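The comparison with the “always choose \(H_0\)” strategy is simple arithmetic, sketched here with the same 95% and 80% figures.  Always choosing \(H_0\) is right with probability \(P(H_0)\), so the two strategies break even where \(0.95\cdot{}P(H_0)+0.80\cdot{}(1-P(H_0))=P(H_0)\).

```python
def success_rate(prior_h0, accept_given_h0=0.95, reject_given_h1=0.80):
    """Overall chance of choosing the hypothesis that is actually true."""
    return accept_given_h0 * prior_h0 + reject_given_h1 * (1.0 - prior_h0)


# Always choosing H_0 succeeds with probability P(H_0).
print(round(success_rate(0.99), 4), "versus", 0.99)  # about 0.9485 versus 0.99

# Solving 0.80 + 0.15 * P(H_0) = P(H_0) gives the break-even prior.
break_even = 0.80 / 0.85
print(round(break_even, 3))  # about 0.941
```

So for any prior above roughly 0.94, the p-value strategy does worse than never rejecting at all.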

The reason we ignore the prior probabilities is philosophical rather than mathematical.  It is the old argument between Frequentist and Bayesian statistics.  No one disagrees about Bayes’ formula, that is pure math, but some strongly disagree on whether you can put probabilities, especially prior probabilities, on our hypotheses.

Not that I think anyone would object to the analysis of successes above; the disagreement is whether we can use probabilities of hypotheses in the actual hypothesis test.

The prior probabilities are not based on observed data, and this is why some people find them dodgy to use.  After all, it is little more than gut feeling that lets us choose them.  Well, gut feeling and experience.

In my view, avoiding them is just weaseling out of an important problem of hypothesis testing.  Prior probabilities are already implicitly there in the hypothesis test, just with default (and probably very wrong) values.

The critical value we use to determine the threshold for p-values can be thought of as an implicit prior weight on the hypothesis.  Remember that I wrote above that we implicitly prefer the null hypothesis in the hypothesis test — to the point that we completely ignore the alternative hypothesis — by accepting values that fall in the 95% probability mass of this distribution and rejecting only 5% of the probability mass?

Well, it is not always that we prefer the null hypothesis this way.  We really only “prefer” it if we choose the null hypothesis more often than we should.  Whether this is the case depends on the prior probability.  If \(P(H_0)\) is close to 1, pretty much all observations will be from the null distribution, but we are still going to reject 5% of these.

Even if \(P(H_0)\) is close to one half (so the two hypotheses are equally likely), whether we accept too many or too few observations depends on the overlap of the two hypotheses.  If they overlap significantly, we are going to accept many \(H_1\) observations as \(H_0\), while if the overlap is very small, we are going to reject too many \(H_0\) observations (since observations close to the critical value will all be \(H_0\) and practically never \(H_1\)).

Ignoring the distributions and just using a standard 5% or 1% p-value is almost guaranteed to be a sub-optimal choice.

Choosing optimal thresholds

Since the default thresholds are unlikely to be optimal, couldn’t we use the power analysis to pick an optimal threshold?

Indeed we can. It just requires that we know \(P(H_0)\) and \(P(H_1)\).  For any critical value, \(c\), we accept \(H_0\) when \(X<c\) and reject it when \(X\geq c\), so we have \(P(A_0\,|\,H_0)=P_0(X<c)\) and \(P(A_1\,|\,H_1)=P_1(X\geq c)\), and we can compute our success rate as

$$P(A_0,H_0)+P(A_1,H_1)=P_0(X<c)\cdot{}P(H_0)+P_1(X\geq c)\cdot{}P(H_1)$$

and optimise that with respect to \(c\).
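Here is one way that optimisation could be sketched, using a plain grid search rather than anything clever.  It assumes the N(0,1) and N(2,1) example distributions, and it takes “accept \(H_0\)” to mean that the outcome falls below \(c\), so the success rate is \(P_0(X<c)\cdot{}P(H_0)+P_1(X\geq c)\cdot{}P(H_1)\).  The grid bounds and function names are arbitrary choices of mine.

```python
from statistics import NormalDist

P0 = NormalDist(mu=0.0, sigma=1.0)  # null distribution
P1 = NormalDist(mu=2.0, sigma=1.0)  # alternative distribution


def success_rate_at(c, prior_h0):
    """Chance of a correct decision when we reject H_0 for outcomes >= c."""
    accept_h0_correctly = P0.cdf(c) * prior_h0
    reject_h0_correctly = (1.0 - P1.cdf(c)) * (1.0 - prior_h0)
    return accept_h0_correctly + reject_h0_correctly


def best_threshold(prior_h0, lo=-5.0, hi=8.0, steps=13_000):
    """Grid search (step 0.001) for the critical value with the best rate."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(grid, key=lambda c: success_rate_at(c, prior_h0))


for prior in (0.5, 0.9, 0.99):
    c = best_threshold(prior)
    print(prior, round(c, 2), round(success_rate_at(c, prior), 4))
```

With equal priors the best threshold sits at 1, where the two densities cross; as \(P(H_0)\) grows it moves to the right, well past the default 1.64.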

There are still two problems with this approach.

First, we are picking a single threshold to choose between the two hypotheses, but that might not be the optimal approach.  It could be, for example, that \(H_0\) is the most likely hypothesis to the left of one critical value, and then again to the right of another critical value, as can happen when the two distributions have different variances.

The second problem concerns the evidence for or against the two hypotheses, after we conduct our experiment.

If we conduct our experiment and get the outcome \(\hat{x}\), with a p-value, how do we interpret the p-value as strong or weak evidence for either of the hypotheses?  Choosing the optimal critical value means that we have optimised our chance of making the right decision, but after our experiment we cannot directly interpret the evidence for or against.

Ideally, we want to know the probability of the hypotheses, taking the evidence into account, that is we want to know \(P(H_0\,|\,\hat{x})\) and \(P(H_1\,|\,\hat{x})\), but the p-value is not either of those two.  It is something else; something that we cannot directly interpret.

Bayesian hypothesis testing

If we are happy to use \(P(H_0)\) and \(P(H_1)\) to optimise our success rate, we have already broken the sacret rule of the frequentists, so we might as well go for the full Monty.

With a fully Bayesian approach, we can get the probabilities of the hypotheses a posteriori, that is, after our observations.

Since \(P(H_1\,|\,\hat{x})=1-P(H_0\,|\,\hat{x})\) we just need to work out \(P(H_0\,|\,\hat{x})\).

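We get that from Bayes’ rule once more:

$$P(H_0\,|\,\hat{x}) = \frac{P(\hat{x}\,|\,H_0)\cdot{}P(H_0)}{P(\hat{x}\,|\,H_0)\cdot{}P(H_0)+P(\hat{x}\,|\,H_1)\cdot{}P(H_1)}$$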

Ok, just for the nitpickers: here I’m using \(P(\cdot)\) both as probabilities and densities.  I know that, I just couldn’t bother introducing separate notation for densities.  Just think of a small interval \(\hat{x}\in\left[x,x+\Delta x\right]\) — which is more sensible since we never measure \(\hat{x}\) with absolute accuracy anyway — and you should be fine…

In a Bayesian hypothesis test we will typically not do exactly this, but work with odds instead.  We talk about the posterior odds \(P(H_1\,|\,\hat{x}) / P(H_0\,|\,\hat{x})\) that we can get as

$$\frac{P(H_1\,|\,\hat{x})}{P(H_0\,|\,\hat{x})}= \frac{P(\hat{x}\,|\,H_1)}{P(\hat{x}\,|\,H_0)}\times\frac{P(H_1)}{P(H_0)}$$

where \(P(\hat{x}\,|\,H_1)/P(\hat{x}\,|\,H_0)\) is called the Bayes factor and \(P(H_1)/P(H_0)\) the prior odds.

Whenever the posterior odds are greater than one, we should favour \(H_1\) and whenever they are smaller than one, we should favour \(H_0\).  Since they can directly be interpreted as odds, we even have a quantitative measure of how strong the evidence is, for or against.

You can interpret the Bayes factor as the evidence the observed data brings to the table, for or against \(H_1\).  The prior odds capture how likely we think it is that one or the other of the hypotheses is true in general (before we see the data).
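For the running example, the Bayes factor and the posterior odds take only a few lines to compute.  The sketch assumes point hypotheses with known densities, N(0,1) for \(H_0\) and N(2,1) for \(H_1\); the function names are hypothetical.

```python
from statistics import NormalDist

P0 = NormalDist(mu=0.0, sigma=1.0)  # density of the data under H_0
P1 = NormalDist(mu=2.0, sigma=1.0)  # density of the data under H_1


def bayes_factor(x):
    """Ratio of the density of x under H_1 to its density under H_0."""
    return P1.pdf(x) / P0.pdf(x)


def posterior_odds(x, prior_h1):
    """Bayes factor times prior odds, assuming P(H_0) = 1 - P(H_1)."""
    return bayes_factor(x) * prior_h1 / (1.0 - prior_h1)


def posterior_h0(x, prior_h1):
    """P(H_0 | x), recovered from the posterior odds."""
    return 1.0 / (1.0 + posterior_odds(x, prior_h1))


print(round(bayes_factor(1.0), 3))           # 1.0: the data favour neither
print(round(bayes_factor(1.645), 2))         # about 3.63 at the 5% threshold
print(round(posterior_odds(1.645, 0.5), 2))  # the same, with even prior odds
print(round(posterior_h0(1.645, 0.5), 3))    # about 0.216
```

An outcome right at the 5% critical value thus favours \(H_1\) by odds of less than four to one when the prior odds are even.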

You might not feel comfortable with the prior odds.  You have no data to estimate these odds from.  It is a subjective measure of how likely we think the hypotheses are, and different people might have different views on this.

Of course, once you have the Bayes factor from the data, people are free to use different prior odds to get the posterior odds.  You still have a quantitative measure of the evidence for or against the hypotheses.  It just depends on the prior belief in the two.

As I have argued above, the prior odds are important when deciding whether you believe in \(H_0\) or \(H_1\).  If you ignore them, and use a default p-value, you just implicitly make an assumption about this.

Our p-value might give us too many false positives or too many false negatives, but we don’t know unless we consider the prior odds.

This is especially important when \(P(H_1)\ll P(H_0)\).  Here a traditional p-value threshold of 5% will never be a sensible choice, and you do need to consider how strong the evidence from the data must be before you believe in the alternative hypothesis.
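As a rough illustration of how much evidence a rare alternative demands, the sketch below assumes \(P(H_1)=0.01\) with the N(0,1) and N(2,1) example distributions.  It finds, by bisection, the outcome at which the posterior odds reach one, and reports the p-value of that outcome.  All names and the prior are assumptions for the sake of the example.

```python
from statistics import NormalDist

P0 = NormalDist(mu=0.0, sigma=1.0)
P1 = NormalDist(mu=2.0, sigma=1.0)


def posterior_odds(x, prior_h1):
    """Bayes factor P(x | H_1) / P(x | H_0) times the prior odds."""
    return (P1.pdf(x) / P0.pdf(x)) * prior_h1 / (1.0 - prior_h1)


def evidence_threshold(prior_h1, lo=-10.0, hi=20.0, tol=1e-9):
    """Outcome where the posterior odds equal one (they increase with x here)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if posterior_odds(mid, prior_h1) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


prior_h1 = 0.01
x_star = evidence_threshold(prior_h1)
print(round(x_star, 3))                            # about 3.298
print(1.0 - P0.cdf(x_star))                        # p-value there: below 0.001
print(round(posterior_odds(1.645, prior_h1), 3))   # about 0.037 at p = 0.05
```

At the usual 5% threshold the posterior odds are still around 27 to 1 in favour of \(H_0\); the data only tip the balance once the p-value is below roughly 0.0005.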

Author: Thomas Mailund

My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.

3 thoughts on “The problem with p-values”

  1. Interesting and well written! I come here late from Panda’s Thumb for another post, but saw this. As regards statistics I’m rather self-taught myself, but I do enjoy this topic.

    As far as I understand this, Bayesian methods are great tools for contingent learning about systems and for modelling them (say, phylogenies), but as regards theory testing to get to firm knowledge they do not help understanding. It naively seems to me that Bayesian statistics lives in a world where everything is variables subject to likely change, perhaps even under an experiment, or in the next minute or next room; and models subject to likely existence. While frequentist statistics acknowledges different time scales for change (parameters vs variables) and model rejection.

    This is IMHO why one must use previously standardized limits for rejection and don’t use test values for something they aren’t constructed for. (Say care about failed hypotheses and optimal thresholds.) Indeed, as opposed to the post I would be troubled if p-values were more informative than what is required for the actual test. Then I suspect the method would be wrongly constructed, or at least sub-optimal for its purpose. (Isn’t that what degrees of freedom are used for?)

    Btw, I’m not entirely sure the rejection of prior probabilities is entirely philosophical in science. I seem to remember the physicist Sean Carroll having a long discussion concerning this on bloggingheads.

    IIRC Carroll claims that a problem with, I think foremost, the classical Copenhagen interpretation is when it is combined with the idea that a quantum observable exists prior to the observation. (Heh, I even found it on Wikipedia: Counterfactual definiteness!) If one rejects this one doesn’t run into the conceptual problems with preserving local realism that Einstein did which made him propose the Einstein-Podolsky-Rosen paradox.

    So perhaps one can say that as far as physics is concerned the question is resolved, statistics must in some cases be based on measures of observations, and specifically you can’t always use a prioris. (Or you have to reject realism. All this subject to being a theoretical concern, of course. If you can wrap your head around conflicting and/or non-axiomatic methods, they can work too.)

    Whether you call this type of statistics philosophically “true”, factually correct, or simply more general, seems to me OTOH more of a philosophical concern.

    the sacret rule of the frequentists

    Secret or sacred? [A priori I could probably say: both?!] LOL!

  2. Hi Torbjörn, thanks for your comments.

    Two things, though:

    1) the degrees of freedom are really just a way of parameterising a distribution. It is a way of fitting the null distribution of, say, a likelihood ratio test to the difference in degrees of freedom. It doesn’t change how informative a p-value is, in any way.

    2) It is absolutely true that a threshold should be chosen a priori, not a posteriori, for a proper test. But this is also the case for a Bayesian test, where the prior odds should be chosen a priori before you test.

    Both frequentist and Bayesian tests essentially work by picking a threshold for a test measure and then choosing one model over another based on which side of the threshold the value falls on.

    The difference, however, is that for Bayes factors you can also interpret the test measure as degrees of evidence, something you cannot do with the p-value.
