The problem with p-values

In Matthew Stephens’ tutorial at APBC this January, he spent a few slides arguing for Bayes factors and against p-values.  I’ve had discussions about this with statisticians in the past, but never really had strong enough arguments.  For me it is more of a gut feeling that BFs give you a quantitative measure of the support for two alternative hypotheses, while p-values 1) a priori favour one hypothesis over the other and 2) (similarly but worse) completely ignore the alternative hypothesis.

After the tutorial, I now have some stronger arguments, and I’ve been thinking about it a bit since I got back and decided to write them down here.

If you are interested in the tutorial, you can get the outline and slides here:

Disclaimer: I’m not a statistician (or even a mathematician); I’m a computer scientist with very little schooling in statistics.  I use statistics a lot in my work, but I am mainly self-taught, so take this for what it’s worth, and feel free to comment.

What is a p-value?

Matthew defines p-values as:

A p value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.

which is roughly how it is usually defined, so no controversy here.

In “math”, if you have a stochastic variable X and a value x, then the (one-sided) p-value for x is P(X\geq x).

Again, there is nothing controversial here.  The math is what it is, nothing more and nothing less.  The problem comes when we start to interpret it.

When we do statistics, we do not have a single distribution for X.  It is only in math that stochastic variables have nice, known, distributions.  In real life, X can come out as just about anything.

Saying that X can be anything of course doesn’t help us when we need to interpret data, so we do some mathematical modelling and assume that it has a certain distribution and check that this distribution looks “good enough for jazz” to describe X.  We assume as little about X as is reasonable, and call it the null model (or null distribution) P_0 for X.

Then the p-value for an outcome x is really P_0(X\geq x).
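As a minimal sketch of that definition, assuming the null model P_0 is a standard normal distribution and using Python's standard-library `statistics.NormalDist` (the function name `one_sided_p_value` is made up for illustration):

```python
from statistics import NormalDist


def one_sided_p_value(x, null=NormalDist(mu=0.0, sigma=1.0)):
    """Return P_0(X >= x), the one-sided p-value of x under the null model."""
    return 1.0 - null.cdf(x)


# Half of the null probability mass lies at or above the mean.
print(one_sided_p_value(0.0))
# An outcome two standard deviations above the mean is much rarer under P_0.
print(one_sided_p_value(2.0))
```

Any other continuous null model with a cumulative distribution function could be swapped in; the normal distribution is only an assumption made to have something concrete to compute with.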

When we do statistics, we want to test if the null model is true, that is if X really follows the P_0 distribution well enough for our purpose (it never really will have a nice distribution, but it might be good enough).  So we compare the outcome of X assuming it could be one of two distributions, our null distribution P_0 or an alternative distribution P_1.

Hypothesis testing

What we actually do is to perform an experiment to get an outcome of X, call it \hat{x}, then compute P_0(X\geq\hat{x}) and if this value is small enough, usually below 0.05 or 0.01, then we reject P_0 and conclude that X is probably distributed as P_1.

As an example, assume P_0 is a normal distribution with mean 0 and standard deviation 1.  Then, for a one-sided test at the 5% level, the threshold for which \hat{x} we would accept as being under P_0 is about 1.64.  Below that, we accept the null hypothesis and above that we reject it.
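The threshold can be recovered numerically. This sketch assumes the same N(0,1) null model and a one-sided test, and uses the inverse cumulative distribution function from `statistics.NormalDist`; the variable names are illustrative only:

```python
from statistics import NormalDist

null = NormalDist(mu=0.0, sigma=1.0)  # the null model P_0 from the example

# The critical value c is the point with P_0(X >= c) = alpha.
critical_5 = null.inv_cdf(1.0 - 0.05)
critical_1 = null.inv_cdf(1.0 - 0.01)

print(f"5% critical value: {critical_5:.3f}")
print(f"1% critical value: {critical_1:.3f}")
```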

In the plot below, if X truly is distributed as P_0 we would accept an outcome with a probability that corresponds to the blue area, 95% of the total probability, and we would reject an outcome with a probability that corresponds to the orange area, 5% of the total probability.

This brings me to my two points at the top.

Notice that P_1 isn’t used when testing if X is distributed as P_0 or P_1.  We only use P_0 to make that decision.  We just prefer P_0 as long as it is likely that X follows that distribution — meaning that if it does we want \hat{x} in the low 95% or 99% probability range under P_0 — but we don’t consider the probability of \hat{x} under the alternative distribution P_1.

Let us add an alternative distribution P_1, say a N(2,1) distribution, to the plot:

If the null distribution is the true distribution for X we would still accept P_0 95% of the time and reject it 5% of the time, but if P_1 is the true distribution, then we would reject P_1 for what amounts to the orange area and accept P_1 for what amounts to the blue area.

Is it reasonable to reject all the outcomes in the orange area?  It is not even the case that P_0 is the most likely in all of that area.  In the range from 1 to 1.64, we expect more outcomes from P_1 (the blue plus the orange area in the plot below) than from P_0 (the orange area below).

If I were a betting man, I probably would bet on P_0 to the left of 1 and on P_1 to the right of 1.
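To check the claim about the range from 1 to 1.64, here is a small sketch that compares the two example distributions, N(0,1) for P_0 and N(2,1) for P_1. The helper names `interval_mass` and `density_ratio` are made up for this illustration:

```python
from statistics import NormalDist

p0 = NormalDist(mu=0.0, sigma=1.0)  # null distribution P_0
p1 = NormalDist(mu=2.0, sigma=1.0)  # alternative distribution P_1


def interval_mass(dist, lo, hi):
    """Probability that a draw from dist lands in the interval [lo, hi]."""
    return dist.cdf(hi) - dist.cdf(lo)


def density_ratio(x):
    """How many times more likely x is under P_1 than under P_0."""
    return p1.pdf(x) / p0.pdf(x)


mass_p0 = interval_mass(p0, 1.0, 1.64)
mass_p1 = interval_mass(p1, 1.0, 1.64)
print(f"mass on [1, 1.64] under P_0: {mass_p0:.3f}")
print(f"mass on [1, 1.64] under P_1: {mass_p1:.3f}")

# The densities cross at x = 1: below it P_0 is more likely, above it P_1 is.
for x in (0.5, 1.0, 1.5):
    print(f"x = {x}: density ratio P_1/P_0 = {density_ratio(x):.3f}")
```

With equal variances the crossing point is simply the midpoint between the two means, which is why the bet changes sides at 1.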

Mind you, it is not completely unreasonable to a priori prefer P_0.  We usually pick P_0 to be the most parsimonious hypothesis, so we do want to prefer it unless evidence goes against it.  So in a sense, there is nothing wrong with a priori preferring one hypothesis over another, but does it make sense to completely ignore the alternative hypothesis when deciding if it is less or more likely to be true than the null hypothesis?

I’ll get back to this in “Bayesian hypothesis testing” below, but there are a few more points I want to make first…

P-values are uniformly distributed; they don’t tell you all that much…

The title really says it all.  P-values are uniformly distributed (under the null hypothesis).

The outcomes are not.  It is not that each outcome \hat{x} is equally likely under P_0, but the distribution of the p-values of the outcomes is uniform.

To see this, consider the probability of having a p-value in a small range \left[p,p+\Delta p\right].  That p-values are uniform means that the probability of hitting this interval is \Delta p (since the full range of p-values is 0 to 1).

The p-value interval corresponds to an x interval \left[x-\Delta x,x\right] so p=P(X\geq x) and p+\Delta p=P(X\geq x-\Delta x).  So to hit the p-value interval we need \hat{x}\in[x-\Delta x,x] which happens with probability

\begin{array}{rcl}
P(x-\Delta x \leq X \leq x) &=& P(X \geq x-\Delta x)-P(X \geq x)\\
&=& \left(p+\Delta p\right) - p = \Delta p
\end{array}
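The same result can be checked by simulation. This is a sketch under the assumption that P_0 is N(0,1); the seed, sample size and number of bins are arbitrary choices:

```python
import random
from statistics import NormalDist

null = NormalDist(mu=0.0, sigma=1.0)
rng = random.Random(42)  # fixed seed so the run is reproducible
n = 100_000

# Draw outcomes from the null model and turn each into a one-sided p-value.
p_values = [1.0 - null.cdf(rng.gauss(0.0, 1.0)) for _ in range(n)]

# Count how many p-values fall in each of ten equally wide bins on [0, 1].
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1

fractions = [count / n for count in bins]
for i, fraction in enumerate(fractions):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {fraction:.3f}")
```

Each bin should hold roughly a tenth of the p-values, even though the outcomes themselves pile up around 0.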

What does that mean for our hypothesis testing?

We often think of small p-values as stronger evidence against the null hypothesis, but the math doesn’t really support that.  Under the null distribution, a p-value of 10^{-8} is exactly as likely as a p-value of 0.99.

A p-value doesn’t tell you anything about the probability of the null hypothesis being true!  Small or large, it doesn’t matter!

The only reason that p-values are not completely worthless is that they are not uniformly distributed under the alternative distribution.  If you consider the plots above, you’ll see that we expect more high x values under P_1 than P_0 which means that if X is really distributed as P_1 it is more likely to get a small p-value than it is under P_0.
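To put a number on this, the following sketch computes the probability of seeing a p-value at or below some level a under each distribution, assuming the N(0,1) and N(2,1) examples from the plots. The function name `prob_p_value_below` is hypothetical:

```python
from statistics import NormalDist

null = NormalDist(mu=0.0, sigma=1.0)  # P_0
alt = NormalDist(mu=2.0, sigma=1.0)   # P_1


def prob_p_value_below(a, null_dist, true_dist):
    """P(p-value <= a) when the p-value is computed under null_dist
    but the data are really drawn from true_dist."""
    critical = null_dist.inv_cdf(1.0 - a)  # p <= a exactly when x >= critical
    return 1.0 - true_dist.cdf(critical)


for a in (0.01, 0.05, 0.5):
    under_null = prob_p_value_below(a, null, null)
    under_alt = prob_p_value_below(a, null, alt)
    print(f"a = {a}: P(p <= a | P_0) = {under_null:.3f}, "
          f"P(p <= a | P_1) = {under_alt:.3f}")
```

Under P_0 the probability is always a itself, which is the uniformity again; under P_1 small p-values are much more common.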

Not that we consider that in any quantitative sense when deciding whether to believe in the null or the alternative hypothesis in a hypothesis test.  There we just go for P_1 for small p-values and P_0 for large p-values, regardless of the distribution of p-values under P_1.

Do we really never care about the alternative hypothesis?

By now I might have given you the impression that we never, ever, care about P_1.  In the spirit of absolute honesty, I should say that, while we completely ignore it when testing it against P_0, we do care about it when setting up an experiment.

When we set up an experiment, we do care about the alternative hypothesis.  At least we should, if we want to avoid wasting our time on the experiment.

We do what is called a power study, to figure out our chance of rejecting P_0 assuming this time that the alternative hypothesis P_1 is true.  Remember that the p-values are not uniformly distributed if P_1 is the true distribution for X, so we can consider the probability of getting a significant p-value when P_1 is true, that is we can figure out what the probability is of choosing P_1 when it is in fact the true distribution.

We use this to design our experiment.  Not that we can do much about true underlying distributions (assuming such exists), but we can tweak our distributions P_0 and P_1 to give us a reasonable chance of choosing the right one after the experiment.  If we really do several experiments, but average the outcomes, we decrease the variance in the outcomes and thus reduce the overlap of the two hypotheses.  This way we can pick the number of samples we need to obtain any given success probability of choosing P_1 assuming it is true.
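A power study of this kind can be sketched as follows. The assumptions are mine: outcomes are normal with known standard deviation, the test is one-sided, and we average n independent observations, so the standard error is sigma divided by the square root of n. The names `power` and `samples_needed` are illustrative:

```python
from math import sqrt
from statistics import NormalDist


def power(n, effect=2.0, sigma=1.0, alpha=0.05):
    """Probability of rejecting P_0 when the mean of n observations
    is really distributed around `effect` instead of 0."""
    standard_error = sigma / sqrt(n)
    null = NormalDist(mu=0.0, sigma=standard_error)
    alt = NormalDist(mu=effect, sigma=standard_error)
    critical = null.inv_cdf(1.0 - alpha)
    return 1.0 - alt.cdf(critical)


def samples_needed(target, effect=2.0, sigma=1.0, alpha=0.05):
    """Smallest n whose power reaches the target (requires effect > 0)."""
    n = 1
    while power(n, effect, sigma, alpha) < target:
        n += 1
    return n


print(f"power with one observation: {power(1):.3f}")
print(f"observations for 80% power, effect 2.0: {samples_needed(0.80)}")
print(f"observations for 80% power, effect 0.5: "
      f"{samples_needed(0.80, effect=0.5)}")
```

The smaller the difference between the two hypotheses, the more observations it takes to pull the two distributions of the average apart.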

So we can use the alternative hypothesis to design our study. After we design our study, however, we completely forget about P_1 and test the hypothesis based only on p-values; p-values that are only based on P_0.

Prior probabilities

Getting back to hypothesis testing, let’s say we have conducted our experiment and obtained the value \hat{x}.  Our two alternatives, from which we have to choose, are whether the value was obtained from a P_0 or a P_1 distribution.

If we use our p-value approach, we have a threshold, say 5%, so if P_0 is true we know we will get it right 95% of the time and wrong 5% of the time.  If, on the other hand, P_1 is true, we have done our power analysis and found that with some probability \beta we choose correctly and with 1-\beta we choose incorrectly.

What is the probability that we choose correctly?

You won’t be able to answer that, I’m afraid.  Essential information is missing.  Information that we intentionally ignore, because that is what you do if you take this approach to hypothesis testing.

We completely ignore the probability of P_0 or P_1 being true a priori.  That is, what is the probability that \hat{x} was the outcome of a P_0 or P_1 process in the first place.

Let us denote the “outcome” that X is drawn from P_0 as H_0 and similarly let H_1 denote that X was drawn from P_1.  Let A_0 denote the outcome that we accept P_0. Now, to get the probability of correctly identifying the distribution after observing an outcome, we can use Bayes’ formula:

P(H_0\,|\,A_0) = \frac{P(A_0\,|\,H_0)\cdot{}P(H_0)}{P(A_0)}

where

P(A_0) = P(A_0\,|\,H_0)\cdot{}P(H_0)+P(A_0\,|\,H_1)\cdot{}P(H_1)

If we use a 5% critical value we have P(A_0\,|\,H_0)=0.95, and let’s assume that our power analysis gave us P(A_0\,|\,H_1)=0.20 (so we have 80% chance of choosing P_1 when it is true and 20% of getting it wrong when it is true).

This gives us

P(H_0\,|\,A_0) = \frac{0.95\cdot{}P(H_0)}{0.95\cdot{}P(H_0)+0.20\cdot{}(1-P(H_0))}

using the assumption that either P_0 or P_1 is true, so P(H_1)=1-P(H_0).

Similarly, letting A_1 denote the outcome that we reject P_0 (and accept P_1), we can obtain the probability that H_0 is in fact false when we reject it:

P(H_1\,|\,A_1) = \frac{P(A_1\,|\,H_1)\cdot{}P(H_1)}{P(A_1\,|\,H_0)\cdot{}P(H_0)+P(A_1\,|\,H_1)\cdot{}P(H_1)} = \frac{0.80\cdot{}(1-P(H_0))}{0.05\cdot{}P(H_0)+0.80\cdot{}(1-P(H_0))}

These two probabilities vary a lot as a function of P(H_0), so it is not surprising that you cannot answer the question without knowing the prior probability.
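To see how much they vary, this sketch evaluates the two formulas for a few priors, using the 95% and 80% figures from the example. The function names are made up for illustration:

```python
def prob_h0_given_accept(prior_h0, accept_given_h0=0.95, accept_given_h1=0.20):
    """P(H_0 | A_0): how often H_0 is true when we accept it."""
    numerator = accept_given_h0 * prior_h0
    return numerator / (numerator + accept_given_h1 * (1.0 - prior_h0))


def prob_h1_given_reject(prior_h0, reject_given_h0=0.05, reject_given_h1=0.80):
    """P(H_1 | A_1): how often H_1 is true when we reject H_0."""
    numerator = reject_given_h1 * (1.0 - prior_h0)
    return numerator / (reject_given_h0 * prior_h0 + numerator)


for prior_h0 in (0.01, 0.5, 0.9, 0.99):
    print(f"P(H_0) = {prior_h0}: "
          f"P(H_0|A_0) = {prob_h0_given_accept(prior_h0):.3f}, "
          f"P(H_1|A_1) = {prob_h1_given_reject(prior_h0):.3f}")
```

With P(H_0)=0.99, for instance, only about 14% of the rejections are actually cases where H_1 is true.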

The overall success rate doesn’t vary quite as much because it is limited by construction.  If all observations are drawn from P_0 we will wrongly reject 5% of them, so our success rate is 95%.  If all observations are drawn from P_1 we will correctly assign 80% of them to P_1 (the success rate from our power analysis).

For values in between the two extremes, we have the formula, using that we choose correctly whenever we combine H_0 with A_0 and H_1 with A_1:

\begin{array}{rcl}
P(H_0,A_0)+P(H_1,A_1)&=&P(A_0\,|\,H_0)\cdot{}P(H_0)+P(A_1\,|\,H_1)\cdot{}P(H_1)\\
&=&0.95\cdot{}P(H_0)+0.80\cdot{}(1-P(H_0))
\end{array}

Notice that our success rate is always between 80% and 95%.  If P(H_0)=0.99 we would be better off always choosing H_0 than using the p-value strategy.
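A short sketch makes the comparison with "always choose H_0" concrete. It uses the 95% and 80% figures from the example; `success_rate` and `break_even` are names introduced here for illustration:

```python
def success_rate(prior_h0, accept_given_h0=0.95, reject_given_h1=0.80):
    """Overall probability of choosing the true hypothesis with the test."""
    return accept_given_h0 * prior_h0 + reject_given_h1 * (1.0 - prior_h0)


# Always choosing H_0 is right exactly when H_0 is true, i.e. with
# probability P(H_0).  Setting 0.95*p + 0.80*(1-p) = p and solving for p
# gives the prior above which that lazy strategy beats the test.
break_even = 0.80 / (1.0 - 0.95 + 0.80)

for prior_h0 in (0.0, 0.5, 0.99, 1.0):
    print(f"P(H_0) = {prior_h0}: test {success_rate(prior_h0):.4f}, "
          f"always H_0 {prior_h0:.4f}")
print(f"break-even prior: {break_even:.4f}")
```

So once H_0 is true in more than about 94% of the cases, the test does worse than not testing at all.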

The reason we ignore the prior probabilities is philosophical rather than mathematical.  It is the old argument between Frequentist and Bayesian statistics.  No one disagrees about Bayes’ formula, that is pure math, but some strongly disagree on whether you can put probabilities, especially prior probabilities, on our hypotheses.

Not that I think anyone would object to the analysis of successes above; the disagreement is whether we can use probabilities of hypotheses in the actual hypothesis test.

The prior probabilities are not based on observed data, and this is why some people find them dodgy to use.  After all, it is little more than gut feeling that lets us choose them.  Well, gut feeling and experience.

In my view, avoiding them is just weaseling out of an important problem of hypothesis testing.  Prior probabilities are already implicitly there in the hypothesis test, just with default (and probably very wrong) values.

The critical value we use to determine the threshold for p-values can be thought of as an implicit prior weight on the hypothesis.  Remember that I wrote above that we implicitly prefer the null hypothesis in the hypothesis test — to the point that we completely ignore the alternative hypothesis — by accepting values that fall in the 95% probability mass of this distribution and rejecting only 5% of the probability mass?

Well, it is not always that we prefer the null hypothesis this way.  We really only “prefer” it if we choose the null hypothesis more often than we should.  Whether this is the case depends on the prior probability.  If P(H_0) is close to 1, pretty much all observations will be from the null distribution, but we are still going to reject 5% of these.

Even if P(H_0) is close to one half, so that the two hypotheses are equally likely, whether we accept too many or too few observations depends on the overlap of the two hypotheses.  If they overlap significantly, we are going to accept many H_1 observations as H_0, while if the overlap is very small, we are going to reject too many H_0 observations (since observations close to the critical value will all be H_0 and practically never H_1).

Ignoring the distributions and just using a standard 5% or 1% p-value is almost guaranteed to be a sub-optimal choice.

Choosing optimal thresholds

Since the default thresholds are unlikely to be optimal, couldn’t we use the power analysis to pick an optimal threshold?

Indeed we can. It just requires that we know P(H_0) and P(H_1).  For any critical value, c, we accept P_0 when X<c and reject it when X\geq c, so we have P(A_0\,|\,H_0)=P_0(X<c) and P(A_1\,|\,H_1)=P_1(X\geq c), and we can compute our success rate as

P(A_0,H_0)+P(A_1,H_1)=P_0(X<c)\cdot{}P(H_0)+P_1(X\geq c)\cdot{}P(H_1)

and optimise that with respect to c.
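The optimisation can be sketched with a plain grid search. This assumes the N(0,1) and N(2,1) example distributions and takes "accept P_0" to mean that the outcome falls below c; the grid bounds and resolution are arbitrary choices:

```python
from math import log
from statistics import NormalDist

p0 = NormalDist(mu=0.0, sigma=1.0)  # null distribution P_0
p1 = NormalDist(mu=2.0, sigma=1.0)  # alternative distribution P_1


def success_rate(c, prior_h0):
    """P(A_0,H_0) + P(A_1,H_1) when we accept P_0 below c and reject above."""
    return p0.cdf(c) * prior_h0 + (1.0 - p1.cdf(c)) * (1.0 - prior_h0)


def best_threshold(prior_h0, lo=-5.0, hi=10.0, steps=15_000):
    """Grid search for the critical value that maximises the success rate."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(grid, key=lambda c: success_rate(c, prior_h0))


for prior_h0 in (0.5, 0.9, 0.99):
    c = best_threshold(prior_h0)
    # For these two normals the optimum also has a closed form, which is
    # the point where P(H_0) * p0.pdf(c) equals P(H_1) * p1.pdf(c).
    closed_form = 1.0 + 0.5 * log(prior_h0 / (1.0 - prior_h0))
    print(f"P(H_0) = {prior_h0}: grid {c:.3f}, closed form {closed_form:.3f}, "
          f"success rate {success_rate(c, prior_h0):.4f}")
```

Note that none of the optimal thresholds in this example coincide with the conventional 1.64; with equal priors the best cut is at 1, and it moves to the right as H_0 becomes more likely a priori.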

There are still two problems with this approach.

First, we are picking a single threshold to choose between the two hypotheses, but that might not be the optimal approach.  This is the case if, for example, H_0 is the most likely hypothesis to the left of one critical value, and then again to the right of another critical value.

The second problem concerns the evidence for or against the two hypotheses, after we conduct our experiment.

If we conduct our experiment and get the outcome \hat{x}, with a p-value, how do we interpret the p-value as strong or weak evidence for either of the hypotheses?  Choosing the optimal critical value means that we have optimised our chance of making the right decision, but after our experiment we cannot directly interpret the evidence for or against.

Ideally, we want to know the probability of the hypotheses, taking the evidence into account, that is we want to know P(H_0\,|\,\hat{x}) and P(H_1\,|\,\hat{x}), but the p-value is not either of those two.  It is something else; something that we cannot directly interpret.

Bayesian hypothesis testing

If we are happy to use P(H_0) and P(H_1) to optimise our success rate, we have already broken the sacret rule of the frequentists, so we might as well go for the full Monty.

With a fully Bayesian approach, we can get the probabilities of the hypotheses a posteriori, that is, after our observations.

Since P(H_1\,|\,\hat{x})=1-P(H_0\,|\,\hat{x}) we just need to work out P(H_0\,|\,\hat{x}).

We get that from Bayes’ rule once more:

P(H_0\,|\,\hat{x})=\frac{P(\hat{x}\,|\,H_0)P(H_0)}{P(\hat{x})}

where

P(\hat{x})=P(\hat{x}\,|\,H_0)\cdot{}P(H_0)+P(\hat{x}\,|\,H_1)\cdot{}P(H_1)

Ok, just for the nitpickers: here I’m using P(\cdot) both as probabilities and densities.  I know that, I just couldn’t bother introducing separate notation for densities.  Just think of a small interval \hat{x}\in\left[x,x+\Delta x\right] — which is more sensible since we never measure \hat{x} with absolute accuracy anyway — and you should be fine…

In a Bayesian hypothesis test we will typically not do exactly this, but work with odds instead.  We talk about the posterior odds P(H_1\,|\,\hat{x}) / P(H_0\,|\,\hat{x}) that we can get as

\frac{P(H_1\,|\,\hat{x})}{P(H_0\,|\,\hat{x})}= \frac{P(\hat{x}\,|\,H_1)}{P(\hat{x}\,|\,H_0)}\times\frac{P(H_1)}{P(H_0)}

where P(\hat{x}\,|\,H_1)/P(\hat{x}\,|\,H_0) is called the Bayes’ factor and P(H_1)/P(H_0) the prior odds.

Whenever the posterior odds is greater than one, we should favour H_1 and whenever it is smaller than one, we should favour H_0.  Since it can directly be interpreted as odds, we even have a quantitative measure of how strong the evidence is, for or against.

You can interpret the Bayes’ factor as the evidence the observed data brings to the table, for or against H_1.  The prior odds capture how likely we think it is that one or the other of the hypotheses is true in general (before we see the data).
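As a sketch of how these quantities fit together, assuming simple point hypotheses with the N(0,1) and N(2,1) example distributions (the function names are made up for illustration):

```python
from statistics import NormalDist


def bayes_factor(x, null, alt):
    """P(x | H_1) / P(x | H_0), using densities for continuous outcomes."""
    return alt.pdf(x) / null.pdf(x)


def posterior_odds(x, prior_h1, null, alt):
    """Posterior odds for H_1 over H_0: Bayes factor times prior odds."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    return bayes_factor(x, null, alt) * prior_odds


def posterior_prob_h1(x, prior_h1, null, alt):
    """P(H_1 | x), obtained by converting the posterior odds."""
    odds = posterior_odds(x, prior_h1, null, alt)
    return odds / (1.0 + odds)


null = NormalDist(mu=0.0, sigma=1.0)
alt = NormalDist(mu=2.0, sigma=1.0)

for x in (0.0, 1.0, 1.645, 3.0):
    print(f"x = {x}: Bayes factor {bayes_factor(x, null, alt):.3f}, "
          f"P(H_1|x) with equal priors "
          f"{posterior_prob_h1(x, 0.5, null, alt):.3f}")
```

An outcome sitting right at the 5% critical value gives a Bayes factor of only about 3.6 in this example, which is fairly modest evidence for H_1.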

You might not feel comfortable with the prior odds.  You have no data to estimate these odds from.  It is a subjective measure of how likely we think the hypotheses are, and different people might have different views on this.

Of course, once you have the Bayes’ factor from the data, people are free to use different prior odds to get the posterior odds.  You still have a quantitative measure of the evidence for or against the hypotheses.  It just depends on the prior belief in the two.

As I have argued above, the prior odds are important when deciding whether you believe in H_0 or H_1.  If you ignore them, and use a default p-value, you just implicitly make an assumption about this.

Our p-value might give us too many false positives or too many false negatives, but we don’t know unless we consider the prior odds.

This is especially important when P(H_1)\ll P(H_0).  Here a traditional p-value threshold of 5% will never be a sensible choice, and you do need to consider how strong the evidence from the data must be before you believe in the alternative hypothesis.
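To get a feeling for how strong the evidence has to be, here is a sketch for the N(0,1) against N(2,1) example with an assumed prior of P(H_1)=0.001. For two unit-variance normals the log of the Bayes’ factor has the closed form mu*x - mu^2/2, which the code relies on; the names are illustrative:

```python
from math import log
from statistics import NormalDist

MU = 2.0  # mean of the example alternative N(2,1); the null is N(0,1)


def log_bayes_factor(x, mu=MU):
    """log of P(x|H_1)/P(x|H_0) for N(mu,1) against N(0,1)."""
    return mu * x - mu * mu / 2.0


def x_needed(prior_h1, mu=MU):
    """Smallest outcome x at which the posterior odds for H_1 reach one."""
    required_log_bf = log((1.0 - prior_h1) / prior_h1)
    return (required_log_bf + mu * mu / 2.0) / mu


prior_h1 = 0.001  # assumed for illustration: H_1 is true once in a thousand
x_star = x_needed(prior_h1)
p_at_x_star = 1.0 - NormalDist(mu=0.0, sigma=1.0).cdf(x_star)

print(f"required Bayes factor: {(1.0 - prior_h1) / prior_h1:.0f}")
print(f"outcome needed: x >= {x_star:.3f}")
print(f"one-sided p-value at that outcome: {p_at_x_star:.2e}")
```

With these numbers the data have to supply a Bayes’ factor of 999 before H_1 is even as likely as H_0, which corresponds to a p-value several orders of magnitude below 5%.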


3 Responses to “The problem with p-values”

  1. Mailund on the Internet » Blog Archive » The problem with p-values (again) Says:

    [...] just saw a great quote that reminds me of the post on p-values I wrote a few days [...]

  2. Torbjörn Larsson, OM Says:

    Interesting and well written! I come here late from Panda’s Thumb for another post, but saw this. As regards statistics I’m rather self-taught myself, but I do enjoy this topic.

    As far as I understand this, bayesian methods are great tools to contingent learning about systems and to model them (say phylogenies), but as regards theory testing to get to firm knowledge not helping understanding. It naively seems to me that bayesian statistics lives in a world where everything is variables subject to likely change, perhaps even under an experiment, or in the next minute or next room; and models subject to likely existence. While frequentist statistics acknowledge different time scales for change (parameters vs variables) and model rejection.

    This is IMHO why one must use previously standardized limits for rejection and don’t use test values for something they aren’t constructed for. (Say care about failed hypotheses and optimal thresholds.) Indeed, as opposed to the post I would be troubled if p-values were more informative than what is required for the actual test. Then I suspect the method would be wrongly constructed, or at least sub-optimal for its purpose. (Isn’t that what degrees of freedom are used for?)

    Btw, I’m not entirely sure the rejection of prior probabilities is entirely philosophical in science. I seem to remember the physicist Sean Carroll having a long discussion concerning this on bloggingheads.

    IIRC Carroll claims that a problem with, I think foremost, the classical Copenhagen interpretation is when it is combined with the idea that a quantum observable exists prior to the observation. (Heh, I even found it on Wikipedia: Counterfactual definiteness!) If one rejects this one doesn’t run into the conceptual problems with preserving local realism that Einstein did which made him propose the Einstein-Podolsky-Rosen paradox.

    So perhaps one can say that as far as physics is concerned the question is resolved, statistics must in some cases be based on measures of observations, and specifically you can’t always use a prioris. (Or you have to reject realism. All this subject to being a theoretical concern, of course. If you can wrap your head around conflicting and/or non-axiomatic methods, they can work too.)

    Whether you call this type of statistics philosophically “true”, factually correct, or simply more general, seems to me OTOH more of a philosophical concern.

    the sacret rule of the frequentists

    Secret or sacred? [A priori I could probably say: both?!] LOL!

  3. Thomas Mailund Says:

    Hi Torbjörn, thanks for your comments.

    Two things, though:

    1) the degrees of freedom are really just a way of parameterising a distribution. It is a way of fitting the null distribution of, say, a likelihood ratio test to the difference in degrees of freedom. It doesn’t change how informative a p-value is, in any way.

    2) It is absolutely true that a threshold should be chosen a priori, not a posteriori, for a proper test. But this is also the case for a Bayesian test, where the prior odds should be chosen a priori before you test.

    Both frequentist and Bayesian tests essentially work by picking a threshold for a test measure and then choosing one model over another based on which side of the threshold the value falls on.

    The difference, however, is that for Bayes factors you can also interpret the test measure as degrees of evidence, something you cannot do with the p-value.
