# Day: January 28, 2009

Computer science, bioinformatics, genetics, and everything in between

## SVG plots

Michael suggested that I try out SVG images instead of antialiased PNG images for my graphs here.

Well, here goes:

### PNG file:

### SVG file:

—

28-46=-18

## Is less *really* more?

This leader in The Economist argues that we are now using Moore’s law to get *cheaper* computers, rather than *more powerful* computers.

> Constant improvements mean that more features can be added to these products each year without increasing the price. A desire to do ever more elaborate things with computers—in particular, to supply and consume growing volumes of information over the internet—kept people and companies upgrading. Each time they bought a new machine, it cost around the same as the previous one, but did a lot more. But now things are changing, partly because the industry is maturing, and partly because of the recession. Suddenly there is much more interest in products that apply the flip side of Moore’s law: instead of providing ever-increasing performance at a particular price, they provide a particular level of performance at an ever-lower price.

I’m not sure that I agree.

Sure, our current computers are “good enough” for what we use them for: office applications, net surfing, watching a movie on the move, etc. But we still want *more*.

We want new features. Most features in an office package we will never use, but every feature that is there is there because *someone* needed it, and when *you* want a feature, you want it there.

The features we want are probably more specialised. The basic features that everyone uses have been around for ages, so a new feature that you would love to see would probably only benefit a few, but it would be great for those few.

I think that what is changing is that we only want the features we *need* and not those features that everyone *else* needs.

We don’t want to pay for an upgrade that adds 100 features where we only need one of them. We just want the one feature we need.

So our approach to computing changes. We move online.

We are happy to get features from Internet services that give us what we need, but we don’t want all those features we *don’t* need installed on our local machine, slowing everything down and confusing our user experience.

The reason “net books” are hot is not that they are cheaper as such. Sure, it helps sales that they are cheap, but they also provide the services we need and are likely to provide *more and more* services over time.

A net book is just an interface to the Net, and the services there keep getting better.

We are not demanding less; we have just realised that the computations don’t have to run on our desktop. They can run somewhere else. In the “cloud”.

It’s grid computing, baby. Cloud computing.

Your interface to it might be getting cheaper — and why not? — but you still want more and more.

—

28-45=-17

## The problem with p-values

In Matthew Stephens’ tutorial at APBC this January, he spent a few slides arguing for Bayes factors and against p-values. I’ve had discussions about this with statisticians in the past, but never really had enough strong arguments. For me it is more of a gut feeling that BFs give you a quantitative measure of the support for two alternative hypotheses, while p-values 1) *a priori* favour one hypothesis over the other and 2) (similarly, but worse) completely *ignore* the alternative hypothesis.

After the tutorial, I now have some stronger arguments, and I’ve been thinking about it a bit since I got back and decided to write them down here.

If you are interested in the tutorial, you can get the outline and slides here:

**Disclaimer**: I’m not a statistician (or even mathematician); I’m a computer scientist with very little schooling in statistics. I use statistics a lot in my work, but I am mainly self-taught, so take this for what it’s worth, and feel free to comment.

### What is a p-value?

Matthew defines p-values as:

> A p-value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.

which is roughly how it is usually defined, so no controversy here.

In “math”, if you have a stochastic variable \(X\) and a value \(x\), then the (one-sided) p-value for \(x\) is \(P(X\geq x)\).

Again, there is nothing controversial here. The math is what it is, nothing more and nothing less. The problem comes when we start to interpret it.

When we do statistics, we do not have a single distribution for \(X\). It is only in math that stochastic variables have nice, known, distributions. In real life, \(X\) can come out as just about anything.

Saying that \(X\) can be anything of course doesn’t help us when we need to interpret data, so we do some mathematical modelling and assume that it has a certain distribution and check that this distribution looks “good enough for jazz” to describe \(X\). We assume as little about \(X\) as is reasonable, and call it the null model (or null distribution) \(P_0\) for \(X\).

Then the p-value for an outcome \(x\) is really \(P_0(X\geq x)\).

### Hypothesis testing

When we do statistics, we want to test whether the null model is true, that is, whether \(X\) really follows the \(P_0\) distribution well enough for our purpose (it never really *will* have a nice distribution, but it might be *good enough*). So we compare the outcome of \(X\) assuming it could come from one of two distributions, our null distribution \(P_0\) or an alternative distribution \(P_1\).

What we *actually* do is to perform an experiment to get an outcome of \(X\), call it \(\hat{x}\), then compute \(P_0(X\geq\hat{x})\) and if this value is small enough, usually below 0.05 or 0.01, then we reject \(P_0\) and conclude that \(X\) is probably distributed as \(P_1\).

As an example, assume \(P_0\) is a normal distribution with mean 0 and standard deviation 1. Then the threshold for which \(\hat{x}\) we would accept as being under \(P_0\) is 1.64: below that we accept the null hypothesis, and above it we reject it.
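The 1.64 threshold is easy to reproduce with a few lines of code. This is a minimal sketch using only Python’s standard library; the helper names (`norm_cdf`, `p_value`) are mine, not anything from the post.

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, written via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def p_value(x_hat, mu=0.0, sigma=1.0):
    """One-sided p-value P0(X >= x_hat) under the null distribution."""
    return 1.0 - norm_cdf(x_hat, mu, sigma)

# Under N(0,1) the 5% critical value sits just above 1.64:
# outcomes below it are accepted, outcomes above it rejected.
print(p_value(1.64))  # just above 0.05
print(p_value(1.65))  # just below 0.05
```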

In the plot below, if \(X\) truly is distributed as \(P_0\) we would accept an outcome with a probability that corresponds to the blue area, 95% of the total probability, and we would reject an outcome with a probability that corresponds to the orange area, 5% of the total probability.

This brings me to my two points at the top.

Notice that \(P_1\) isn’t used when testing if \(X\) is distributed as \(P_0\) or \(P_1\). We only use \(P_0\) to make that decision. We just prefer \(P_0\) as long as it is likely that \(X\) follows that distribution — meaning that *if* it does we want \(\hat{x}\) in the low 95% or 99% probability range under \(P_0\) — but we don’t consider the probability of \(\hat{x}\) under the alternative distribution \(P_1\).

Let us add an alternative distribution \(P_1\), say a N(2,1) distribution, to the plot:

If the null distribution is the true distribution for \(X\) we would still accept \(P_0\) 95% of the time and reject it 5% of the time, but if \(P_1\) is the true distribution, then we would reject \(P_1\) for what amounts to the orange area and accept \(P_1\) for what amounts to the blue area.

Is it reasonable to reject all the outcomes in the orange area? It is not even the case that \(P_0\) is the most likely in all of that area. In the range from 1 to 1.64, we expect more outcomes from \(P_1\) (the blue plus the orange area in the plot below) than from \(P_0\) (the orange area below).

If I were a betting man, I would probably bet on \(P_0\) to the left of 1 and on \(P_1\) to the right of 1.
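The claim about the range from 1 to 1.64 can be checked numerically. A small sketch, assuming the same N(0,1) null and N(2,1) alternative as in the plots (`norm_cdf` and `interval_mass` are my own helpers):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, written via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def interval_mass(a, b, mu, sigma=1.0):
    """Probability mass a normal distribution assigns to [a, b]."""
    return norm_cdf(b, mu, sigma) - norm_cdf(a, mu, sigma)

# Mass each hypothesis puts on the rejected-but-ambiguous range [1, 1.64]:
p0_mass = interval_mass(1.0, 1.64, mu=0.0)  # null N(0,1): ~0.108
p1_mass = interval_mass(1.0, 1.64, mu=2.0)  # alternative N(2,1): ~0.201
print(p0_mass < p1_mass)  # True: P1 outcomes are the majority here
```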

Mind you, it is not completely unreasonable to *a priori* prefer \(P_0\). We usually pick \(P_0\) to be the most parsimonious hypothesis, so we *do* want to prefer it unless evidence goes against it. So in a sense, there is nothing wrong with *a priori* preferring one hypothesis over another, but does it make sense to completely *ignore* the alternative hypothesis when deciding whether it is more or less likely to be true than the null hypothesis?

I’ll get back to this in “Bayesian hypothesis testing” below, but there are a few more points I want to make first…

### P-values are uniformly distributed; they don’t tell you all that much…

The title really says it all. P-values are uniformly distributed (under the null hypothesis).

The outcomes are not. It is not that each outcome \(\hat{x}\) is equally likely under \(P_0\), but the distribution of *p-values* of the outcomes is uniform.

To see this, consider the probability of having a p-value in a small range \(\left[p,p+\Delta p\right]\). That p-values are uniform means that the probability of hitting this interval is \(\Delta p\) (since the full range of p-values is 0 to 1).

The p-value interval corresponds to an x interval \(\left[x-\Delta x,x\right]\) so \(p=P(X\geq x)\) and \(p+\Delta p=P(X\geq x-\Delta x)\). So to hit the p-value interval we need \(\hat{x}\in[x-\Delta x,x]\) which happens with probability

$$P(x-\Delta x \leq X \leq x) = P(X \geq x-\Delta x) - P(X \geq x) = \left(p+\Delta p\right) - p = \Delta p$$

What does that mean for our hypothesis testing?

We often think of small p-values as stronger evidence against the null hypothesis, but the math doesn’t really support that. Under the null distribution, a p-value of \(10^{-8}\) is *exactly* as likely as a p-value of 0.99.

A p-value doesn’t tell you *anything* about the probability of the null hypothesis being true! Small or large, it doesn’t matter!
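The uniformity is easy to see in simulation. A quick sketch, standard library only (nothing here is from the original post):

```python
import random
from math import erf, sqrt

def p_value(x):
    """One-sided p-value P0(X >= x) under a N(0,1) null."""
    return 1.0 - 0.5 * (1.0 + erf(x / sqrt(2.0)))

random.seed(1)
# Draw outcomes from the null distribution and compute their p-values.
pvals = [p_value(random.gauss(0.0, 1.0)) for _ in range(100_000)]

# If p-values are uniform, each decile [d/10, (d+1)/10) holds ~10% of them.
deciles = [sum(1 for p in pvals if d / 10 <= p < (d + 1) / 10)
           for d in range(10)]
print([count / len(pvals) for count in deciles])  # each entry close to 0.10
```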

The only reason that p-values are not completely worthless is that they are *not* uniformly distributed under the *alternative* distribution. If you consider the plots above, you’ll see that we expect more large \(x\) values under \(P_1\) than under \(P_0\), which means that if \(X\) really is distributed as \(P_1\), it is more likely to produce a small p-value than it is under \(P_0\).

Not that we consider that in any quantitative sense when deciding whether to believe in the null or the alternative hypothesis in a hypothesis test. There we just go for \(P_1\) for small p-values and \(P_0\) for large p-values, regardless of the distribution of p-values under \(P_1\).

### Do we really never care about the alternative hypothesis?

By now I might have given you the impression that we never, ever, care about \(P_1\). In the spirit of absolute honesty, I should say that, while we completely ignore it when testing it against \(P_0\), we *do* care about it when setting up an experiment.

When we set up an experiment, we *do* care about the alternative hypothesis. At least we should, if we want to avoid wasting our time on the experiment.

We do what is called a *power study* to figure out our chance of rejecting \(P_0\) assuming, this time, that the *alternative hypothesis* \(P_1\) is true. Remember that the p-values are *not* uniformly distributed if \(P_1\) is the true distribution for \(X\), so we can consider the probability of getting a significant p-value when \(P_1\) is true; that is, we can figure out the probability of choosing \(P_1\) when it *is in fact* the true distribution.

We use this to design our experiment. Not that we can do much about true underlying distributions (assuming such exist), but we can tweak our distributions \(P_0\) and \(P_1\) to give us a reasonable chance of choosing the right one after the experiment. If we do several experiments and average the outcomes, we decrease the variance of the outcome and thus reduce the overlap between the two distributions. This way we can pick the number of samples we need to obtain any given success probability of choosing \(P_1\) assuming it is true.
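A sketch of such a power calculation, using my own toy version of the N(0,1)-null, N(2,1)-alternative example from before (1.6449 is the standard normal 95% quantile):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, written via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def power(n, mu1=2.0):
    """P(reject H0 | H1) when we average n samples at a 5% one-sided level."""
    sigma = 1.0 / sqrt(n)     # standard deviation of the mean of n samples
    c = 1.6449 * sigma        # critical value: P0(mean >= c) = 0.05
    return 1.0 - norm_cdf(c, mu1, sigma)

print(power(1))  # ~0.64: a single observation is underpowered
# Smallest sample size that reaches 80% power:
n = next(n for n in range(1, 100) if power(n) >= 0.80)
print(n, power(n))
```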

So we can use the alternative hypothesis to design our study. After we design our study, however, we completely forget about \(P_1\) and test the hypothesis based only on p-values; p-values that are only based on \(P_0\).

### Prior probabilities

Getting back to hypothesis testing, let’s say we have conducted our experiment and obtained the value \(\hat{x}\). Our two alternatives, from which we have to choose, are whether the value was obtained from a \(P_0\) or a \(P_1\) distribution.

If we use our p-value approach, we have a threshold, say 5%, so if \(P_0\) is true we know we will get it right 95% of the time and wrong 5% of the time. If, on the other hand, \(P_1\) is true, we have done our power analysis and found that with some probability \(\beta\) we choose correctly and with \(1-\beta\) we choose incorrectly.

What is the probability that we choose correctly?

You won’t be able to answer that, I’m afraid. Essential information is missing. Information that we intentionally ignore, because that is what you do if you take this approach to hypothesis testing.

We completely ignore the probability of \(P_0\) or \(P_1\) being true *a priori*, that is, the probability that \(\hat{x}\) was the outcome of a \(P_0\) or a \(P_1\) process in the first place.

Let us denote the “outcome” that \(X\) is drawn from \(P_0\) as \(H_0\) and similarly let \(H_1\) denote that \(X\) was drawn from \(P_1\). Let \(A_0\) denote the outcome that we accept \(P_0\). Now, to get the probability of correctly identifying the distribution after observing an outcome, we can use Bayes’ formula:

$$P(H_0\,|\,A_0) = \frac{P(A_0\,|\,H_0)\cdot{}P(H_0)}{P(A_0)}$$

where

$$P(A_0) = P(A_0\,|\,H_0)\cdot{}P(H_0)+P(A_0\,|\,H_1)\cdot{}P(H_1)$$

If we use a 5% critical value we have \(P(A_0\,|\,H_0)=0.95\), and let’s assume that our power analysis gave us \(P(A_0\,|\,H_1)=0.20\) (so we have an 80% chance of choosing \(P_1\) when it is true and a 20% chance of getting it wrong).

This gives us

$$P(H_0\,|\,A_0) = \frac{0.95\cdot{}P(H_0)}{0.95\cdot{}P(H_0)+0.20\cdot{}(1-P(H_0))}$$

using the assumption that either \(P_0\) or \(P_1\) is true, so \(P(H_1)=1-P(H_0)\).

Similarly, we can obtain the probability that \(H_0\) really is false when we reject it:

$$P(H_1\,|\,A_1) = \frac{P(A_1\,|\,H_1)\cdot{}P(H_1)}{P(A_1\,|\,H_0)\cdot{}P(H_0)+P(A_1\,|\,H_1)\cdot{}P(H_1)} = \frac{0.80\cdot{}(1-P(H_0))}{0.05\cdot{}P(H_0)+0.80\cdot{}(1-P(H_0))}$$

These two probabilities vary a lot as a function of \(P(H_0)\), so it is not surprising that you cannot answer the question without knowing the prior probability.
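To make the dependence concrete, here is a small sketch that simply evaluates the two formulas above for a few priors (function names are mine):

```python
def p_h0_given_a0(prior_h0, acc_h0=0.95, acc_h1=0.20):
    """P(H0 | A0): probability H0 is true given that we accepted it."""
    num = acc_h0 * prior_h0
    return num / (num + acc_h1 * (1.0 - prior_h0))

def p_h1_given_a1(prior_h0, rej_h0=0.05, rej_h1=0.80):
    """P(H1 | A1): probability H1 is true given that we rejected H0."""
    num = rej_h1 * (1.0 - prior_h0)
    return num / (rej_h0 * prior_h0 + num)

# Both posteriors swing widely with the prior probability of H0:
for prior in (0.1, 0.5, 0.9):
    print(prior, p_h0_given_a0(prior), p_h1_given_a1(prior))
```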

The overall success rate doesn’t vary quite as much, because it is limited by construction. If all observations are drawn from \(P_0\) we will wrongly reject 5% of them, so our success rate is 95%. If all observations are drawn from \(P_1\) we will correctly reject \(H_0\) for 80% of them (the power from our power analysis), so our success rate is 80%.

For values in between the two extremes, we have the formula, using that we choose correctly whenever we combine \(H_0\) with \(A_0\) and \(H_1\) with \(A_1\):

$$P(H_0,A_0)+P(H_1,A_1) = P(A_0\,|\,H_0)\cdot{}P(H_0)+P(A_1\,|\,H_1)\cdot{}P(H_1) = 0.95\cdot{}P(H_0)+0.80\cdot{}(1-P(H_0))$$

Notice that our success rate is always between 80% and 95%. If \(P(H_0)=0.99\) we would be better off *always* choosing \(H_0\) than using the p-value strategy.
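This last claim is quick to verify (a sketch with the same 95%/80% numbers as above; the function name is mine):

```python
def success_rate(prior_h0, acc_h0=0.95, acc_h1=0.80):
    """P(H0,A0) + P(H1,A1): probability the test picks the true hypothesis."""
    return acc_h0 * prior_h0 + acc_h1 * (1.0 - prior_h0)

# With P(H0) = 0.99, the test is right about 94.9% of the time...
print(success_rate(0.99))
# ...while blindly answering "H0" every time is right 99% of the time.
print(success_rate(0.99) < 0.99)  # True
```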

The reason we ignore the prior probabilities is philosophical rather than mathematical. It is the old argument between Frequentist and Bayesian statistics. No one disagrees about Bayes’ formula, that is pure math, but some strongly disagree on whether you can put probabilities, especially prior probabilities, on hypotheses.

Not that I think anyone would object to the analysis of successes above; the disagreement is whether we can use probabilities of hypotheses in the actual hypothesis test.

The prior probabilities are not based on observed data, and this is why some people find them dodgy to use. After all, it is little more than gut feeling that lets us choose them. Well, gut feeling and experience.

In my view, avoiding them is just weaseling out of an important problem of hypothesis testing. Prior probabilities are already implicitly there in the hypothesis test, just with default (and probably very wrong) values.

The critical value we use to determine the threshold for p-values can be thought of as an implicit prior weight on the hypothesis. Remember that I wrote above that we implicitly prefer the null hypothesis in the hypothesis test — to the point that we completely ignore the alternative hypothesis — by accepting values that fall in the 95% probability mass of this distribution and rejecting only 5% of the probability mass?

Well, it is not always that we *prefer* the null hypothesis this way. We really only “prefer” it if we choose the null hypothesis more often than we should. Whether this is the case depends on the prior probability. If \(P(H_0)\) is close to 1, pretty much all observations will be from the null distribution, but we are still going to reject 5% of these.

Even if \(P(H_0)\) is close to one half — so the two hypotheses are equally likely — whether we accept too many or too few observations depends on the overlap of the two hypotheses. If they overlap significantly, we are going to accept many \(H_1\) observations as \(H_0\), while if the overlap is very small, we are going to reject too many \(H_0\) observations (since observations close to the critical value will almost all be \(H_0\) and practically never \(H_1\)).

Ignoring the distributions and just using a standard 5% or 1% p-value is almost guaranteed to be a sub-optimal choice.

### Choosing optimal thresholds

Since the default thresholds are unlikely to be optimal, couldn’t we use the power analysis to pick an optimal threshold?

Indeed we can. It just requires that we know \(P(H_0)\) and \(P(H_1)\). For any critical value \(c\), we have \(P(A_0\,|\,H_0)=P_0(X<c)\) and \(P(A_1\,|\,H_1)=P_1(X\geq c)\), so we can compute our success rate as

$$P(A_0,H_0)+P(A_1,H_1)=P_0(X<c)\cdot{}P(H_0)+P_1(X\geq c)\cdot{}P(H_1)$$

and optimise that with respect to \(c\).
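A brute-force sketch of that optimisation, with the N(0,1)/N(2,1) example and equal priors (a simple grid search; all names are mine):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, written via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def success(c, prior_h0=0.5):
    """Success rate when accepting H0 for x < c and rejecting for x >= c."""
    return (norm_cdf(c, mu=0.0) * prior_h0
            + (1.0 - norm_cdf(c, mu=2.0)) * (1.0 - prior_h0))

# Grid search over candidate critical values.
best = max((c / 100.0 for c in range(-200, 400)), key=success)
print(best)  # 1.0: with equal priors the optimum is where the densities cross
```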

There are still two problems with this approach.

First, we are picking a single threshold to choose between the two hypotheses, but that might not be the optimal approach. It could be, for example, that \(H_0\) is most likely to the left of one critical value and then again to the right of another critical value, in which case no single threshold captures the best decision rule.

The second problem concerns the evidence for or against the two hypotheses, *after* we conduct our experiment.

If we conduct our experiment and get the outcome \(\hat{x}\), with a p-value, how do we interpret the p-value as strong or weak evidence for either of the hypotheses? Choosing the optimal critical value means that we have optimised our chance of making the right decision, but after our experiment we cannot directly interpret the evidence for or against.

Ideally, we want to know the probability of the hypotheses, taking the evidence into account, that is we want to know \(P(H_0\,|\,\hat{x})\) and \(P(H_1\,|\,\hat{x})\), but the p-value is not either of those two. It is something else; something that we cannot directly interpret.

### Bayesian hypothesis testing

If we are happy to use \(P(H_0)\) and \(P(H_1)\) to optimise our success rate, we have already broken the sacred rule of the frequentists, so we might as well go for the full Monty.

With a fully Bayesian approach, we *can* get the probabilities of the hypothesis *a posteriori*, that is, after our observations.

Since \(P(H_1\,|\,\hat{x})=1-P(H_0\,|\,\hat{x})\) we just need to work out \(P(H_0\,|\,\hat{x})\).

We get that from Bayes’ rule once more:

$$P(H_0\,|\,\hat{x})=\frac{P(\hat{x}\,|\,H_0)P(H_0)}{P(\hat{x})}$$

where

$$P(\hat{x})=P(\hat{x}\,|\,H_0)\cdot{}P(H_0)+P(\hat{x}\,|\,H_1)\cdot{}P(H_1)$$

Ok, just for the nitpickers: here I’m using \(P(\cdot)\) both for probabilities and densities. I know, I just couldn’t be bothered introducing separate notation for densities. Just think of a small interval \(\hat{x}\in\left[x,x+\Delta x\right]\) — which is more sensible since we never measure \(\hat{x}\) with *absolute* accuracy anyway — and you should be fine…

In a Bayesian hypothesis test we will typically not do exactly this, but work with odds instead. We talk about the *posterior odds* \(P(H_1\,|\,\hat{x}) / P(H_0\,|\,\hat{x})\) that we can get as

$$\frac{P(H_1\,|\,\hat{x})}{P(H_0\,|\,\hat{x})}= \frac{P(\hat{x}\,|\,H_1)}{P(\hat{x}\,|\,H_0)}\times\frac{P(H_1)}{P(H_0)}$$

where \(P(\hat{x}\,|\,H_1)/P(\hat{x}\,|\,H_0)\) is called the *Bayes factor* and \(P(H_1)/P(H_0)\) the *prior odds*.

Whenever the posterior odds are greater than one we should favour \(H_1\), and whenever they are smaller than one we should favour \(H_0\). Since they can be directly interpreted as odds, we even have a quantitative measure of how strong the evidence is, for or against.

You can interpret the Bayes factor as the evidence the observed data brings to the table, for or against \(H_1\). The prior odds capture how likely we think it is that one or the other hypothesis is true in general (before we see the data).
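For the running N(0,1)-versus-N(2,1) example, the Bayes factor even has a closed form, \(\exp(2\hat{x}-2)\), and a sketch of the posterior-odds computation looks like this (assumed densities; names are mine):

```python
from math import exp, pi, sqrt

def norm_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution."""
    return exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * sqrt(2.0 * pi))

def posterior_odds(x_hat, prior_odds=1.0):
    """Posterior odds P(H1|x)/P(H0|x) = Bayes factor times prior odds."""
    bayes_factor = norm_pdf(x_hat, mu=2.0) / norm_pdf(x_hat, mu=0.0)
    return bayes_factor * prior_odds

# With equal prior odds the decision flips exactly at x = 1,
# where the two densities (and hence the Bayes factor) are equal.
print(posterior_odds(1.0))               # 1.0
print(posterior_odds(1.64) > 1.0)        # True
# If H1 is a priori 99 times rarer, the same observation no longer convinces:
print(posterior_odds(1.64, 1.0 / 99.0) > 1.0)  # False
```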

You might not feel comfortable with the prior odds. You have no data to estimate these odds from. It is a subjective measure of how likely we think the hypotheses are, and different people might have different views on this.

Of course, once you have the Bayes’ factor from the data, people are free to use different prior odds to get the posterior odds. You still have a quantitative measure of the evidence for or against the hypotheses. It just depends on the prior belief in the two.

As I have argued above, the prior odds *are* important when deciding whether you believe in \(H_0\) or \(H_1\). If you ignore them, and use a default p-value, you just implicitly make an assumption about this.

Our p-value might give us too many false positives or too many false negatives, but we don’t know unless we consider the prior odds.

This is especially important when \(P(H_1)\ll P(H_0)\). Here a traditional p-value threshold of 5% will *never* be a sensible choice, and you need to consider how strong the evidence from the data must be before you believe in the alternative hypothesis.