Posts Tagged ‘p-values’

True positives and false positives

Thursday, July 2nd, 2009

Following up on my rant on p-values, I want to say something about true and false positives in a classical statistical hypothesis test.

Such a test works as follows: we observe a value – assumed to be from some stochastic distribution – and calculate a p-value.  We then check if that p-value is below some significance level – typically 5% – and if it is we say that it is “significant” (or a positive) and if it is not we say that it is “not significant” (or a negative).

That is all there is to it.

Now, the p-value says something about the probability of false positives, those observations that are really sampled from the null distribution but are judged positives by the test.  With a 5% significance level, we expect that 5% of values truly sampled from the null distribution will be significant.  In other words, if we make a number of observations, and all our observations are from the null distribution, then we should get positives for about 5% of our tests.
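A quick way to convince yourself of this is to simulate it.  The sketch below is only an illustration and not part of the association setup used later: it assumes a one-sample t-test on 20 standard normal values, so the null hypothesis (mean zero) is true in every replicate.  The names n.tests and null.pvalues are made up for the example.

```r
set.seed(1)
n.tests <- 2000

# Each replicate draws 20 values from N(0,1) and tests whether the mean
# is zero.  The null is true every time, so any significant result is a
# false positive.
null.pvalues <- replicate(n.tests, t.test(rnorm(20))$p.value)

# Fraction of tests called significant at the 5% level.
false.positive.rate <- mean(null.pvalues < 0.05)
false.positive.rate
```

The fraction should land close to 0.05, with some random scatter around it.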

Of course we do not really believe that all our observations are from the null distribution.  If we did, we wouldn’t make any tests.  What we actually believe is that we have some mixture of values from either the null distribution, P_0 or some alternative distribution, P_1, so our values are drawn from \pi_0P_0 + \pi_1P_1 where \pi_0 is the probability that the sample is from the null distribution and \pi_1=1-\pi_0 is the probability that it is from the alternative distribution.

An example: association mapping

In association mapping we test genetic markers, to figure out if any marker is associated with some disease.  We do not believe that all markers are associated with the disease, nor do we believe that none of them are.  In the mixture above, \pi_0 would be the (prior) probability that any given marker is not associated with the disease, while \pi_1 would be the probability that it is associated with the disease.

We will simplify the setup a bit.  We will consider a marker where one allele is found in 40% of the population.  Those without that allele have a fixed disease risk, while those with it have an increased risk given by a genetic relative risk, GRR: if the wild types have risk r then the mutants (those with the 40% allele) have the risk GRR\cdot r.  We sample N individuals, categorise them according to allele and according to disease, and then make a \chi^2 test for association.

create.sample <- function(N, mutant.freq, wt.risk, GRR) {
  # Sample N individuals; each one is a mutant with probability
  # mutant.freq (so about N*mutant.freq of them are mutants).  Assign
  # disease status with risk wt.risk for wildtypes and GRR*wt.risk
  # for mutants.
  mutant.status <- runif(N) < mutant.freq
  risk <- ifelse(mutant.status, GRR*wt.risk, wt.risk)
  disease.status <- runif(N) < risk
  # Cross-tabulate allele against disease status.
  ftable(mutant.status, disease.status)
}

data <- create.sample(2000,0.4,0.05,2)
chisq.test(data)

So here I sample 2000 individuals, say that the wildtype risk is 5% and that the mutants have a GRR of 2, so twice the risk of the disease.

Since this sample has a GRR of 2 – twice the risk for mutants – we are sampling from the alternative distribution.  So if the p-value is significant we have a true positive – a signal where there should be one – while if the p-value is not significant we have a false negative – no signal where there should be one.

If, instead, we sample with a GRR of 1 – the mutant risk is the same as the wild type – a significant p-value would be a false positive while a non-significant p-value would be a true negative.

With me so far?  Good.

Now we can set up the mixture from above.

sample.pvalue <- function(N,mutant.freq,wt.risk,GRR) {
  chisq.test(create.sample(N,mutant.freq,wt.risk,GRR))$p.value
}

sample.mixture <- function(true.risk, n, N, mutant.freq, wt.risk, GRR) {
  associated <- runif(n) < true.risk
  p.values <- sapply(associated,
                     function(t)
                     ifelse(t,
                            sample.pvalue(N,mutant.freq,wt.risk,GRR),
                            sample.pvalue(N,mutant.freq,wt.risk,1)))
  significant <- p.values < 0.05
  ftable(significant, associated)
}

The real setup for association mapping is somewhat more complex, since there we never sample new individuals for each marker – so the tests are not independent – and the allele frequency varies for each marker and so on.  No matter, this setup is good enough for a blog.

Sampling true and false positives and negatives

Ok, now we can try sampling from the mixture.

First, we can try \pi_1=0.  That is, we can try sampling from a distribution of markers that are guaranteed not to be associated with the disease.

> sample.mixture(0, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE
significant
FALSE                       964
TRUE                         36

We only get “false” signals – since none of the markers are associated with the disease, but we still get some positives.  False positives, of course.  We expect about 5% – so 50 since we sample 1000 markers.  We get 36 instead of 50, since this is a random process, but you get the point.

If, instead, we sample with \pi_1=1 we only see associated markers, but we still test them, so some of them might end up being classified as negatives.  False negatives, of course.

> sample.mixture(1, 1000, 2000, 0.4, 0.05, 2)
            associated TRUE
significant
FALSE                       10
TRUE                       990

The significance level doesn’t tell you anything about the expected number of false negatives under the alternative distribution.  In a classical hypothesis test, the alternative distribution is completely ignored.  Only the null distribution matters.  This is one of the reasons I don’t particularly like this type of test, but that is a different story.

The point is just, that even with all values “true” we still see some negatives.

With a mixture of associated and not associated markers, say 50%/50%, we expect both associated and not associated markers and both positives and negatives, of course.

> sample.mixture(0.5, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    490    5
TRUE                      20  485

We want to find as many true positives as possible, while getting as few false positives as possible.  We can only see the significance status, of course, so we cannot tell which of the positives are true or false.

In the case above, we find as significant 485 of the associated markers – and miss five of them – and we get only 20 false positives.  Not bad, but then, 50% of the markers were true associations to begin with.

What if \pi_1=0.1?

> sample.mixture(0.1, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    876    2
TRUE                      38   84

or \pi_1=0.05?

> sample.mixture(0.05, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    893    2
TRUE                      49   56

We still do okay at picking up the associated markers.  In both cases we only miss two of them (but of course the actual number depends on the randomness in the sample).  We just see fewer of the true associations in the tests.

The ratio of true to false positives changes as well.  We still only expect 5% of the non associated markers to be significant, but there are just more of them now, and if we reduce the frequency of associated markers to 1% we now see more false than true positives:

> sample.mixture(0.01, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    953    0
TRUE                      43    4

In a real association mapping study, we don’t expect 1% of the markers to be associated with the disease.  If we test a million markers, we don’t expect more than a few tens to a few hundreds to really be true associations, so \pi_1 is pretty low indeed.

With a significance value of 5% the true positives would completely drown in the sea of false positives.
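The arithmetic behind this is simple enough to write down.  Here is a minimal sketch, assuming a power of 0.99 (roughly what the \pi_1=1 run above gave, 990 out of 1000) and using a made-up helper, expected.positives.  The last value of \pi_1, one in ten thousand, is my stand-in for "a hundred true associations in a million markers".

```r
# Expected number of false and true positives among n tests when a
# fraction pi1 of the markers is truly associated.
# false positives: n * (1 - pi1) * alpha
# true positives:  n * pi1 * power
expected.positives <- function(n, pi1, alpha, power) {
  data.frame(pi1 = pi1,
             false.positives = n * (1 - pi1) * alpha,
             true.positives  = n * pi1 * power)
}

# power = 0.99 is an assumption read off the simulation, not a known value.
expected <- expected.positives(n = 1000,
                               pi1 = c(0.5, 0.1, 0.05, 0.01, 1e-4),
                               alpha = 0.05,
                               power = 0.99)
expected
```

The false positives stay near 5% of the non-associated markers throughout, while the true positives shrink in proportion to \pi_1.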

Power

Formally, statistical power is the probability of getting a significant p-value when the sample is from the alternative distribution.  It is the dual to the significance value.

The significance value doesn’t say anything about what happens to observations from the alternative distribution.  The power doesn’t say anything about what happens to values from the null distribution.

It just tells you how likely it is that you detect values from the alternative distribution as significant.

In our example above, the power is pretty good.  We recognize pretty much all of the associated markers as significant.

With a sample size of 2000, a high risk allele frequency and a relative risk of 2, we get a pretty strong signal.

Our only real problem is that we still detect a lot more false associations than true associations.

Boosting the power, which essentially amounts to increasing the sample size since that is the only variable we are in control of, doesn’t help with this at all!

If you increase the sample size, you still get 5% of the false associations as significant.  That is just how the significance value works.

The only way to reduce the number of false positives is to lower the significance value.  If you pick 1% you only get 1% of the false associations as significant.  With 0.1% you only get 0.1%.

This is why we do multiple test correction.

Multiple test correction

Multiple test correction really just means lowering the significance threshold.  There are different ways of doing it, but it pretty much all amounts to figuring out how far to lower the threshold to reach a level where you expect few false positives.

Let’s try it out.  We update our code

sample.mixture <- function(true.risk, n, sig.level, N, mutant.freq, wt.risk, GRR) {
  associated <- runif(n) < true.risk
  p.values <- sapply(associated,
                     function(t)
                     ifelse(t,
                            sample.pvalue(N,mutant.freq,wt.risk,GRR),
                            sample.pvalue(N,mutant.freq,wt.risk,1)))
  significant <- p.values < sig.level
  ftable(significant, associated)
}

and sample again.  Say with a significance level of 0.01%:

> sample.mixture(0.01, 1000, 0.0001, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    989    6
TRUE                       0    5

We reduce the number of false positives – which is good – but we also reduce the number of true positives – which is bad.

If we reduce the significance value, we also make it harder for samples from the alternative distribution to get p-values below the threshold.  We reduce the power.

The example above is not actually that bad.  We still find about half of the true associations (5 of 11).  But then, a significance level of 0.01% is actually a bit high for a genome wide association study.  If, there, we test 1 million markers, and are willing to get around one false positive, we should use a significance value of one in a million.

> sample.mixture(0.01, 1000, 1e-6, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    986   12
TRUE                       0    2

The more tests you make, the lower a significance value you need, if you want to keep the expected number of false positives constant.
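As a small sketch of that bookkeeping: the per-test level is just the number of false positives you are willing to accept divided by the number of tests.  The Bonferroni correction is the same idea, except that it starts from a family-wise level (0.05 here, my choice) instead of an expected count.  The four p-values below are invented for illustration; p.adjust is the standard R function for this.

```r
n.tests <- 1e6

# Accepting about one false positive among a million null tests.
expected.false.positives <- 1
per.test.level <- expected.false.positives / n.tests

# Bonferroni: divide a family-wise level by the number of tests.
family.level <- 0.05
bonferroni.level <- family.level / n.tests

# Invented p-values, imagined as four out of the million tests.
p.values <- c(1e-9, 3e-8, 2e-7, 0.01)

# Either compare against the lowered threshold ...
by.threshold <- p.values < bonferroni.level

# ... or scale the p-values up and compare against the original level.
# n tells p.adjust how many tests were made in total.
by.adjustment <- p.adjust(p.values, method = "bonferroni", n = n.tests) < family.level

rbind(by.threshold, by.adjustment)
```

Both routes pick out the same p-values; the correction only moves the threshold.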

Of course, what you really want is not to keep the number of false positives fixed but rather to get a good ratio of false to true positives, but I’ll leave that for another post.

The point here is just that if you want to keep the number of false positives down, you will end up reducing your power as well.

Of course, now you have a reason for increasing the sample size.  That will make the alternative distribution less similar to the null distribution and therefore more likely to give small p-values.  It still won’t do anything to the distribution of p-values from the null distribution, but it will change the distribution of p-values from the alternative distribution.
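To see the effect at a strict threshold, here is a rough sketch that estimates the power by simulation for two sample sizes.  It is self-contained, so it repeats the sampling scheme from create.sample under a new name; simulate.pvalue, estimate.power, the 200 replicates and the second sample size of 8000 are all my choices for the illustration.

```r
# One simulated association test: returns the chi-square p-value.
simulate.pvalue <- function(N, mutant.freq, wt.risk, GRR) {
  mutant <- runif(N) < mutant.freq
  disease <- runif(N) < ifelse(mutant, GRR * wt.risk, wt.risk)
  # Fix the factor levels so the table is always 2x2.
  counts <- table(factor(mutant, levels = c(FALSE, TRUE)),
                  factor(disease, levels = c(FALSE, TRUE)))
  chisq.test(counts)$p.value
}

# Fraction of truly associated markers (GRR = 2) that come out
# significant at sig.level, estimated from repeated simulations.
estimate.power <- function(N, sig.level, replicates = 200) {
  p.values <- replicate(replicates, simulate.pvalue(N, 0.4, 0.05, 2))
  mean(p.values < sig.level)
}

set.seed(1)
power.2000 <- estimate.power(2000, 1e-6)
power.8000 <- estimate.power(8000, 1e-6)
c(N2000 = power.2000, N8000 = power.8000)
```

With only 200 replicates the estimates are noisy, but the gap between the two sample sizes should be large enough to see.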

I wanted to say something about the case where the “actual” null distribution is not really the mathematical null distribution, but this post is getting pretty long already, so I think I’ll leave that for another post, so stay tuned…


What is a p-value

Wednesday, July 1st, 2009

One thing that shocked me in the last three days’ exams was the students’ understanding of p-values.  Not that all of them misunderstood them, not by far, but some had a very flawed understanding, and the mind really boggles at how they can think what they do and still use p-values the way they do…

I’m not really a fan of p-values myself, for the reasons I wrote about in great detail before, but p-values are probably here to stay so people really need to understand this if they want to do any kind of statistics!

What is a p-value

It can be a bit tricky to get the definition right, so I’ll just quote Matthew Stephens:

A p value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.

So if you have a stochastic variable X, distributed as P(X) under the null hypothesis, and if you observe the value x, then p=P(X\geq x) (this is a one sided p-value just for convenience).
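To make this concrete, here is a minimal sketch that assumes X is standard normal under the null hypothesis and that we observed x = 1.8 (both are made-up choices).  It computes the one-sided p-value directly and then again the way the quote describes it, as a proportion over many hypothetical repeats of the experiment.

```r
# Assumed for the example: X ~ N(0,1) under the null, observed x = 1.8.
x <- 1.8

# p = P(X >= x), read off the normal distribution.
p.exact <- pnorm(x, lower.tail = FALSE)

# The same quantity as a proportion: repeat the experiment many times
# under the null and count how often the result is at least as large.
set.seed(1)
repeats <- rnorm(100000)
p.simulated <- mean(repeats >= x)

c(exact = p.exact, simulated = p.simulated)
```

The two numbers should agree to a couple of decimals, at about 0.036.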

One of the reasons it is a bit tricky is that it is not itself a random variable.  Once you have observed x it is fixed.  It just tells you how likely it is, under the null hypothesis, that you observe something at least as extreme as what you actually observed.

If you repeat the experiment and observe another x then that would have another p-value, so in that sense there is some stochastic behavior, but only in that sense.

Actually, if you repeat the experiment lots of times, and all the observations are actually from the null distribution, then the p-values will be uniformly distributed between 0 and 1.  This is why, if we have a significance threshold of \alpha, we expect a fraction \alpha of the observations to be false positives.  That is, if we sample N values from the null distribution, we expect about \alpha\cdot N significant observations.  By chance.  They are false positives since we consider them significantly different from what we “would expect” even if they behave exactly as expected.
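The uniform distribution is easy to check by simulation.  A sketch, again assuming a standard normal null distribution and the one-sided p-value from above; the bin width of 0.1 and the variable names are arbitrary.

```r
set.seed(1)
n <- 10000

# Observations that really are from the (assumed) N(0,1) null, and
# their one-sided p-values P(X >= x).
x <- rnorm(n)
p.values <- pnorm(x, lower.tail = FALSE)

# Split [0,1] into ten equally wide bins and count the p-values in each.
bins <- cut(p.values, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
bin.fractions <- as.vector(table(bins)) / n
round(bin.fractions, 3)

# The fraction below a threshold of alpha = 0.05.
significant.fraction <- mean(p.values < 0.05)
significant.fraction
```

Each bin should hold roughly a tenth of the p-values, and roughly 5% of them should fall below 0.05.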

What is a p-value not

A p-value is not the probability that the null hypothesis is true.  It really isn’t.

The p-values are uniformly distributed if we sample values from the null distribution, so how can it be?  All p-values are exactly equally likely under the null hypothesis.  It is not any more likely to get a p-value of 0.99 than it is of getting a p-value of 0.01, if the observations are really from the null distribution.

If you think that this distinction on what a p-value really is, is somewhat technical and not really that important, then let me ask you this: if the p-value is really the probability that the null hypothesis is true then why do we typically only reject the null hypothesis if the p-value is below 5%?

If the p-value was really that, wouldn’t we go for all p-values below 50%?

Those would be the observations where it is more likely that the observation is from the alternative distribution than from the null distribution.  If you had to make a bet on whether an observation is from the null distribution or the alternative, and you only bet on the alternative when the p-value is below 0.05, you would be betting on the null for every p-value between 0.05 and 0.5, and so betting against what you believe to be the odds each time!

This is where the mind boggles.

If you really believe that the p-value is the probability that the null is true, then how can you go ahead and only bet on that when the p-value is below 0.05?

This is not some subtle point of definition.  If you really believe that the p-value is the probability that the null model is true, then what you are doing is just plain stupid.  You are making bets against common sense.

True, if somehow false positives are much more expensive than false negatives you might have a point here, but if not you are just making a lot more mistakes than you could be making.

If you pick the significant values from the 5% significance threshold, and p-values really are the probability that the null is true, then you would end up with at least 95% true positives among those you select.

If that is what you are aiming for, then what you are doing makes sense. It is not what you will get – because p-values are just not what you think they are – but at least it makes sense. It does mean that only 5% of those values you pick would be false positives, though, so in the downstream analysis you shouldn’t explain away more than 5% as “probably false positives”.  That would be inconsistent.

By now, if a little late in the post, I should probably say that this confusion came up in the context of genome wide association studies.  Here you expect that by far most genetic variation has no relation to any given phenotype, so if you look at genotypic variation and its association with any given phenotype, you would expect very few variations to be true positives.

If, in this context, you use a 5% significance threshold, you should – again, assuming your understanding of p-values is correct – find 95% true positives and 5% false positives.

If that is true, would you do a multiple test correction?

Wouldn’t it be just fine if you had 95% true positives and 5% false positives?

Would you really try to explain away the vast majority of your “hits” as false positives if you only expect 5% of them really to be false positives?

It seems a bit inconsistent to me…

p-values and evidence

If you ever hear yourself saying that low p-values are “more significant” than high p-values, stop yourself!

Not that it is wrong, as such, but here the argument really is more subtle.

Under the null hypothesis, any p-value is exactly equally likely.  They are uniformly distributed.

Under the null hypothesis, it is exactly as likely to observe a p-value of 0.99 as a p-value of 10^{-99}.

If you say “more significant” you are just wrong.  A p-value is either significant or not, since it boils down to whether it is under the significance threshold or not.

If you say that it is more likely to be a true positive than a false positive, you might be right, but that has to do with the distribution of p-values under the alternative hypothesis.

The reason we look at p-values in the first place is because we think that extreme values of x are more likely under the alternative distribution than under the null distribution.

If something is unlikely under the null distribution, it might be less unlikely under the alternative distribution.  That is why we are interested in small p-values.  It has nothing to do with how likely or unlikely they are under the null distribution.
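A small numerical sketch of this point, assuming X is N(0,1) under the null and N(3,1) under the alternative.  The shift of 3 and the threshold of 0.001 are invented numbers, picked only to make the contrast visible.

```r
# Hypothetical setup: X ~ N(0,1) under the null, X ~ N(effect,1) under
# the alternative, with a one-sided p-value P(X >= x).
effect <- 3
threshold <- 0.001

# The observed value x that gives a p-value of exactly the threshold.
x.cutoff <- qnorm(threshold, lower.tail = FALSE)

# Probability of a p-value below the threshold under each distribution.
prob.null <- pnorm(x.cutoff, mean = 0, lower.tail = FALSE)
prob.alt <- pnorm(x.cutoff, mean = effect, lower.tail = FALSE)

c(null = prob.null, alternative = prob.alt, ratio = prob.alt / prob.null)
```

Under the null the probability is the threshold itself, by construction.  What makes the small p-value interesting is only how much larger the probability is under the alternative.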

A quick comment on p-values and sample size

Your power to detect true positives over false positives is related to sample size, in the sense that the more data you have the more likely it is that you can detect differences between the null and the alternative distribution.

But in what way?

There seems to be some confusion here as well.

If we use the right definition of p-values, then you would expect 5% false positives if you use a threshold of 5%.  That is what a threshold of 5% means.

If you increase your sample size, would that reduce the number of false positives?  No, it wouldn’t!  Regardless of the sample size, the threshold is chosen such that you expect 5% false positives.  The actual threshold values will change, but not the fact that you expect 5% false positives.  That is what the 5% significance threshold is.  It is its being; its purpose in life; what it really is.

By increasing the sample size, the best you can hope for is that more true positives make it below the threshold.  So among the positives you will get a larger fraction of true positives over false positives.  The expected number of false positives cannot change unless you change the threshold.

If you sample values from the null distribution, and you use a 5% significance threshold, you get 5% false positives.

The sample size really only matters if you consider a mixture of the null distribution and the alternative distribution.

How values are distributed under such a mixture depends on the sample size, and the two distributions are easier to distinguish between the larger the sample size (hopefully, at least).  You will always get 5% of the null distribution values with a 5% threshold, but you might get more and more of the alternative distribution values if you increase the sample size.

A quicker comment on p-values and sample size

This is a bit of a trickier point.

In real life, you would not actually expect the same number of false positives if you increase the sample size.

Yes, mathematically, if you have a 5% threshold you would get 5% false positives regardless of the sample size, because 5% of the samples from the null distribution would fall below the threshold.

In real life, the number of false positives would increase if you increase the sample size.

What???

What happens is this: In the real world, no simple null hypothesis is true.  The world is a lot more complicated than simple statistical models.

If, in an association study, your null hypothesis is that there is no genetic difference between cases and controls, you are really saying that there is no genetic difference at all between cases and controls.  But there will be!  If you sample completely at random the null can hold, but you probably don’t.  There will be subtle differences from sources unrelated to the disease.

Let me give you a simpler example.  Consider tossing a die.  The null hypothesis would be that all six sides are equally likely.

If we have two dice, one that is loaded and one that is not, we could be looking for values from the one that is loaded.

If we throw each die once and record the result, we wouldn’t be able to distinguish between them.  I mean, if one is a five and the other is three, which do you think is loaded?

If, on the other hand, we throw them 100 times and notice that one of them hits 6 half of the time, we would be pretty sure that that is the loaded die.

Now, if we throw the dice a million times, and test if each side is equally likely, chances are that we would conclude that both the dice are loaded.  Because the “fair” die is actually loaded, just less so than the “loaded” die.

The “one” side is heavier than the “six” side, because of how the eyes are drilled.  That will show up in a million throws of the die.  Probably not in 100 throws, but with enough throws it will.

The mathematical model of a die does not match reality.  The null model is not true.

The null model is what we test for, and unless it is exactly true, we will reject it with enough samples.

As we increase the sample size, we will get more and more false positives.
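Here is a sketch of the die example, assuming a "fair" die whose six comes up slightly too often and whose one comes up slightly too rarely.  The size of the bias (0.005) is made up; I have no idea what it is for a real die.

```r
set.seed(1)

# Hypothetical, slightly imperfect "fair" die: side one a bit too rare,
# side six a bit too common.  The bias of 0.005 is an invented number.
bias <- 0.005
side.probs <- c(1/6 - bias, rep(1/6, 4), 1/6 + bias)

# Throw the die and test the null that all six sides are equally likely.
# chisq.test on a vector of counts is a goodness-of-fit test against
# equal probabilities.
test.die <- function(throws) {
  counts <- as.vector(rmultinom(1, size = throws, prob = side.probs))
  chisq.test(counts)$p.value
}

p.100 <- test.die(100)
p.million <- test.die(1e6)
c(throws.100 = p.100, throws.1e6 = p.million)
```

With 100 throws the p-value behaves much as it would for a perfect die.  With a million throws it should come out vanishingly small, even though nobody would call this die "loaded".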

The only reason that increasing sample size is sensible at all is that the values from the alternative distribution move away from the theoretical null distribution faster than the values that are only slightly off the mathematical null distribution.

With increased sample size, the fraction of true positives over false positives will improve – up to a point – but the absolute number of false positives will actually increase.

In most cases not something to worry about, but with very large sample sizes and simple models you probably should.

I have rejected a few papers based on this, actually, where the result is rejecting a simple linear model based on an enormous sample size.  Such results are really to be expected, just by chance, because the simple models are never true…
