True positives and false positives
Thursday, July 2nd, 2009
Following up on my rant on p-values, I want to say something about true and false positives in a classical statistical hypothesis test.
Such a test works as follows: we observe a value – assumed to be from some stochastic distribution – and calculate a p-value. We then check if that p-value is below some significance level – typically 5% – and if it is we say that it is “significant” (or a positive) and if it is not we say that it is “not significant” (or a negative).
That is all there is to it.
Now, the p-value says something about the probability of false positives, those observations that are really sampled from the null distribution but are judged positives by the test. With a 5% significance level, we expect that 5% of values truly sampled from the null distribution will be significant. In other words, if we make a number of observations, and all our observations are from the null distribution, then we should get positives for about 5% of our tests.
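A quick simulation makes the 5% claim concrete. The sketch below is only an illustration and not part of the association mapping example: it assumes normally distributed data and a one-sample t-test (both my choices, picked because the null distribution of the test is then exact), and counts how often tests on pure null data come out significant.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical setup: 10,000 experiments, each with 20 observations
# drawn from the null distribution (standard normal, true mean 0).
null.pvalues <- replicate(10000, t.test(rnorm(20), mu = 0)$p.value)

# Fraction of null experiments that are called significant at the 5% level.
false.positive.rate <- mean(null.pvalues < 0.05)
false.positive.rate
```

The fraction should land close to 0.05, and it stays there no matter how many observations go into each experiment.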
Of course we do not really believe that all our observations are from the null distribution. If we did, we wouldn’t make any tests. What we actually believe is that we have some mixture of values from either the null distribution, $F_0$, or some alternative distribution, $F_1$, so our values are drawn from
$$\pi_0 F_0 + \pi_1 F_1$$
where $\pi_0$ is the probability that the sample is from the null distribution and $\pi_1 = 1 - \pi_0$ is the probability that it is from the alternative distribution.
An example: association mapping
In association mapping we test genetic markers, to figure out if any marker is associated with a disease. We do not believe that all markers are associated with the disease, nor do we believe that none of them are. In the mixture above, $\pi_0$ would be the (prior) probability that any given marker is not associated with the disease, while $\pi_1$ would be the probability that it is associated with the disease.
We will simplify the setup a bit. We will consider a marker where one allele is found in 40% of the population, and say that those without that allele have a fixed disease risk while those with it have an increased risk, given by a genetic relative risk, GRR, such that if the wild types have risk $r$ then the mutants (those with the 40% allele) have the risk $\mathrm{GRR} \cdot r$. We sample $N$ individuals, categorise them according to allele and according to disease, and then make a $\chi^2$ test for association.
create.sample <- function(N, mutant.freq, wt.risk, GRR) {
    # Sample N individuals, each a mutant with probability mutant.freq.
    # Assign case/control status using the wildtype risk wt.risk for
    # wild types and the risk GRR*wt.risk for mutants, then tabulate
    # mutant status against disease status.
    mutant.status <- runif(N) < mutant.freq
    disease.status <- sapply(mutant.status,
                             function (m)
                                 ifelse(m,
                                        runif(1) < GRR*wt.risk,
                                        runif(1) < wt.risk))
    ftable(mutant.status, disease.status)
}
data <- create.sample(2000, 0.4, 0.05, 2)
chisq.test(data)
So here I sample 2000 individuals, say that the wildtype risk is 5% and that the mutants have a GRR of 2, so twice the risk of the disease.
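To get a feel for how strong this signal is, we can skip the sampling and look at the table we would see on average. This is a sketch with the same parameters as above; the names `expected` and `expected.test` are mine, and a real sample will of course scatter around these counts.

```r
N <- 2000; mutant.freq <- 0.4; wt.risk <- 0.05; GRR <- 2

# Expected counts for each combination of allele and disease status.
# matrix() fills column by column: first the healthy, then the diseased.
expected <- matrix(
    c(N * (1 - mutant.freq) * (1 - wt.risk),        # wild type, healthy
      N * mutant.freq       * (1 - GRR * wt.risk),  # mutant, healthy
      N * (1 - mutant.freq) * wt.risk,              # wild type, diseased
      N * mutant.freq       * GRR * wt.risk),       # mutant, diseased
    nrow = 2,
    dimnames = list(mutant.status = c("FALSE", "TRUE"),
                    disease.status = c("FALSE", "TRUE")))
expected

# Chi-squared test on the average table.
expected.test <- chisq.test(expected)
expected.test$p.value
```

With 1200 wild types (60 expected cases) and 800 mutants (80 expected cases), the p-value for the average table is far below 5%, so most samples from this setup should come out significant.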
Since this sample has a GRR of 2 – twice the risk for mutants – we are sampling from the alternative distribution. So if the p-value is significant we have a true positive – a signal where there should be one – while if the p-value is not significant we have a false negative – no signal where there should be one.
If, instead, we sample with a GRR of 1 – the mutant risk is the same as the wild type – a significant p-value would be a false positive while a non-significant p-value would be a true negative.
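The four outcomes are easy to mix up, so here they are as a small lookup. The helper `classify` is a hypothetical name of mine; in a real study the `associated` argument is exactly the thing we do not know.

```r
# Label each test by the (normally unknown) truth and the test result.
classify <- function(associated, significant) {
    ifelse(associated,
           ifelse(significant, "true positive", "false negative"),
           ifelse(significant, "false positive", "true negative"))
}

# All four combinations of truth and test result.
cases <- expand.grid(associated = c(TRUE, FALSE),
                     significant = c(TRUE, FALSE))
cases$outcome <- classify(cases$associated, cases$significant)
cases
```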
With me so far? Good.
Now we can set up the mixture from above.
sample.pvalue <- function(N, mutant.freq, wt.risk, GRR) {
    chisq.test(create.sample(N, mutant.freq, wt.risk, GRR))$p.value
}
sample.mixture <- function(true.risk, n, N, mutant.freq, wt.risk, GRR) {
    # Each of the n markers is truly associated with probability true.risk.
    associated <- runif(n) < true.risk
    # Associated markers are sampled with the given GRR, the rest with GRR 1.
    p.values <- sapply(associated,
                       function(t)
                           ifelse(t,
                                  sample.pvalue(N, mutant.freq, wt.risk, GRR),
                                  sample.pvalue(N, mutant.freq, wt.risk, 1)))
    significant <- p.values < 0.05
    ftable(significant, associated)
}
The real setup for association mapping is somewhat more complex, since there we never sample new individuals for each marker – so the tests are not independent – and the allele frequency varies for each marker and so on. No matter, this setup is good enough for a blog.
Sampling true and false positives and negatives
Ok, now we can try sampling from the mixture.
First, we can try $\pi_1 = 0$ (the first argument, `true.risk`, set to 0). That is, we can try sampling from a distribution of markers that are guaranteed not to be associated with the disease.
> sample.mixture(0, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE
significant
FALSE                    964
TRUE                      36
All the markers here are “false”, since none of them are associated with the disease, but we still get some positives. False positives, of course. We expect about 5%, so 50 since we sample 1000 markers. We get 36 instead of 50, since this is a random process, but you get the point.
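How far off is 36 from 50? Since the tests in this simulation are independent, the number of false positives can be treated as binomial, which gives a rough yardstick. This is a back-of-the-envelope sketch; the variable names are mine.

```r
n.markers <- 1000
sig.level <- 0.05

# Mean and standard deviation of a binomial count of false positives.
expected.fp <- n.markers * sig.level
sd.fp <- sqrt(n.markers * sig.level * (1 - sig.level))

# How many standard deviations is the observed 36 from the expectation?
z <- (36 - expected.fp) / sd.fp
c(expected = expected.fp, sd = sd.fp, z = z)
```

So 36 is about two standard deviations below the expectation, which is on the low side but not alarming. One possible contribution, which I have not checked here, is that `chisq.test` applies a continuity correction to 2×2 tables by default, and that makes the test a little conservative.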
If, instead, we sample with $\pi_1 = 1$ (`true.risk` set to 1) we only see associated markers, but we still test them, so some of them might end up being classified as negatives. False negatives, of course.
> sample.mixture(1, 1000, 2000, 0.4, 0.05, 2)
            associated TRUE
significant
FALSE                    10
TRUE                    990
The significance level doesn’t tell you anything about the expected number of false negatives under the alternative distribution. In a classical hypothesis test, the alternative distribution is completely ignored. Only the null distribution matters. This is one of the reasons I don’t particularly like this type of test, but that is a different story.
The point is just that even with all values “true” we still see some negatives.
With a mixture of associated and not associated markers, say 50%/50%, we expect both associated and not associated markers and both positives and negatives, of course.
> sample.mixture(0.5, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    490    5
TRUE                      20  485
We want to find as many true positives as possible, while getting as few false positives as possible. We can only see the significance status, of course, so we cannot tell which of the positives are true or false.
In the case above, we find as significant 485 of the associated markers – and miss five of them – and we get only 20 false positives. Not bad, but then, 50% of the markers were true associations to begin with.
What if $\pi_1 = 0.1$?
> sample.mixture(0.1, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    876    2
TRUE                      38   84
or $\pi_1 = 0.05$?
> sample.mixture(0.05, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    893    2
TRUE                      49   56
We still do okay at picking up the associated markers. In both cases we only miss two of them (but of course the actual number depends on the randomness in the sample). We just see fewer of the true associations in the tests.
The ratio of true to false positives changes as well. We still only expect 5% of the non associated markers to be significant, but there are just more of them now, and if we reduce the frequency of associated markers to 1% we now see more false than true positives:
> sample.mixture(0.01, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    953    0
TRUE                      43    4
In a real association mapping study, we don’t expect 1% of the markers to be associated with the disease. If we test a million markers, we don’t expect more than a few tens to a few hundreds to really be true associations, so $\pi_1$, the fraction of truly associated markers, is pretty low indeed.
With a significance value of 5% the true positives would completely drown in the sea of false positives.
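The drowning can be put in numbers with a little arithmetic: the expected number of true positives is the number of associated markers times the power, and the expected number of false positives is the number of non-associated markers times the significance level. In the sketch below, `expected.positives` is a hypothetical helper of mine, the power of 0.99 is an assumption read off the simulation with only associated markers (990 of 1000 significant), and 100 true associations in a million is just one point in the "few tens to a few hundreds" range.

```r
expected.positives <- function(n, true.freq, sig.level, power) {
    true.pos  <- n * true.freq * power            # associated and significant
    false.pos <- n * (1 - true.freq) * sig.level  # not associated, significant
    c(true.pos = true.pos,
      false.pos = false.pos,
      false.share = false.pos / (true.pos + false.pos))
}

half     <- expected.positives(1000, 0.5,  0.05, 0.99)
rare     <- expected.positives(1000, 0.01, 0.05, 0.99)
# A million markers, 100 of them truly associated (assumed number).
genome   <- expected.positives(1e6, 100 / 1e6, 0.05, 0.99)
rbind(half, rare, genome)
```

In the last row we expect about 99 true positives among roughly 50,000 false ones, so almost every significant marker is a false positive.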
Power
Formally, statistical power is the probability of getting a significant p-value when the sample is from the alternative distribution. It is the dual to the significance value.
The significance value doesn’t say anything about what happens to observations from the alternative distribution. The power doesn’t say anything about what happens to values from the null distribution.
It just tells you how likely it is that you detect values from the alternative distribution as significant.
In our example above, the power is pretty good. We recognize pretty much all of the associated markers as significant.
With a sample size of 2000, a high risk allele frequency and a relative risk of 2, we get a pretty strong signal.
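Power can be estimated directly by simulation: sample repeatedly from the alternative distribution and count how often the test is significant. The sketch below is self-contained, so it uses its own vectorised helper, `simulate.pvalue` (a name of mine), rather than the functions defined earlier; the 200 repetitions are an arbitrary choice to keep it fast.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

simulate.pvalue <- function(N, mutant.freq, wt.risk, GRR) {
    mutant  <- runif(N) < mutant.freq
    disease <- runif(N) < ifelse(mutant, GRR * wt.risk, wt.risk)
    # Fix the factor levels so the table is always 2x2.
    tab <- table(factor(mutant,  levels = c(FALSE, TRUE)),
                 factor(disease, levels = c(FALSE, TRUE)))
    chisq.test(tab)$p.value
}

# Every sample is from the alternative distribution (GRR = 2).
alt.pvalues <- replicate(200, simulate.pvalue(2000, 0.4, 0.05, 2))

# Estimated power at the 5% significance level.
power.estimate <- mean(alt.pvalues < 0.05)
power.estimate
```

The estimate should be close to the 990 out of 1000 we saw when all markers were associated.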
Our only real problem is that we still detect a lot more false associations than true associations.
Boosting the power, which essentially amounts to increasing the sample size since that is the only variable we are in control of, doesn’t help with this at all!
If you increase the sample size, you still get 5% of the false associations as significant. That is just how the significance value works.
The only way to reduce the number of false positives is to lower the significance value. If you pick 1% you only get 1% of the false associations as significant. With 0.1% you only get 0.1%.
This is why we do multiple test correction.
Multiple test correction
Multiple test correction really just means lowering the significance threshold. There are different ways of doing it, but they pretty much all amount to figuring out how far to lower the threshold to reach a level where you expect few false positives.
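One of the simplest corrections, and the one I use as an illustration here, is Bonferroni: divide the significance level by the number of tests. R's `p.adjust` does the equivalent thing from the other side, by multiplying the p-values by the number of tests. The three p-values below are made up for the example.

```r
n.tests <- 1e6
alpha <- 0.05

# Bonferroni: test each marker at alpha / n.tests instead of alpha.
bonferroni.threshold <- alpha / n.tests

# Equivalent view: scale the p-values up (capped at 1) and keep
# comparing against alpha. The p-values are invented for illustration.
p.values <- c(1e-9, 1e-7, 0.01)
adjusted <- p.adjust(p.values, method = "bonferroni", n = n.tests)
data.frame(p.values, adjusted, significant = adjusted < alpha)
```

Only the first p-value survives: with a million tests, even 1e-7 is not small enough.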
Let’s try it out. We update our code
sample.mixture <- function(true.risk, n, sig.level, N, mutant.freq, wt.risk, GRR) {
    associated <- runif(n) < true.risk
    p.values <- sapply(associated,
                       function(t)
                           ifelse(t,
                                  sample.pvalue(N, mutant.freq, wt.risk, GRR),
                                  sample.pvalue(N, mutant.freq, wt.risk, 1)))
    significant <- p.values < sig.level
    ftable(significant, associated)
}
and sample again. Say with a significance level of 0.01%:
> sample.mixture(0.01, 1000, 0.0001, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    989    6
TRUE                       0    5
We reduce the number of false positives – which is good – but we also reduce the number of true positives – which is bad.
If we reduce the significance value, we also make it harder for samples from the alternative distribution to get p-values below the threshold. We reduce the power.
The example above is not actually that bad. We still find about half of the true associations (5 of 11). But then, a significance level of 0.01% is actually a bit high for a genome-wide association study. If, there, we test 1 million markers, and are willing to get around one false positive, we should use a significance level of one in a million.
> sample.mixture(0.01, 1000, 1e-6, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    986   12
TRUE                       0    2
The more tests you make, the lower a significance value you need, if you want to keep the expected number of false positives constant.
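The rule behind this is simple: the expected number of false positives is at most the number of tests times the significance level, so the level you need is the number of false positives you will tolerate divided by the number of tests. A small sketch, treating all markers as non-associated (which slightly overestimates the false positives) and with test counts chosen by me:

```r
n.tests <- c(1000, 1e5, 1e6)
target.false.positives <- 1

# Significance level that gives about one expected false positive.
sig.level <- target.false.positives / n.tests

# For comparison: expected false positives if we stayed at 5%.
data.frame(n.tests, sig.level, expected.fp.at.5pct = n.tests * 0.05)
```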
Of course, what you really want is not to keep the number of false positives fixed but rather to get a good ratio of false to true positives, but I’ll leave that for another post.
The point here is just that if you want to keep the number of false positives down, you will end up reducing your power as well.
Of course, now you have a reason for increasing the sample size. That will make the alternative distribution less similar to the null distribution, and samples from it are therefore more likely to get small p-values. It still won’t do anything to the distribution of p-values from the null distribution, but it will change the distribution of p-values from the alternative distribution.
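To see how much a larger sample buys, we can approximate the power with a noncentral $\chi^2$ distribution instead of simulating. This is a textbook approximation that I am swapping in for the simulations above, not something the code in this post does: it ignores the continuity correction in `chisq.test`, so the numbers are rough. The helper name `approx.power` is mine.

```r
approx.power <- function(N, sig.level,
                         mutant.freq = 0.4, wt.risk = 0.05, GRR = 2) {
    # Cell probabilities under the alternative: rows are wild type and
    # mutant, columns are healthy and diseased.
    p <- matrix(c((1 - mutant.freq) * (1 - wt.risk),
                  (1 - mutant.freq) * wt.risk,
                  mutant.freq * (1 - GRR * wt.risk),
                  mutant.freq * GRR * wt.risk),
                nrow = 2, byrow = TRUE)
    # Cell probabilities under independence (same margins).
    e <- outer(rowSums(p), colSums(p))
    # Per-individual effect size; the noncentrality grows linearly in N.
    effect <- sum((p - e)^2 / e)
    critical <- qchisq(sig.level, df = 1, lower.tail = FALSE)
    pchisq(critical, df = 1, ncp = N * effect, lower.tail = FALSE)
}

sample.sizes <- c(2000, 4000, 8000)
powers <- sapply(sample.sizes, approx.power, sig.level = 1e-6)
data.frame(sample.sizes, powers)
```

At a threshold of one in a million, 2000 individuals give a power well below one half, while a few thousand more bring it close to one. The null distribution of the p-values is the same in all three cases.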
I wanted to say something about the case where the “actual” null distribution is not really the mathematical null distribution, but this post is getting pretty long already, so I think I’ll leave that for another post, so stay tuned…
For a statistic $X$, distributed as $F_0$ under the null hypothesis, if you observe the value $x$, then the p-value is $P(X \ge x)$ under $F_0$ (this is a one-sided p-value just for convenience).
With a significance level of $\alpha$, we expect a fraction $\alpha$ of the observations from the null distribution to be significant. By chance. They are false positives since we consider them significantly different from what we “would expect” even if they behave exactly as expected.