This is a followup on a post I wrote a few days ago, so you might want to read that before you continue here…
False positives are observations that are really from the null distribution, but have a p-value below the significance threshold, and so are still considered significant. As I wrote in the earlier post, if you increase the sample size you can increase your power – the probability that an observation from the alternative distribution is significant – but it will not decrease the chance of false positives.
When you increase the sample size, you decrease the sampling variance, so it becomes easier to distinguish between the null distribution and the alternative distribution, but the threshold for significance moves with the null distribution so that it still cuts off the same fraction of it. If you used a 5% significance level before, and therefore considered 5% of the observations from the null distribution significant, you will still accept 5% of them as significant.
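To make the argument above concrete, here is a minimal sketch that computes the probability of a significant result directly. It assumes a two-sided z-test of "mean = 0" on observations with unit variance; the function name, the true mean of 0.2 and the sample sizes are my choices for illustration only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def rejection_probability(true_mean, n, alpha=0.05):
    """Probability that a two-sided z-test of 'mean = 0' is significant.

    Assumes n independent observations with unit variance, so the z
    statistic is normal with mean true_mean * sqrt(n) and variance 1.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = true_mean * sqrt(n)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    return lower_tail + upper_tail


for n in (10, 100, 1000):
    false_positive_rate = rejection_probability(0.0, n)  # data from the null
    power = rejection_probability(0.2, n)                # data from the alternative
    print(n, round(false_positive_rate, 3), round(power, 3))
```

With a true mean of zero the rejection probability stays at the significance level for every sample size, while with a non-zero mean it grows towards one.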
That is how it works in theory. In practice there is a problem with false positives and sample size.
It is often the case that the mathematical null distribution does not really capture the null hypothesis we are interested in. There is noise in the data that has nothing to do with either the null or the alternative hypothesis, but that moves the data away from the theoretical null distribution we use for our statistical test.
This is by no means a problem for all statistical studies – there are lots of cases where the null distribution does exactly capture what we are testing for – but it does occur often enough that it is worth keeping in mind when you design a study.
Genome wide association mapping
As before, I will use genome wide association mapping as an example.
The goal here is to examine the genome wide genetic variation in cases and controls for some disease, to identify the markers where there is an association between the genotype and the disease. The natural null distribution would then of course be no association and the alternative some association.
What happens if our sample of cases and controls is not taken from a single genetically homogeneous population, but sampled from two populations with some genetic difference?
The two populations would have different allele frequencies along the genome. Large differences at some markers, small at others, but we would expect some differences. If we are looking at a genetic disease, it is therefore natural to expect that the disease frequency varies as well, between the two populations: if some of the at-risk alleles are in higher frequency in one of the populations, then the population risk is higher.
If we sample cases and controls at random, one population would be over-represented in the cases and under-represented in the controls, compared to the joint population as a whole.
If we test the genetic markers against the null distribution of no genetic difference between cases and controls, we are testing the wrong hypothesis! We should be testing for association between marker and disease, but since we cannot do that, we are testing if the allele frequencies are different between the cases and controls. It is the best we can do, of course, but the problem is that the null distribution is not actually capturing our null hypothesis.
With a population structure like this, we really do expect the cases and controls to have different allele frequencies in pretty much all markers. For those markers that really, truly are associated with the disease, we expect the frequency differences to be more extreme – otherwise the exercise would be futile – but we do not really expect zero difference in the other markers.
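The mixing argument can be worked through with a few lines of arithmetic. The sketch below assumes two populations and a marker that has no effect on the disease at all; the population shares, disease prevalences and allele frequencies are made-up numbers chosen only to illustrate the effect.

```python
def case_control_frequencies(pop_share, prevalence, allele_freq):
    """Expected allele frequency among cases and among controls.

    pop_share[i]   -- fraction of the joint population from population i
    prevalence[i]  -- disease frequency in population i
    allele_freq[i] -- frequency of the marker allele in population i

    The marker is assumed to be unrelated to the disease within each
    population; any difference comes from the population mix alone.
    """
    # Cases are drawn in proportion to share * prevalence,
    # controls in proportion to share * (1 - prevalence).
    case_weights = [s * p for s, p in zip(pop_share, prevalence)]
    control_weights = [s * (1 - p) for s, p in zip(pop_share, prevalence)]

    def mixed_frequency(weights):
        total = sum(weights)
        return sum(w * f for w, f in zip(weights, allele_freq)) / total

    return mixed_frequency(case_weights), mixed_frequency(control_weights)


cases, controls = case_control_frequencies(
    pop_share=(0.5, 0.5),      # hypothetical: equal-sized populations
    prevalence=(0.05, 0.15),   # hypothetical: disease is rarer in the first
    allele_freq=(0.30, 0.40),  # hypothetical: marker differs between them
)
print(round(cases, 4), round(controls, 4))
```

The higher-risk population is over-represented among the cases, so the cases pick up more of its allele frequency than the controls do, even though the marker is not associated with the disease. If the two prevalences are set equal, the difference disappears.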
The situation I have described might sound a bit artificial. After all, why not just sample all cases and controls from the same population? Or take the population structure into account when doing the test?
The problem is that we usually do not know that we are sampling from different populations. There is genetic variation within what we would consider the same population, if we sample from different geographic areas. And since people tend to move around, we would have a problem even if we sampled all individuals from the same small town, because some would have ancestors from different parts and that would give us genetic variation just from that.
Consider the plot below:
These show the distribution of test values from the WTCCC study. The gray areas show where we would expect the values to be if they were sampled under the null distribution, and in general the actual observations are above that.
In this study, all samples are from the UK and a priori from the same population. Unless a very large fraction of the tested markers are actually associated with the diseases – a very unlikely scenario – the null distribution simply does not match the null hypothesis.
They did actually test for population structure in that study, and clearly demonstrated that it was there, but you will have to read the paper for that.
Population structure is just one source of noise that can show up in genome wide association studies. There are plenty of others.
The point is not so much that population structure is a problem for association mapping – it is, and it is a big problem – the point is that the obvious null distribution does not match the null hypothesis.
Increasing the sample size
What happens if we increase the sample size?
We make it easier for those observations that are not from the null distribution to be found to be significant. If the data is from a mixture of two distributions – one that shows population differences and one that shows population differences and also disease association – neither of which is the null distribution we test against, then we will see more and more significant results.
Mathematically this is as it should be. We are rejecting the null hypothesis when it is not true, and really we should be rejecting the null hypothesis for all the observations because none of them really are from the null distribution.
This doesn’t mean that increasing the sample size is a bad idea. Let me just make that absolutely clear. Increasing the sample size will make it easier to distinguish between markers that are associated with the disease and markers that are not.
A simple illustration is shown below, where I plot two normal distributions – red and blue – against a null distribution – black dashed line. As the variance decreases, samples from the red and blue densities will become easier to distinguish, but both the blue and the red density will also become easier to distinguish from the null distribution.
The problem is just that if you use the null distribution to pick significant values, you will underestimate the number of false positives – compared to your actual null hypothesis, not the null hypothesis you actually test against – and this error will increase with the sample size.
When you need very large sample sizes to see the true signals in the data – as you do for genome wide association tests – this becomes a real problem. Even if you correct for the large number of tests – and therefore control for the number of false positives you expect to see – you will probably still see lots of false positives. Many more than you would expect if the false positives were really sampled from the null distribution you use in your tests.
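A rough sketch of how quickly this can get out of hand is given below. It is a toy model of my own, not anything taken from the WTCCC study: each non-associated marker is assumed to have a small true difference drawn from a normal distribution with standard deviation tau, so its z statistic has variance 1 + n * tau^2 rather than 1. The number of markers, the value of tau and the use of a Bonferroni correction are likewise assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the theoretical null for the z statistic


def expected_false_positives(n, tau, markers=500_000, alpha=0.05):
    """Expected number of non-associated markers that come out significant.

    Toy model: every marker is tested with a two-sided z-test at the
    Bonferroni-corrected level alpha / markers, but noise unrelated to
    the disease inflates the variance of z from 1 to 1 + n * tau**2.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / (2 * markers))
    actual_sd = sqrt(1 + n * tau ** 2)
    # Two-sided tail probability under the inflated distribution.
    per_marker = 2 * STD_NORMAL.cdf(-z_crit / actual_sd)
    return markers * per_marker


for n in (100, 1000, 10000):
    print(n, expected_false_positives(n, tau=0.0),
          expected_false_positives(n, tau=0.02))
```

With tau set to zero the theoretical null is correct and the expected count stays at alpha whatever the sample size. With even a small non-zero tau, the expected count grows with the sample size, although the correction for multiple testing is exactly the same.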