Association mapping and local genealogies
For a while I've been wanting to write a bit about my approach to association mapping, but I'd like to put it into a larger perspective, so I will instead write about how "local genealogies" can be used. This is a large topic, so I will split it into several posts over the coming weeks. At the end, I might collect it all in a small booklet, if there is interest in that.
This is the first part of the series, and I'll use it to introduce and motivate the use of local genealogies.
Single marker approaches
In association mapping, we look for an association between a genotype and a phenotype. The simplest situation is where we check if a single genetic marker is associated with the phenotype. We can group our sampled individuals according to their genotype at the marker, and then examine the distribution of phenotypes in each group. If the distributions differ significantly (in the statistical sense) between genotypes, then that particular marker is associated with the phenotype.
An example will make this clearer. Consider a binary phenotype: either you are affected with the disease or you are unaffected. Then consider the common case where the marker of interest is a single nucleotide polymorphism (SNP) with two different alleles, say A and a. This gives us a contingency table like the one shown on the left.
The association test is the statistical test checking if the Affected row in the table follows the same distribution as the Unaffected row. This is a well-known problem in statistics that can be checked using e.g. Pearson's chi-square test or Fisher's exact test.
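To make the test concrete, here is a minimal sketch of Pearson's chi-square test on a genotype table, using only the Python standard library. The counts are hypothetical (they are not taken from any study), and the function names are my own. With two rows and three genotype columns the test has 2 degrees of freedom, and for that particular case the p-value has the closed form exp(-x/2), which is what the sketch relies on.

```python
import math

def chi_square_statistic(table):
    """Pearson's chi-square statistic for a table given as a list of rows.

    Assumes every row and every column contains at least one observation.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df2(stat):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2)

# Hypothetical counts; the columns are the genotypes AA, Aa, aa.
affected = [30, 50, 20]
unaffected = [15, 45, 40]

stat = chi_square_statistic([affected, unaffected])
p = p_value_df2(stat)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

For small counts Fisher's exact test is usually the safer choice; in practice one would probably use a statistics library for either test rather than hand-rolled code like this.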
Sometimes, when there is reason to believe that the disease has an additive genetic component, meaning that the "at risk" allele, say A, increases the risk twice as much when homozygous compared to heterozygous, a simpler table that only groups by allele is used (see figure on the left). If the disease really does behave this way, the simpler table is more powerful in detecting the association in the statistical test.
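As a rough sketch of the allele-based table: each individual contributes two alleles, so an AA individual counts twice towards A, an Aa individual once towards each allele, and an aa individual twice towards a. The genotype counts below are hypothetical and the helper names are mine. The resulting 2x2 table has 1 degree of freedom, for which the chi-square p-value can be written as erfc(sqrt(x/2)).

```python
import math

def allele_counts(genotype_counts):
    """Collapse (AA, Aa, aa) genotype counts into (A, a) allele counts."""
    n_AA, n_Aa, n_aa = genotype_counts
    return 2 * n_AA + n_Aa, 2 * n_aa + n_Aa

def chi_square_2x2(a, b, c, d):
    """Pearson's chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(stat):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical genotype counts (AA, Aa, aa) for the two phenotype groups.
affected_A, affected_a = allele_counts((30, 50, 20))
unaffected_A, unaffected_a = allele_counts((15, 45, 40))

stat = chi_square_2x2(affected_A, affected_a, unaffected_A, unaffected_a)
p = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, p = {p:.5f}")
```

Note that counting alleles this way treats the two alleles carried by one individual as independent observations, which is itself an assumption about the population.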
The underlying statistical problem is the same, however: testing if the genotype and phenotype are independent or not.
If the Affected row has a different distribution than the Unaffected row, then the rows and columns are not independent and there is an association between genotype and phenotype. These are the kind of markers we are looking for.
Mind you, a test like this is saying nothing about causality: the test is not saying that having one genotype rather than another actually increases your disease risk. It is only saying that in the population as such there is a correlation between the marker and the disease. It is a good starting point to find out about the genetic component of the disease, but nothing more.
Searching for associated markers
This is how we test if a given single marker is associated with the disease. When we go searching for markers associated with the disease, we simply test every marker we have genotyped for association in this way.
Do we then need to test all 3 billion nucleotides in our genome? No, not really. First of all, there is no real variation in the vast majority of nucleotides, so there is nothing to test there. Secondly, the nucleotides are correlated, so by testing some of them we also learn about potential association with others, which we then do not need to test explicitly.
We can "tag" unobserved SNPs with a subset of the existing SNPs -- unimaginatively called "tagSNPs". This way we indirectly survey the entire genome, but we can get away with only testing 500,000 to 1 million SNPs.
We need to be careful in how we select these tagSNPs, though. To select them, we need to know how they are correlated with those we do not test. This correlation we have learned from large studies such as the HapMap project. An important measure for the correlation between SNPs is the so-called r2 measure, which is just the square of the statistical correlation between them. If we pick tagSNPs such that the untyped SNPs are in high r2 with the tagSNPs -- typically we aim for >0.8 -- we have a good chance of also picking up association by this indirect test.
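For two SNPs with two alleles each, r2 can be computed directly from allele and haplotype frequencies. The sketch below assumes we have phased haplotypes with alleles coded as 0 and 1; the data and the function name are made up for illustration.

```python
def r_squared(haplotypes):
    """r2 between two biallelic SNPs, computed from phased haplotypes.

    Each haplotype is a pair (x, y) of 0/1 alleles at the two SNPs.
    Assumes both SNPs are polymorphic in the sample.
    """
    n = len(haplotypes)
    p_x = sum(x for x, _ in haplotypes) / n    # frequency of allele 1 at SNP 1
    p_y = sum(y for _, y in haplotypes) / n    # frequency of allele 1 at SNP 2
    p_xy = sum(1 for x, y in haplotypes if x == 1 and y == 1) / n
    d = p_xy - p_x * p_y                       # linkage disequilibrium coefficient D
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))

# Hypothetical sample of ten haplotypes.
sample = [(1, 1)] * 4 + [(1, 0)] * 1 + [(0, 0)] * 5
r2 = r_squared(sample)
print(f"r2 = {r2:.3f}, good tag at the 0.8 threshold: {r2 > 0.8}")
```

In this toy sample the two SNPs are clearly correlated, but not strongly enough for one to count as a tag for the other at the usual 0.8 threshold.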
Even if we can reduce the number of tests to a million, there is still a problem, though. The way statistical testing is done, we consider something a "hit" if we observe something that is unlikely unless there is a true signal. Unlikely, but not impossible. The definition of unlikely, in this case, could be that it only occurs 5% or 1% of the time by chance if there is no true signal.
If each test reports a false hit with 1% probability, we expect 10,000 false hits if we perform 1 million tests. This is known as the multiple testing (or multiple comparison) problem, and we need to correct for this to reduce the number of false findings.
A simple correction is the Bonferroni correction. It is valid whether or not the tests are independent, but it becomes overly conservative when the tests are correlated, and SNPs are correlated, even tagSNPs, so this correction is not necessarily appropriate here. It is still commonly used, though, since the correct way of correcting for multiple tests in association mapping is still up for debate and far from well understood.
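The arithmetic behind these numbers is simple enough to sketch. The Bonferroni correction divides the overall significance level we are willing to accept by the number of tests and uses the result as the per-test threshold. The 5% overall level below is just an illustrative choice, and the function names are mine.

```python
def expected_false_hits(n_tests, alpha):
    """Expected number of false hits when no marker is truly associated."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, overall_alpha):
    """Per-test significance threshold under the Bonferroni correction."""
    return overall_alpha / n_tests

n_tests = 1_000_000

false_hits = expected_false_hits(n_tests, 0.01)
threshold = bonferroni_threshold(n_tests, 0.05)

print(f"expected false hits at 1% per test: {false_hits:,.0f}")
print(f"Bonferroni per-test threshold for 5% overall: {threshold:.1e}")
```

With a million tests, a single marker has to reach a p-value on the order of one in twenty million before we call it a hit, which is where the power problems discussed next come from.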
Considering only single markers at a time is simple and results are easy to interpret, but testing every marker independently does have some problems. The most serious problem is the statistical power to detect disease association indirectly through tag SNPs.
In Pe'er et al. (2006), they considered this problem and analysed how well the SNPs on commercial genotyping chips tagged other known, but untyped, SNPs.
They analysed which fraction of untyped SNPs from HapMap phase II or ENCODE, respectively, were tagged by the SNPs on the chip (for ENCODE distinguishing between SNPs with minor allele frequency, MAF, above 5% or 1%) -- see the figure on the left (Fig 2 from Pe'er et al.).
For Europeans (CEU) and Asians (CHB+JPT), slightly more than half of the SNPs with MAF > 5% were tagged, and for Africans (YRI) slightly less than half the SNPs were tagged. For SNPs with MAF > 1% but < 5%, far fewer were tagged.
Quoting Pe'er et al.:
Even though the majority of common variants is captured by the current generation of genome-wide arrays, there is a substantial component of common variation not highly correlated to a SNP on each array.
This is when they tried to tag SNPs with individual markers. They also tried tagging SNPs using two or three typed markers to predict the untyped markers, and observed that using more markers improved the tagging (see Fig. 3 from Pe'er et al. on the left).
Quoting, once again, Pe'er et al.:
We observe that multimarker predictors based on combinations of alleles of two or three SNPs can capture (at r2 ≥ 0.8) an additional 9–25% of SNPs in ENCODE or HapMap Phase II.
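A toy example may help show why combining markers can tag SNPs that no single marker tags. This is not the procedure used by Pe'er et al., just a made-up set of haplotypes in which the untyped SNP carries allele 1 exactly when both typed SNPs do. Each typed SNP on its own has a low r2 with the untyped SNP, while the two-marker combination predicts it perfectly.

```python
def r_squared(xs, ys):
    """r2 between two 0/1 coded variables observed on the same haplotypes."""
    n = len(xs)
    p_x = sum(xs) / n
    p_y = sum(ys) / n
    p_xy = sum(x * y for x, y in zip(xs, ys)) / n
    d = p_xy - p_x * p_y
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))

# Hypothetical haplotypes: (typed SNP 1, typed SNP 2, untyped SNP).
haplotypes = ([(1, 1, 1)] * 4 + [(1, 0, 0)] * 2 +
              [(0, 1, 0)] * 2 + [(0, 0, 0)] * 4)

snp1 = [h[0] for h in haplotypes]
snp2 = [h[1] for h in haplotypes]
untyped = [h[2] for h in haplotypes]

# Two-marker predictor: allele 1 at both typed SNPs.
both = [a & b for a, b in zip(snp1, snp2)]

r2_snp1 = r_squared(snp1, untyped)
r2_snp2 = r_squared(snp2, untyped)
r2_both = r_squared(both, untyped)

print(f"SNP 1 alone: {r2_snp1:.2f}")
print(f"SNP 2 alone: {r2_snp2:.2f}")
print(f"SNP 1 and SNP 2 together: {r2_both:.2f}")
```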
Essentially the same thing was already shown in de Bakker et al. (2005), where they considered the number of SNPs needed to tag all SNPs in a reference panel. They compared the number needed when tag SNPs were considered in isolation, when they were considered pairwise, or when any number of tag SNPs could be used. The figure on the left (Fig. 2 from de Bakker et al.) shows the result. Far fewer SNPs are needed to tag unobserved SNPs if we consider more SNPs at a time.
All this seems to indicate that we might be better off in our search if we consider sets of markers instead of the markers independently.
It is not quite that simple, because of the multiple testing problem. There is a trade-off between number of tests and probability of capturing the causative SNP. This is also briefly addressed in Pe'er et al.:
However, when any additional testing (such as the addition of the multimarker tests) is performed, the benefits of capturing more variation need to be evaluated against the statistical cost of performing additional hypothesis testing. This is because addition of statistical tests could, in principle, lead to a reduction in power by requiring increased statistical significance thresholds to maintain constant type I error rates (or, conversely, allowing substantially more false positives if statistical thresholds are unchanged).
Exhaustively testing all marker pairs, triplets, or higher-order combinations is not the way to go. We need to be a bit more sophisticated in how we use multiple markers.
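A quick count shows how fast the number of tests grows. The sketch below assumes a chip with 500,000 SNPs (the lower end of the range mentioned earlier) and an overall significance level of 5%, both purely for illustration.

```python
import math

n_snps = 500_000

pairs = math.comb(n_snps, 2)       # all unordered pairs of markers
triplets = math.comb(n_snps, 3)    # all unordered triplets of markers

# Per-test thresholds if we Bonferroni correct for every combination tested.
overall_alpha = 0.05
threshold_single = overall_alpha / n_snps
threshold_pairs = overall_alpha / pairs

print(f"pairs:    {pairs:.3e}")
print(f"triplets: {triplets:.3e}")
print(f"per-test threshold, single markers: {threshold_single:.1e}")
print(f"per-test threshold, all pairs:      {threshold_pairs:.1e}")
```

Already for pairs, the per-test threshold drops by several orders of magnitude compared to single-marker testing, so any power gained from better tagging has to make up for that.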
One way to go about this is through population genetics and our understanding of the process that shaped the chromosomes we analyse. When we analyse part of a chromosome, we can construct a mathematical model of the history of that part of the chromosome -- the local genealogy -- explaining how our sampled chromosomes are related, and from that model we can extract the information we need to search for the disease marker.
There are many different ways of modelling the local genealogy, varying greatly in the details they model. In this series I will go through some of the methods based on local genealogies, starting with the most detailed models and moving towards the cruder approximations to the population genetics models (cruder approximations, but not necessarily crude analysis techniques, mind you).
The next post in the series will be on the ancestral recombination graph and how it can be used for association mapping. I hope to have finished that post in a week or two, so come back around that time :)
- Pe'er, I., de Bakker, P.I., Maller, J., Yelensky, R., Altshuler, D., Daly, M.J. (2006). Evaluating and improving power in whole-genome association studies using fixed marker sets. Nature Genetics, 38(6), 663-667. DOI: 10.1038/ng1816
- de Bakker, P.I., Yelensky, R., Pe'er, I., Gabriel, S.B., Daly, M.J., Altshuler, D. (2005). Efficiency and power in genetic association studies. Nature Genetics, 37(11), 1217-1223. DOI: 10.1038/ng1669