I often work with people geographically distributed in several countries. This mainly works using a lot of emails and instant messaging and the occasional phone meeting. For writing, we have once or twice tried Google documents but mostly it involves sending documents around by email. This is not really ideal, though. Locally at BiRC we keep documents in a version control repository, but we do not have a distributed version of this, and in any case it would probably require too much from most collaborators.

But now Google’s come up with a new feature:


From the ad, it looks like a “Google CMS”.

I think I’ll sign up for this and see what it can do…

In all honesty, I pulled those numbers out of a hat

Today my blog received a lot of traffic for yesterday’s post about the relative risk of disease genes. I wrote that the relative risks (RR) of the genetic variants we have recently discovered using genome-wide association studies are rather small (1.1 to 1.5), that such a small increase does not matter much, all in all, and that I doubted it would have much of an impact on us to know we carry a gene that increases our risk that little.

All of this I stand by. The numbers for the relative risk are also consistent with the papers I’ve read, but I have not done a proper survey to see the actual distribution of the RRs. I am thinking about doing that now, but it is not quite as simple as it sounds. There is something called “the winner’s curse”, which essentially means that our estimates of the relative risk tend to be higher than the risk really is, because we estimate the risk from a biased sample: the sample in which we discovered the risk in the first place. See Zöllner and Pritchard: Overcoming the winner’s curse for more on this.
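To make the winner’s curse concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the true effect, the standard error, and the significance threshold are made-up numbers, and each study’s estimate is modelled simply as the true log relative risk plus Gaussian noise. It is not the method from Zöllner and Pritchard, just a demonstration of the selection effect.

```python
import math
import random
import statistics

def simulate_winners_curse(true_log_rr=0.18, std_error=0.08, z_threshold=3.0,
                           n_studies=20000, seed=1):
    """Simulate many studies of the same variant (all parameters hypothetical).

    Returns the mean estimate over all studies and the mean estimate over
    only those studies that passed the significance threshold.
    """
    rng = random.Random(seed)
    all_estimates = []
    significant_estimates = []
    for _ in range(n_studies):
        # Each study's estimate is the true effect plus sampling noise.
        estimate = rng.gauss(true_log_rr, std_error)
        all_estimates.append(estimate)
        # The variant is only "discovered" when the z-score is large enough.
        if estimate / std_error > z_threshold:
            significant_estimates.append(estimate)
    return statistics.mean(all_estimates), statistics.mean(significant_estimates)

mean_all, mean_significant = simulate_winners_curse()
print(f"true RR:                        {math.exp(0.18):.3f}")
print(f"mean estimated RR, all studies: {math.exp(mean_all):.3f}")
print(f"mean estimated RR, discoveries: {math.exp(mean_significant):.3f}")
```

Averaged over all studies the estimate is unbiased, but averaged over only the studies that reached significance it is too high, because a study can only pass the threshold if its estimate happened to come out large.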

I gave an example, however, where I said that increasing the risk of cancer from 0.1% to 0.15% (a relative risk of 1.5) would have no consequence whatsoever. Those numbers I just made up. I intentionally picked very small numbers to make a point, but it is a bit dishonest. I don’t know what realistic numbers would be, to be absolutely honest, but these are probably way too small for any “interesting” disease.

If the risk of a disease, without “risk genes”, is 0.1% I don’t think we would bother with it in the first place. It would be pretty hard to find enough cases for a study anyway.

Realistic numbers might be 5% to 7.5% or 10% to 15%. I don’t think it changes my point: people are not going to change their habits for such an increase in risk when they do not change their habits for much larger risks tied to diet, exercise, smoking, drinking, etc. As Genome Technology Online puts it: That’s Because Risk Is Small and Inertia Is Great.
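For what it is worth, the arithmetic behind these examples is simple enough to write down. The sketch below uses a hypothetical helper, risk_summary, and takes the baseline risks from the examples in this post with a relative risk of 1.5. It shows that the same relative risk translates into very different absolute increases.

```python
def risk_summary(baseline_risk, relative_risk):
    """Return (risk for carriers, absolute increase over the baseline)."""
    carrier_risk = baseline_risk * relative_risk
    return carrier_risk, carrier_risk - baseline_risk

# Baseline risks from the examples in the text; RR = 1.5 throughout.
for baseline in (0.001, 0.05, 0.10):
    carrier, extra = risk_summary(baseline, 1.5)
    print(f"baseline {baseline:.1%} -> carrier {carrier:.2%} "
          f"(absolute increase {extra:.2%})")
```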

Anyway, I shouldn’t have made up numbers like that — even as an informal example to make a point — and I wouldn’t have if I knew this many people would read it…

Now I should probably go figure out some accurate numbers so I don’t make the same mistake again.

Association mapping and local genealogies

For a while I’ve been wanting to write a bit about my approach to association mapping, but I’d like to put it into a larger perspective, so I will instead write about how “local genealogies” can be used. This is a large topic, so I will split it into several posts over the coming weeks. At the end, I might collect it all in a small booklet, if there is interest for that.

This is the first part of the series, and I’ll use it to introduce and motivate the use of local genealogies.

Single marker approaches

In association mapping, we look for an association between a genotype and a phenotype. The simplest situation is where we check if a single genetic marker is associated with the phenotype. We can group our sampled individuals according to their genotype at the marker, and then examine the distribution of phenotypes in each group. If the distributions differ significantly (in the statistical sense) between genotypes, then that particular marker is associated with the phenotype.

Testing association

Genotype contingency table

An example will make this clearer. Consider a binary phenotype: either you are affected with the disease or you are unaffected. Then consider the common case where the marker of interest is a single nucleotide polymorphism (SNP) with two different alleles, say A and a. This gives us a contingency table like the one shown on the left.

The association test is the statistical test checking whether the Affected row in the table follows the same distribution as the Unaffected row. This is a well-known problem in statistics that can be checked using e.g. Pearson’s chi-square test or Fisher’s exact test.
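As a minimal sketch of such a test, the following computes Pearson’s chi-square statistic for a 2×3 genotype table using only the standard library. The counts are made up, and the column order AA, Aa, aa is an assumption. A 2×3 table has two degrees of freedom, and for two degrees of freedom the chi-square tail probability has the closed form exp(-x/2), which is what the hypothetical p_value_df2 helper uses.

```python
import math

def chi_square_statistic(table):
    """Pearson's chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df2(stat):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2)

# Hypothetical counts; columns are genotypes AA, Aa, aa.
genotype_table = [
    [30, 50, 20],  # Affected
    [20, 50, 30],  # Unaffected
]

stat = chi_square_statistic(genotype_table)
p_value = p_value_df2(stat)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

With small counts in some cells, Fisher’s exact test would be the safer choice; the chi-square approximation is only used here because it is easy to compute by hand.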

Allele contingency table

Sometimes, when there is reason to believe that the disease has an additive genetic component, meaning that the “at risk” allele, say A, increases the risk twice as much when homozygous compared to heterozygous, a simpler table that only groups by allele is used (see figure on the left). If the disease really does behave this way, the simpler table is more powerful in detecting the association in the statistical test.

The underlying statistical problem is the same, however: testing if the genotype and phenotype are independent or not.
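A sketch of the difference between the two tables, with made-up counts: each individual contributes two alleles, so an AA individual counts twice towards A and an Aa individual once towards each allele. Treating the two alleles of a person as independent observations is an assumption of this simple allele-counting test. The helper names are hypothetical, and the p-values use the closed forms for one and two degrees of freedom.

```python
import math

def chi_square_statistic(table):
    """Pearson's chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (observed - row_totals[i] * col_totals[j] / total) ** 2
        / (row_totals[i] * col_totals[j] / total)
        for i, row in enumerate(table)
        for j, observed in enumerate(row)
    )

def allele_counts(n_AA, n_Aa, n_aa):
    """Collapse genotype counts into counts of the A and a alleles."""
    return [2 * n_AA + n_Aa, 2 * n_aa + n_Aa]

# Hypothetical genotype counts; columns are AA, Aa, aa.
genotype_table = [[30, 50, 20],   # Affected
                  [20, 50, 30]]   # Unaffected
allele_table = [allele_counts(*row) for row in genotype_table]

genotype_stat = chi_square_statistic(genotype_table)
allele_stat = chi_square_statistic(allele_table)

# Tail probabilities: exp(-x/2) for 2 d.f., erfc(sqrt(x/2)) for 1 d.f.
genotype_p = math.exp(-genotype_stat / 2)
allele_p = math.erfc(math.sqrt(allele_stat / 2))

print(f"genotype table: chi-square = {genotype_stat:.2f}, p = {genotype_p:.4f}")
print(f"allele table:   chi-square = {allele_stat:.2f}, p = {allele_p:.4f}")
```

For these particular counts, which were chosen to look additive, the allele table gives the smaller p-value because it spends only one degree of freedom.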

If the Affected row has a different distribution than the Unaffected row, then the rows and columns are not independent and there is an association between genotype and phenotype. These are the kind of markers we are looking for.

Mind you, a test like this is saying nothing about causality: the test is not saying that having one genotype rather than another actually increases your disease risk. It is only saying that in the population as such there is a correlation between the marker and the disease. It is a good starting point to find out about the genetic component of the disease, but nothing more.

Searching for associated markers

This is how we test if a given single marker is associated with the disease. When we go searching for markers associated with the disease, we simply test every marker we have genotyped for association in this way.

Do we then need to test all 3 billion nucleotides in our genome? No, not really. First of all, there is no real variation in the vast majority of nucleotides, so there is nothing to test there. Secondly, the nucleotides are correlated, so by testing some of the nucleotides, we will also learn about potential association with others and we do not need to test these explicitly then.

We can “tag” unobserved SNPs with a subset of the existing SNPs — unimaginatively called “tagSNPs”. This way we indirectly survey the entire genome, but we can get away with only testing 500,000 to 1 million SNPs.

We need to be careful in how we select these tagSNPs, though. To select them, we need to know how they are correlated with those we do not test. This correlation we have learned from large studies such as the HapMap project. An important measure of the correlation between SNPs is the so-called r² measure, which is just the square of the statistical correlation between them. If we pick tagSNPs such that the untyped SNPs are in high r² with the tagSNPs (typically we aim for r² > 0.8), we have a good chance of also picking up association by this indirect test.
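As a sketch of how r² can be computed for two SNPs with alleles A/a and B/b, the function below takes counts of the four possible haplotypes. It assumes phased haplotype data, which is an assumption of this example, and the counts are invented. The formula is the standard one: D = p(AB) - p(A)p(B), and r² = D² / (p(A)(1-p(A))p(B)(1-p(B))).

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """Squared correlation between two biallelic SNPs from haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at the first SNP
    p_B = (n_AB + n_aB) / total   # frequency of allele B at the second SNP
    d = p_AB - p_A * p_B          # linkage disequilibrium coefficient D
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))

# Hypothetical haplotype counts (AB, Ab, aB, ab).
for counts in [(50, 0, 0, 50), (48, 2, 2, 48), (45, 5, 5, 45)]:
    r2 = r_squared(*counts)
    verdict = "tags" if r2 > 0.8 else "does not tag"
    print(f"{counts}: r2 = {r2:.4f}, {verdict} at the 0.8 threshold")
```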

Even if we can reduce the number of tests to a million, there is still a problem, though. The way statistical testing is done, we consider something a “hit” if we observe something that is unlikely unless there is a true signal. Unlikely, but not impossible. The definition of unlikely, in this case, could be that it only occurs 5% or 1% of the time by chance if there is no true signal.

If each test will report a false hit with 1% probability, we expect 10,000 false hits if we perform 1 million tests. This is known as the multiple testing (or multiple comparison) problem, and we need to correct for this to reduce the number of false findings.

If the tests are completely independent, a conservative correction is the Bonferroni correction, but SNPs are correlated, even tagSNPs, so this correction is not necessarily appropriate here. It is still commonly used, though, since the correct way of correcting for multiple tests in association mapping is still up for debate and far from well understood.
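The numbers involved are easy to work out. In the sketch below, the 1 million tests and the 1% per-test level come from the example above, while the 5% family-wise level used for the Bonferroni threshold is simply a common choice and an assumption here.

```python
n_tests = 1_000_000

# Expected number of false hits when every test has a 1% false-positive rate
# and there is no true signal anywhere.
per_test_alpha = 0.01
expected_false_hits = n_tests * per_test_alpha

# Bonferroni: to keep the chance of any false hit at or below family_alpha,
# each individual test must pass a threshold of family_alpha / n_tests.
family_alpha = 0.05
bonferroni_threshold = family_alpha / n_tests

print(f"expected false hits at alpha = {per_test_alpha}: {expected_false_hits:,.0f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold:.1e}")
```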

Multi-marker approaches

Considering only single markers at a time is simple and results are easy to interpret, but testing every marker independently does have some problems. The most serious problem is the statistical power to detect disease association indirectly through tag SNPs.

In Pe’er et al. (2006), they considered this problem and analysed how well the SNPs on commercial genotyping chips tagged other known, but untyped, SNPs.

Figure 2 from Pe’er et al.

They analysed which fraction of untyped SNPs from HapMap phase II or ENCODE, respectively, were tagged by the SNPs on the chip (for ENCODE distinguishing between SNPs with minor allele frequency, MAF, above 5% or 1%); see the figure on the left (Fig. 2 from Pe’er et al.).

For Europeans (CEU) and Asians (CHB+JPT), slightly more than half of the SNPs with MAF > 5% were tagged, and for Africans (YRI) slightly less than half of the SNPs were tagged. For SNPs with MAF > 1% but < 5%, far fewer were tagged.

Quoting Pe’er et al.:

Even though the majority of common variants is captured by the current generation of genome-wide arrays, there is a substantial component of common variation not highly correlated to a SNP on each array.

Figure 3 from Pe’er et al.

This is when they tried to tag SNPs with individual markers. They also tried tagging SNPs using two or three typed markers to predict the untyped markers, and observed that using more markers improved the tagging (see Fig. 3 from Pe’er et al. on the left).

Quoting, once again, Pe’er et al.:

We observe that multimarker predictors based on combinations of alleles of two or three SNPs can capture (at r² ≥ 0.8) an additional 9–25% of SNPs in ENCODE or HapMap Phase II.

Figure 2 from de Bakker et al (2005)

Essentially the same thing was already shown in de Bakker et al. (2005), where they considered the number of SNPs needed to tag all SNPs in a reference panel. They compared the number needed when tag SNPs were considered in isolation, when they were considered pairwise, or when any number of tag SNPs could be used. The figure on the left (Fig. 2 from de Bakker et al.) shows the result. Far fewer SNPs are needed to tag unobserved SNPs if we consider more SNPs at a time.
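To give a feel for what “the number of SNPs needed to tag all SNPs in a reference panel” means in the simplest case, where each SNP must be tagged by a single tag SNP, here is a sketch of a greedy selection. This is a generic greedy set-cover heuristic and not necessarily the algorithm used by de Bakker et al.; the function name and the r² matrix are made up for the example, and multi-marker tagging is not covered.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs greedily so every SNP has r2 >= threshold with some tag.

    r2 is a symmetric square matrix (list of lists) of pairwise r2 values.
    Returns the indices of the chosen tag SNPs in the order they were picked.
    """
    n = len(r2)
    # For each candidate, the set of SNPs it would tag (always including itself).
    covers = [{j for j in range(n) if i == j or r2[i][j] >= threshold}
              for i in range(n)]
    untagged = set(range(n))
    tags = []
    while untagged:
        # Choose the SNP that tags the most still-untagged SNPs.
        best = max(range(n), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Hypothetical pairwise r2 values for five SNPs in a reference panel.
example_r2 = [
    [1.00, 0.90, 0.85, 0.10, 0.20],
    [0.90, 1.00, 0.70, 0.10, 0.20],
    [0.85, 0.70, 1.00, 0.10, 0.20],
    [0.10, 0.10, 0.10, 1.00, 0.95],
    [0.20, 0.20, 0.20, 0.95, 1.00],
]

tags = greedy_tag_snps(example_r2)
print(f"tag SNPs: {tags} ({len(tags)} of {len(example_r2)} SNPs typed)")
```

In this toy panel two tag SNPs are enough to cover all five SNPs at the 0.8 threshold.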

All this seems to indicate that we might be better off in our search if we consider sets of markers instead of the markers independently.

It is not quite that simple, because of the multiple testing problem. There is a trade-off between number of tests and probability of capturing the causative SNP. This is also briefly addressed in Pe’er et al.:

However, when any additional testing (such as the addition of the multimarker tests) is performed, the benefits of capturing more variation need to be evaluated against the statistical cost of performing additional hypothesis testing. This is because addition of statistical tests could, in principle, lead to a reduction in power by requiring increased statistical significance thresholds to maintain constant type I error rates (or, conversely, allowing substantially more false positives if statistical thresholds are unchanged).

Exhaustively testing all marker pairs, triplets, or higher-order combinations is not the way to go. We need to be a bit more sophisticated in how we use multiple markers.
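A quick count shows why. Assuming 1 million typed markers (the upper end of the range mentioned earlier) and, purely for illustration, a Bonferroni-style threshold at a 5% family-wise level, the number of tests and the per-test threshold for exhaustive testing look like this:

```python
import math

n_markers = 1_000_000
family_alpha = 0.05  # assumed family-wise level, for illustration only

tests_by_order = {}
for k in (1, 2, 3):
    # Number of distinct sets of k markers.
    tests_by_order[k] = math.comb(n_markers, k)
    threshold = family_alpha / tests_by_order[k]
    print(f"sets of {k}: {tests_by_order[k]:,} tests, "
          f"per-test threshold {threshold:.1e}")
```

Going from single markers to pairs multiplies the number of tests by roughly half a million, and the threshold each test must pass shrinks accordingly.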

Local genealogies

One way to go about this is through population genetics and our understanding of the process that shaped the chromosomes we analyse. When we analyse part of a chromosome, we can construct a mathematical model of the history of that part of the chromosome — the local genealogy — explaining how our sampled chromosomes are related, and from that model we can extract the information we need to search for the disease marker.

There are many different ways of modelling the local genealogy, varying greatly in the details they model. In this series I will go through some of the methods based on local genealogies, starting with the most detailed models and moving towards the cruder approximations to the population genetics models (cruder approximations, but not necessarily crude analysis techniques, mind you).

The next post in the series will be on the ancestral recombination graph and how it can be used for association mapping. I hope to have finished that post in a week or two, so come back around that time :)

  1. Pe’er, I., de Bakker, P.I., Maller, J., Yelensky, R., Altshuler, D., Daly, M.J. (2006). Evaluating and improving power in whole-genome association studies using fixed marker sets. Nature Genetics, 38(6), 663-667. DOI: 10.1038/ng1816
  2. de Bakker, P.I., Yelensky, R., Pe’er, I., Gabriel, S.B., Daly, M.J., Altshuler, D. (2005). Efficiency and power in genetic association studies. Nature Genetics, 37(11), 1217-1223. DOI: 10.1038/ng1669

How important are genetic risks really?

This post is inspired by, but not really related to, this blog post: In what we trust. It is an interesting read, so go read it, but the one sentence that I want to focus on is this (emphasis mine):

The headlines seem inevitably to contrast starkly with the output of government and industry that seeks to quash our fears and to emphasise how doubling a tiny, tiny risk is no big deal.

With personal genomics getting big, I’ve been thinking about the impact of knowing your genes and the disease risk they carry. With the validated findings we have, the relative risks are all very small. Somewhere between 1.1 and 1.5, most of them. Having one variant over another might increase your risk of a certain kind of cancer from 0.1% to 0.15%. Is that really going to matter for you?

Considering that smoking has a massively larger relative risk, and that so many people still smoke, do you really think that they will change their lifestyle if we tell them that they are in a risk group due to their genes, if the increased risk is really that small? Do you think people will change their diet if their increased risk of diabetes is that small?

I seriously doubt it.

What we learn from association mapping about the genetics of diseases is important for our understanding of those diseases, but — human nature being as it is — I don’t think it will matter much for the individual to know his personal genetic risk.