Archive for March, 2008

Statistical power and interacting genes

Sunday, March 23rd, 2008

ResearchBlogging.orgEarlier this week we discussed the paper below in our association mapping journal club. Lately we have been interested in epistasis (gene-gene interaction) in the context of association mapping -- we have just submitted a paper on the subject and have a few projects in the pipeline working on this problem -- and one problem that concerns us is the power of detecting gene-gene interaction in association mapping. This paper turned out not to really be about that, but it was interesting nonetheless.

Anyway, back to the paper:

Power of genome-wide association studies in the presence of interacting loci
Joseph Pickrell, Françoise Clerget-Darpoux, Catherine Bourgain
Genetic Epidemiology 31(7) 748 - 762

Abstract

Though multiple interacting loci are likely involved in the etiology of complex diseases, early genome-wide association studies (GWAS) have depended on the detection of the marginal effects of each locus. Here, we evaluate the power of GWAS in the presence of two linked and potentially associated causal loci for several models of interaction between them and find that interacting loci may give rise to marginal relative risks that are not generally considered in a one-locus model. To derive power under realistic situations, we use empirical data generated by the HapMap ENCODE project for both allele frequencies and linkage disequilibrium (LD) structure. The power is also evaluated in situations where the causal single nucleotide polymorphisms (SNPs) may not be genotyped, but rather detected by proxy using a SNP in LD. A common simplification for such power computations assumes that the sample size necessary to detect the effect at the tSNP is the sample size necessary to detect the causal locus directly divided by the LD measure r2 between the two. This assumption, which we call the proportionality assumption, is a simplification of the many factors that contribute to the strength of association at a marker, and has recently been criticized as unreasonable (Terwilliger and Hiekkalinna [2006] Eur J Hum Genet 14(4):426-437), in particular in the presence of interacting and associated loci. We find that this assumption does not introduce much error in single locus models of disease, but may do so in so in certain two-locus models.

The problem considered in the paper is the following: If we are searching for gene-disease association and the disease risk depends on an interaction between two variants, will we be able to detect it? I'm simplifying a bit here, but that is the essential question.

Testing single markers

The typical approach for finding genes that affect the disease risk, when analysing the entire genome in any case, is to go through each typed variant and test if the cases and controls have different distributions of genotype frequencies. I've described this in a bit more detail in an earler post, so I won't say much more on that here.

The power to detect an association when it is there, depend on several parameters, such as the allele frequencies, the sample size, and of course the strength of the effect the genotype has on the disease risk, typically measured by the genetic relative risk GRR. For a binary marker (what we typically consider), we can consider the risk of allele aa the "basic" risk (GRRaa=1) and talk about the relative risk of Aa and AA, GRRAa and GRRAA. Different "disease models" put constraint on these, e.g. a dominant model would have GRRaa=GRRAa=1 != GRRAA, but in general there are two risks that can vary in relation to the basic risk.

Gene-gene interaction

Now, if the disease risk depends on several markers, you can have various kinds of interaction. For two markers, you now have nine genotypes, {aa,Aa,AA}x{bb,Bb,BB}, with eight GRRs that can vary in relation to GRRaabb. Again, various "classical" disease models can put constraints on the GRRs.

The problem they consider in the paper is such a pair-wise interaction setup (with four different disease models), and how the power of detecting an association depends on the GRRs, disease model, allele frequencies, etc.

Detecting an association, here, means detecting an association at A or B (or both), but not detecting the right disease model, or detecting that there is really an interaction going on; it is still considered a "hit" if only one of the two markers is found to be associated with the disease. I'll get back to that below.

The way thay go about this is to calculate the marginal GRRs, i.e. the relative risks of AA and Aa when ignoring the B marker, and the GRRs of BB and Bb when ignoring the A marker. These marginal GRRs are, of course, affected by the (interaction) disease model, GRRs of the interacting pair, frequencies, etc, but once the marginal GRRs have been calculated, the power of detection can be computed as if no interaction was going on.

Indirect testing

Typically, we do not have all the variation typed, but rely on tagSNPs to indirectly test for association. The way this works is that the SNPs are correlated (this correlation is called linkage disequilibrium, LD) so the relative risk of one SNP "leaks into" a relative risk of another SNP. The GRRs of a tagSNP depend on the LD with the causal SNP(s) and the allele frequencies and is not straight forward, but as a rule of thumb there is the following relationship: if a sample size of N is needed to detect association at the causal marker, then a sample size of N / r2 is needed at the tagSNP, where r2 is a measure of LD.

Although mathematically justified, it is only a rule of thumb, and it is violated especially in the presence of interaction (where there is potentially LD between the tagSNP and both causal SNPs, to confuse the matter).

A large part of the paper is concerned with this rule of thumb, and in my opinion this is the most interesting part of the paper. We know very little about how we perform in tagging for interaction, since essentially all tagging algorithms are based on the r2 rule of thumb.

Not really about interaction

Since they define "detection of association" to be detection of a marginal association, we are not really considering power of detecting association. For the direct testing (when we are not considering tagSNPs), the interaction doesn't really come into play at all! The interaction model determines the marginal GRR, and as such it is interesting enough, but once we have the marginal GRR, there is nothing new in how we determine the power. The greater the GRR, the greater the power, but that is completely independent of interaction or not.

For the tagging consideration it is a different matter.  There the interaction has an effect, as I mentioned above, because both causal SNPs can be in LD with the tag, and that affects the r2 rule of thumb.

Still, the paper is about the power of detecting marginal association, not interaction, and it is possible (and not even that hard) to construct models where there is a strong interaction association but very little marginal effect.  For such a setup, a marginal test will never be powerful, and a full interaction model must be used.

It is the latter problem we are currently working on in my group.  How do we find pairs that interact but have little marginal effect? (we have just submitted a paper on that), what is the power to detect such interaction? and how well do we tag such interaction?


Pickrell, J., Clerget-Darpoux, F., Bourgain, C. (2007). Power of Genome-Wide Association Studies in the Presence of Interacting Loci. Genetic Epidemiology, 31, 748-762.

Maxima, a computer algebra system

Saturday, March 22nd, 2008

The last couple of days, I have been learning to use Maxima, a computer algebra system. I haven't really used a symbolic computer algebra system for years. I used Maple the first two years when I studied math, but after that, I never really felt that I needed it. When getting my computer science degree, that was true, but now that I am doing bioinformatics I do a lot more math than I used to.

wxMaxima screenshotA couple of colleagues use Mathematica, but I am too cheap for that, so I went looking for an open source alternative. I did that a few years ago, but didn't find anything I like (I even tried Maxima back then, but wasn't impressed at the time). Then, earlier this week, one of the post docs at BiRC was using wxMaxima, a GUI wrapping the Maxima system, and it looked really nice, so I decided to give it another try. This time, I'm more impressed. I'm not really sure if it is because my maths skills have improved, or just that the GUI makes it much nicer to work with, though.

Now, I don't know Mathematica and it has been ages since I used Maple, so I don't know how it compares with those. I would imagine that its feature set is limited compared to those, but I do not really need that complicated mathematics, so I doubt it is something that will be a problem for me.

I don't imagine that I will ever do any serious programming in Maxima, though. I will probably only use it to solve a few equations here and there and leave the serious data analysis to R and my numerical computation tasks to Octave or C++ with GSL.

Google docs and gadgets

Saturday, March 22nd, 2008

Now isn't this just cool? Now you can visualise data from your Google spreadsheets on your iGoogle homepage!

Expelled...

Saturday, March 22nd, 2008

Here's a bit of a funny story:

Go read it.

Painting "go faster" stripes on HapCluster

Friday, March 21st, 2008

These days I'm working a lot on my HapCluster association mapping tool. I'm hoping to make a version that scales to genome wide studies.

My "back of an envelope" calculations based on an older version of HapCluster told me it should be possible to analyse a whole genome data set in about a CPU week, which is roughly the time it takes to phase it using the BEAGLE tool, which is the fastest I know about. Since HapCluster doesn't need phasing, it would then be just as fast as analysing the data with Blossoc after phasing (which is much faster than using Blossoc on data that isn't phased).

Unfortunately, some recent changes to HapCluster has made it much slower...

How did I get started with HapCluster?

HapCluster is a method developed by Ed Waldron, David Balding and John Whittaker, all from Imperial College, London (although Ed is now working for a pharma company, as far as I know). The method is described in the paper

Fine Mapping of Disease Genes via Haplotype Clustering.
E.R.B. Waldron, J.C. Whitaker and D.J. Balding. Genetic Epidemiology. 30: 170–179. (2006)

I got involved with the method after attending a talk by Ed in Oxford (and discussing it with him in the pub afterwards -- take that, recent Beer study!). He had implemented the method in R and I suggested to re-implement it in C++, mainly because I wanted to make a fair comparison with Blossoc -- Blossoc was slightly better at localising disease variants, but a runtime comparison with an R implementation would be a bit unfair.

So anyway, he sent me the R code and I translated it into C++, but along the way got a few ideas to further improvements and started working with those. Later I met up with Ed and David a few times to continue the work. I'm going to visit David for about a week next month as well.

How does HapCluster work?

The method is quite simple, which is probably also why it is reasonably fast. It is based on clustering sequences, based on their local similarity around a possible disease locus, and then assign a likelihood to the clustering based on the way cases and controls are distributed in the clusters.

The disease locus, the clustering, and everything else, is explored using a Markov Chain Monte Carlo (MCMC) method. From this, the posterior probability for the location of the disease locus can be extracted, and association can be tested through the Bayes factor when comparing the HapCluster model with a null model that assumes that cases and controls are distributed independently from the genotype.

It is the latter, the Bayes factor, I plan to use in a genome wide setting. I will run the method in sliding windows along the genome, and for each window extract the Bayes factor.

How did I slow it down?

With the help of two students, I refactored the code to make it easier to extend the method. I had made an unholy mess of template programs to mix different likelihood methods and different types of data without paying a runtime penalty, but that code just turned out too hard to modify, so we re-wrote the code to use dynamic virtual function dispatching (taking care to limit how often we called virtual functions so we wouldn't slow the code down too much).

It did slow the code down a little bit, but not enough to be a major problem. But then I added code to automatically detect convergence of the Markov chain, and to adjust the number of iterations of the chain to always get the same effective sample size. This is necessary if I want to deal with genome wide data, where I cannot manually examine the output of each run, but now I have a time problem. In some data sets, it takes ages for the diagnostics to accept that the chain has converged.

How do I speed it up?

It seems to me that I have two ways to speed up the method: Make each iteration in the Markov chain faster, or take fewer steps.

Speeding each step up is not as easy as it sounds. I am spending around 80% of the total runtime in the same function, which is the function that compares the local sequences. If I could just speed that function up, I would gain a lot, but the code there is pretty tight so I don't see any obvious ways of doing it. I am playing with the idea of re-writing the data structure I have the sequences in, so I can exploit parallel instructions on the newer chips, but it is a major change to the tool that might not give me much in the long run, so I will keep that idea for a thesis student some day and not bother with it myself :)

Taking fewer steps is not immediately a good idea. I need the chain to have converged before I can start sampling from it, and I really do need to get the right number of effective samples to get a reasonable result. If the problem is that on some data, the chain takes longer to converge, there is really not much I can do about it.

But I have a suspicion that the chain really has converged and it is just my diagnostics that gets confused. The way I determine convergence is somewhat ad hoc as all such methods are. I sample the numerical parameters I have in the model and compare the first half of the samples with the second half of a chain run for a fixed number of iterations. If a t-test says they have different means, I assume that the chain didn't converge, but if the t-test says they have the same mean, then I accept convergence.

Now, I have noticed that when the chain takes a larger number of iterations before the diagnostics accept convergence, then the number of samples needed to get a fixed effective number of samples is very high. This means that in the cases where the burn in takes a long time, I have a higher auto-correlation between samples.

If the chain moves slowly through the state space, then even if it has converged I will probably see different means between the first and second half of a run.

My idea now is to increase the thinning between samples, based on estimates of the effective samples size, during burn in. It means that I will increase the number of iterations in some cases, but if my theory is correct, I will not run into the problems where I have converged but my test keeps rejecting this.

Let's see how it works...