Archive for March 23rd, 2008

How to interpret a genome wide association study

Sunday, March 23rd, 2008

I just spotted this nice review of genome wide association studies:

How to Interpret a Genome-wide Association Study
Thomas A. Pearson and Teri A. Manolio
Journal of the American Medical Association 2008;299(11):1335-1344.

It is worth a read.

Heads or tails and reliable alignments

Sunday, March 23rd, 2008

I have on several occasions written about the uncertainty inherent in inferred alignments and how this is a potential problem. I hadn't really thought it would be quite as serious as the results in the paper I just read suggest:

Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments
Giddy Landan and Dan Graur
Molecular Biology and Evolution 2007 24(6):1380-1383; doi:10.1093/molbev/msm060

Abstract

The question of multiple sequence alignment quality has received much attention from developers of alignment methods. Less forthcoming, however, are practical measures for addressing alignment quality issues in real life settings. Here, we present a simple methodology to help identify and quantify the uncertainties in multiple sequence alignments and their effects on subsequent analyses. The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such "ideal" alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol. We demonstrate the utility of HoT for phylogenetic reconstruction for the case of 130 sequences belonging to the chemoreceptor superfamily in Drosophila melanogaster, and by analysis of the BaliBASE alignment database. Surprisingly, Neighbor-Joining methods of phylogenetic reconstruction turned out to be less affected by alignment errors than maximum likelihood and Bayesian methods.

In this paper they analyse the quality of multiple sequence alignments in an extremely simple manner: They first align the sequences left to right, then reverse the sequences and align them again, which essentially aligns them right to left. Unless the alignment algorithm has a preferred order of symbols, you'd expect to get the same alignment going left to right as right to left.

Not always, of course: if the algorithm is based on oligonucleotides or such, then the order matters, but in many cases it doesn't.

Comparing head and tail alignments

When the order shouldn't matter, the left-to-right and right-to-left alignments (head and tail alignments in the paper) should be similar, so comparing them should give an indication of how much faith you can have in the inferred alignment.
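The procedure is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: `align` is a hypothetical callable that takes a list of sequences and returns a list of gapped rows of equal length, and `left_justify` is a toy stand-in aligner I made up so the example runs on its own. In practice you would wrap ClustalW, MUSCLE or similar behind `align`.

```python
def heads_or_tails(sequences, align):
    """Return the heads and tails alignments of the given sequences.

    `align` is an assumed interface: list of strings in, list of gapped
    rows (same order, equal length) out.
    """
    # Heads: align the sequences as they are.
    heads = align(sequences)
    # Tails: reverse every sequence, align, and reverse each row back
    # so that the two alignments can be compared column by column.
    reversed_input = [seq[::-1] for seq in sequences]
    tails = [row[::-1] for row in align(reversed_input)]
    return heads, tails


def left_justify(sequences):
    """Toy aligner (for illustration only): pad with gaps on the right."""
    width = max(len(seq) for seq in sequences)
    return [seq.ljust(width, "-") for seq in sequences]


heads, tails = heads_or_tails(["ACGT", "AC"], left_justify)
print(heads)
print(tails)
```

The toy aligner is deliberately order dependent: it always puts gaps at the end it reaches last, so the heads and tails alignments place the gaps at opposite ends.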

They try this out on a family of 130 amino acid sequences of length around 400 using three different alignment tools. This is the result:

                 ClustalW   MUSCLE   ProbCons
Columns            18.0%      8.7%      6.7%
Residue pairs      52.1%     53.7%     60.8%
Shared splits      64.6%     65.4%     59.1%

Here Columns denotes the fraction of identical columns in the two alignments, Residue pairs denotes the fraction of aligned residue pairs (in a "sum of pairs" kind of way) that are identical, and Shared splits denotes the fraction of identical splits (edges) in the BioNJ trees inferred from the two alignments.

Very few alignment columns are shared between the two alignments, but that is not much of a problem: with 130 sequences you wouldn't expect to match many columns exactly. I'm more surprised that the induced pairwise alignments (the pairwise alignments you get by extracting two rows from the multiple alignment, whose agreement is given in Residue pairs) were so different.

It is also a bit shocking that the trees inferred from the two alignments were so different.
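To make the first two measures concrete, here is a minimal sketch of how they could be computed. The exact definitions in the paper may differ in details (for instance in the choice of denominator); I assume both alignments list the same sequences in the same order as gapped strings with "-" for gaps, and I normalise by the first alignment. The function names are my own.

```python
from itertools import combinations


def column_signatures(alignment):
    """Describe each column by which residue of each sequence it contains.

    A column becomes a tuple with one entry per row: the index of the
    residue in the ungapped sequence, or None for a gap.
    """
    next_index = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        signature = []
        for r, row in enumerate(alignment):
            if row[c] == "-":
                signature.append(None)
            else:
                signature.append(next_index[r])
                next_index[r] += 1
        columns.append(tuple(signature))
    return columns


def column_score(a, b):
    """Fraction of the columns of alignment a that also occur in b."""
    columns_b = set(column_signatures(b))
    columns_a = column_signatures(a)
    return sum(col in columns_b for col in columns_a) / len(columns_a)


def residue_pairs(alignment):
    """All pairs of residues placed in the same column."""
    pairs = set()
    for signature in column_signatures(alignment):
        for (r1, i1), (r2, i2) in combinations(enumerate(signature), 2):
            if i1 is not None and i2 is not None:
                pairs.add((r1, i1, r2, i2))
    return pairs


def residue_pair_score(a, b):
    """Fraction of the aligned residue pairs of a that are also aligned in b."""
    pairs_a, pairs_b = residue_pairs(a), residue_pairs(b)
    return len(pairs_a & pairs_b) / len(pairs_a) if pairs_a else 1.0


heads = ["AAC", "-AC"]
tails = ["AAC", "A-C"]
print(column_score(heads, tails), residue_pair_score(heads, tails))
```

Note that the tails alignment must already have been reversed back to the original orientation before it is compared with the heads alignment.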

What is causing this?

There is uncertainty in inferring alignments, but why would the same algorithm give different results when running left-to-right compared to right-to-left?

As far as I can see, there are two different things going on here. One having to do with there being more than one optimal alignment (also discussed in the paper), and one having to do with heuristics in searching for optimal alignments.

When there is more than one optimal alignment (which is often the case), even algorithms guaranteed to find an optimal alignment will give you an arbitrary one (though usually a deterministic arbitrary choice). The arbitrary choice can easily differ between running left-to-right and right-to-left.
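A tiny pairwise example shows the effect. The sketch below is a textbook Needleman-Wunsch global alignment with a fixed tie-breaking order in the traceback; the scoring values (match 1, mismatch -1, gap -1) and the tie-breaking order are my own choices for the illustration, not anything taken from the paper or from a particular tool.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of x and y; returns (row_x, row_y, score)."""
    n, m = len(x), len(y)
    # score[i][j] is the best score for aligning x[:i] with y[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the end. Ties are always broken the same way:
    # first (mis)match, then a gap in y, then a gap in x.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                row_x.append(x[i - 1]); row_y.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_x.append(x[i - 1]); row_y.append("-")
            i -= 1
            continue
        row_x.append("-"); row_y.append(y[j - 1])
        j -= 1
    return "".join(reversed(row_x)), "".join(reversed(row_y)), score[n][m]


def heads_and_tails_pair(x, y):
    """Align x and y in both orientations; tails is reversed back."""
    heads = needleman_wunsch(x, y)
    tx, ty, tscore = needleman_wunsch(x[::-1], y[::-1])
    return heads, (tx[::-1], ty[::-1], tscore)


heads, tails = heads_and_tails_pair("AAC", "AC")
print(heads)
print(tails)
```

For "AAC" against "AC" both orientations reach the same optimal score, but the single gap ends up opposite a different A in each, simply because the deterministic tie-break is applied from opposite ends of the sequences.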

For multiple sequence alignments, it is computationally infeasible to guarantee finding an optimal alignment, so heuristics are used to search for (near or locally) optimal alignments. This is often some variation on a greedy strategy, and each choice there will potentially lead to a different alignment. It is easy to see how left-to-right and right-to-left alignments can be different with such a strategy.

In any case, the take home message is, once again: don't trust alignments!


Landan, G., Graur, D. (2007). Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments. Molecular Biology and Evolution, 24(6), 1380-1383. DOI: 10.1093/molbev/msm060

Statistical power and interacting genes

Sunday, March 23rd, 2008

Earlier this week we discussed the paper below in our association mapping journal club. Lately we have been interested in epistasis (gene-gene interaction) in the context of association mapping -- we have just submitted a paper on the subject and have a few projects in the pipeline working on this problem -- and one problem that concerns us is the power of detecting gene-gene interaction in association mapping. This paper turned out not to really be about that, but it was interesting nonetheless.

Anyway, back to the paper:

Power of genome-wide association studies in the presence of interacting loci
Joseph Pickrell, Françoise Clerget-Darpoux, Catherine Bourgain
Genetic Epidemiology 31(7) 748 - 762

Abstract

Though multiple interacting loci are likely involved in the etiology of complex diseases, early genome-wide association studies (GWAS) have depended on the detection of the marginal effects of each locus. Here, we evaluate the power of GWAS in the presence of two linked and potentially associated causal loci for several models of interaction between them and find that interacting loci may give rise to marginal relative risks that are not generally considered in a one-locus model. To derive power under realistic situations, we use empirical data generated by the HapMap ENCODE project for both allele frequencies and linkage disequilibrium (LD) structure. The power is also evaluated in situations where the causal single nucleotide polymorphisms (SNPs) may not be genotyped, but rather detected by proxy using a SNP in LD. A common simplification for such power computations assumes that the sample size necessary to detect the effect at the tSNP is the sample size necessary to detect the causal locus directly divided by the LD measure r^2 between the two. This assumption, which we call the proportionality assumption, is a simplification of the many factors that contribute to the strength of association at a marker, and has recently been criticized as unreasonable (Terwilliger and Hiekkalinna [2006] Eur J Hum Genet 14(4):426-437), in particular in the presence of interacting and associated loci. We find that this assumption does not introduce much error in single locus models of disease, but may do so in certain two-locus models.

The problem considered in the paper is the following: If we are searching for gene-disease association and the disease risk depends on an interaction between two variants, will we be able to detect it? I'm simplifying a bit here, but that is the essential question.

Testing single markers

The typical approach for finding genes that affect the disease risk, at least when analysing the entire genome, is to go through each typed variant and test whether the cases and controls have different distributions of genotype frequencies. I've described this in a bit more detail in an earlier post, so I won't say much more about it here.

The power to detect an association when it is there depends on several parameters, such as the allele frequencies, the sample size, and of course the strength of the effect the genotype has on the disease risk, typically measured by the genetic relative risk (GRR). For a biallelic marker (what we typically consider), we can take the risk of genotype aa as the "basic" risk (GRR_aa = 1) and talk about the relative risks of Aa and AA, GRR_Aa and GRR_AA. Different "disease models" put constraints on these, e.g. a model that is dominant in the risk allele A would have GRR_Aa = GRR_AA != 1, while a recessive one would have GRR_Aa = 1 != GRR_AA, but in general there are two risks that can vary in relation to the basic risk.
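To see how the GRRs translate into the genotype frequency differences that a single-marker test looks for, here is a small sketch. It assumes Hardy-Weinberg proportions in the population, and the function and parameter names are my own. The genotype distribution among cases follows from Bayes' rule: each population genotype frequency is weighted by its GRR and the result is renormalised. For a rare disease the controls look roughly like the general population.

```python
def population_genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies; p is the frequency of allele A."""
    q = 1.0 - p
    return {"aa": q * q, "Aa": 2.0 * p * q, "AA": p * p}


def case_genotype_freqs(p, grr_het, grr_hom):
    """Genotype frequencies among cases, given GRR_Aa and GRR_AA (GRR_aa = 1)."""
    population = population_genotype_freqs(p)
    grr = {"aa": 1.0, "Aa": grr_het, "AA": grr_hom}
    # P(genotype | case) is proportional to P(genotype) * GRR(genotype).
    weights = {g: population[g] * grr[g] for g in population}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# A dominant model with a relative risk of 2 for carriers of A.
print(population_genotype_freqs(0.5))
print(case_genotype_freqs(0.5, grr_het=2.0, grr_hom=2.0))
```

The larger the GRRs, the further the case distribution moves from the population distribution, and the easier it is to detect the difference with a given sample size.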

Gene-gene interaction

Now, if the disease risk depends on several markers, you can have various kinds of interaction. For two markers, you now have nine genotypes, {aa,Aa,AA}x{bb,Bb,BB}, with eight GRRs that can vary in relation to GRRaabb. Again, various "classical" disease models can put constraints on the GRRs.

The problem they consider in the paper is such a pair-wise interaction setup (with four different disease models), and how the power of detecting an association depends on the GRRs, disease model, allele frequencies, etc.

Detecting an association, here, means detecting an association at A or B (or both), but not detecting the right disease model, or detecting that there is really an interaction going on; it is still considered a "hit" if only one of the two markers is found to be associated with the disease. I'll get back to that below.

The way they go about this is to calculate the marginal GRRs, i.e. the relative risks of AA and Aa when ignoring the B marker, and the GRRs of BB and Bb when ignoring the A marker. These marginal GRRs are, of course, affected by the (interaction) disease model, the GRRs of the interacting pair, the frequencies, etc., but once the marginal GRRs have been calculated, the power of detection can be computed as if no interaction were going on.
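As a rough sketch of what "ignoring the B marker" means: the marginal risk of an A genotype is the average of the two-locus risks over the B genotypes, and the marginal GRR is that average relative to the aa average. To keep it short I assume the two loci are in linkage equilibrium and in Hardy-Weinberg proportions, which is a simplification on my part; the paper explicitly considers linked loci, where the B genotype frequencies would have to be conditioned on the A genotype. The threshold-style interaction model in the example is also just an illustration.

```python
def hwe(p):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of an allele with frequency p."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


def marginal_grr_at_a(grr, p_b):
    """Marginal GRRs for aa, Aa, AA when the B locus is ignored.

    grr[i][j] is the relative risk of carrying i copies of A and j copies
    of B, with grr[0][0] == 1. Assumes independence between the loci.
    """
    freq_b = hwe(p_b)
    # Average risk of each A genotype over the B genotypes.
    risk = [sum(f * g for f, g in zip(freq_b, row)) for row in grr]
    return [r / risk[0] for r in risk]


# Risk is tripled only when at least one A and at least one B is present.
interaction_model = [[1.0, 1.0, 1.0],
                     [1.0, 3.0, 3.0],
                     [1.0, 3.0, 3.0]]
print(marginal_grr_at_a(interaction_model, p_b=0.5))
```

In this toy model the interaction shows up at A as a marginal dominant effect, and from there the power calculation is the usual single-marker one.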

Indirect testing

Typically, we do not have all the variation typed, but rely on tagSNPs to indirectly test for association. This works because the SNPs are correlated (this correlation is called linkage disequilibrium, LD), so the relative risk of one SNP "leaks into" a relative risk of another SNP. How the GRRs of a tagSNP depend on the LD with the causal SNP(s) and on the allele frequencies is not straightforward, but as a rule of thumb there is the following relationship: if a sample size of N is needed to detect association at the causal marker, then a sample size of N / r^2 is needed at the tagSNP, where r^2 is a measure of LD.
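As a worked example of the rule of thumb, the sketch below computes r^2 from haplotype and allele frequencies using the standard definition (D = p_AB - p_A p_B, r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B))) and then scales a sample size by it. The function names and the numbers are made up for illustration.

```python
def r_squared(p_ab, p_a, p_b):
    """LD measure r^2 between two biallelic SNPs.

    p_ab is the frequency of the haplotype carrying both alleles,
    p_a and p_b are the frequencies of the two alleles.
    """
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


def tag_sample_size(n_causal, r2):
    """Rule-of-thumb sample size at a tagSNP: N / r^2."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r^2 must be in (0, 1]")
    return n_causal / r2


r2 = r_squared(p_ab=0.4, p_a=0.5, p_b=0.5)
print(r2, tag_sample_size(1000, r2))
```

So if 1000 individuals suffice at the causal SNP, a tag with r^2 around 0.36 would need close to three times as many, at least as far as the rule of thumb can be trusted.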

Although mathematically justified, it is only a rule of thumb, and it is violated especially in the presence of interaction (where there is potentially LD between the tagSNP and both causal SNPs, to confuse the matter).

A large part of the paper is concerned with this rule of thumb, and in my opinion this is the most interesting part of the paper. We know very little about how we perform in tagging for interaction, since essentially all tagging algorithms are based on the r^2 rule of thumb.

Not really about interaction

Since they define "detection of association" to be detection of a marginal association, we are not really considering the power of detecting interaction. For the direct testing (when we are not considering tagSNPs), the interaction doesn't really come into play at all! The interaction model determines the marginal GRR, and as such it is interesting enough, but once we have the marginal GRR, there is nothing new in how we determine the power. The greater the GRR, the greater the power, but that is completely independent of whether there is interaction or not.

For the tagging consideration it is a different matter. There the interaction has an effect, as I mentioned above, because both causal SNPs can be in LD with the tag, and that affects the r^2 rule of thumb.

Still, the paper is about the power of detecting marginal association, not interaction, and it is possible (and not even that hard) to construct models where there is a strong association through the interaction but very little marginal effect. For such a setup, a marginal test will never be powerful, and a full interaction model must be used.
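One such construction, purely as an illustration of my own and not a model from the paper, is a "checkerboard" of risks. Assuming Hardy-Weinberg proportions, linkage equilibrium between the two loci and allele frequencies of 0.5 at both, the two-locus risks below differ by a factor of three between neighbouring genotypes, yet every marginal GRR comes out as exactly 1.

```python
def hwe(p):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of an allele with frequency p."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


def marginal_grrs(grr, p_a, p_b):
    """Marginal GRRs at both loci; grr[i][j] is for i copies of A, j copies of B."""
    freq_a, freq_b = hwe(p_a), hwe(p_b)
    # Average over the other locus, assuming the loci are independent.
    risk_a = [sum(freq_b[j] * grr[i][j] for j in range(3)) for i in range(3)]
    risk_b = [sum(freq_a[i] * grr[i][j] for i in range(3)) for j in range(3)]
    return ([r / risk_a[0] for r in risk_a],
            [r / risk_b[0] for r in risk_b])


# Rows: aa, Aa, AA. Columns: bb, Bb, BB.
checkerboard = [[1.0, 3.0, 1.0],
                [3.0, 1.0, 3.0],
                [1.0, 3.0, 1.0]]
print(marginal_grrs(checkerboard, p_a=0.5, p_b=0.5))
```

The cancellation is exact only for these particular frequencies; with other allele frequencies a small marginal effect reappears, but it stays much weaker than the two-locus effect.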

It is the latter problem we are currently working on in my group: How do we find pairs that interact but have little marginal effect (we have just submitted a paper on that)? What is the power to detect such interaction? And how well do we tag such interaction?


Pickrell, J., Clerget-Darpoux, F., Bourgain, C. (2007). Power of Genome-Wide Association Studies in the Presence of Interacting Loci. Genetic Epidemiology, 31(7), 748-762.