Posts Tagged ‘Paper reviews’

Statistical power and interacting genes

Sunday, March 23rd, 2008

ResearchBlogging.orgEarlier this week we discussed the paper below in our association mapping journal club. Lately we have been interested in epistasis (gene-gene interaction) in the context of association mapping — we have just submitted a paper on the subject and have a few projects in the pipeline working on this problem — and one problem that concerns us is the power of detecting gene-gene interaction in association mapping. This paper turned out not to really be about that, but it was interesting nonetheless.

Anyway, back to the paper:

Power of genome-wide association studies in the presence of interacting loci
Joseph Pickrell, Françoise Clerget-Darpoux, Catherine Bourgain
Genetic Epidemiology 31(7) 748 – 762

Abstract

Though multiple interacting loci are likely involved in the etiology of complex diseases, early genome-wide association studies (GWAS) have depended on the detection of the marginal effects of each locus. Here, we evaluate the power of GWAS in the presence of two linked and potentially associated causal loci for several models of interaction between them and find that interacting loci may give rise to marginal relative risks that are not generally considered in a one-locus model. To derive power under realistic situations, we use empirical data generated by the HapMap ENCODE project for both allele frequencies and linkage disequilibrium (LD) structure. The power is also evaluated in situations where the causal single nucleotide polymorphisms (SNPs) may not be genotyped, but rather detected by proxy using a SNP in LD. A common simplification for such power computations assumes that the sample size necessary to detect the effect at the tSNP is the sample size necessary to detect the causal locus directly divided by the LD measure r2 between the two. This assumption, which we call the proportionality assumption, is a simplification of the many factors that contribute to the strength of association at a marker, and has recently been criticized as unreasonable (Terwilliger and Hiekkalinna [2006] Eur J Hum Genet 14(4):426-437), in particular in the presence of interacting and associated loci. We find that this assumption does not introduce much error in single locus models of disease, but may do so in so in certain two-locus models.

The problem considered in the paper is the following: If we are searching for gene-disease association and the disease risk depends on an interaction between two variants, will we be able to detect it? I’m simplifying a bit here, but that is the essential question.

Testing single markers

The typical approach for finding genes that affect the disease risk, when analysing the entire genome in any case, is to go through each typed variant and test if the cases and controls have different distributions of genotype frequencies. I’ve described this in a bit more detail in an earler post, so I won’t say much more on that here.

The power to detect an association when it is there, depend on several parameters, such as the allele frequencies, the sample size, and of course the strength of the effect the genotype has on the disease risk, typically measured by the genetic relative risk GRR. For a binary marker (what we typically consider), we can consider the risk of allele aa the “basic” risk (GRRaa=1) and talk about the relative risk of Aa and AA, GRRAa and GRRAA. Different “disease models” put constraint on these, e.g. a dominant model would have GRRaa=GRRAa=1 != GRRAA, but in general there are two risks that can vary in relation to the basic risk.

Gene-gene interaction

Now, if the disease risk depends on several markers, you can have various kinds of interaction. For two markers, you now have nine genotypes, {aa,Aa,AA}x{bb,Bb,BB}, with eight GRRs that can vary in relation to GRRaabb. Again, various “classical” disease models can put constraints on the GRRs.

The problem they consider in the paper is such a pair-wise interaction setup (with four different disease models), and how the power of detecting an association depends on the GRRs, disease model, allele frequencies, etc.

Detecting an association, here, means detecting an association at A or B (or both), but not detecting the right disease model, or detecting that there is really an interaction going on; it is still considered a “hit” if only one of the two markers is found to be associated with the disease. I’ll get back to that below.

The way thay go about this is to calculate the marginal GRRs, i.e. the relative risks of AA and Aa when ignoring the B marker, and the GRRs of BB and Bb when ignoring the A marker. These marginal GRRs are, of course, affected by the (interaction) disease model, GRRs of the interacting pair, frequencies, etc, but once the marginal GRRs have been calculated, the power of detection can be computed as if no interaction was going on.

Indirect testing

Typically, we do not have all the variation typed, but rely on tagSNPs to indirectly test for association. The way this works is that the SNPs are correlated (this correlation is called linkage disequilibrium, LD) so the relative risk of one SNP “leaks into” a relative risk of another SNP. The GRRs of a tagSNP depend on the LD with the causal SNP(s) and the allele frequencies and is not straight forward, but as a rule of thumb there is the following relationship: if a sample size of N is needed to detect association at the causal marker, then a sample size of N / r2 is needed at the tagSNP, where r2 is a measure of LD.

Although mathematically justified, it is only a rule of thumb, and it is violated especially in the presence of interaction (where there is potentially LD between the tagSNP and both causal SNPs, to confuse the matter).

A large part of the paper is concerned with this rule of thumb, and in my opinion this is the most interesting part of the paper. We know very little about how we perform in tagging for interaction, since essentially all tagging algorithms are based on the r2 rule of thumb.

Not really about interaction

Since they define “detection of association” to be detection of a marginal association, we are not really considering power of detecting association. For the direct testing (when we are not considering tagSNPs), the interaction doesn’t really come into play at all! The interaction model determines the marginal GRR, and as such it is interesting enough, but once we have the marginal GRR, there is nothing new in how we determine the power. The greater the GRR, the greater the power, but that is completely independent of interaction or not.

For the tagging consideration it is a different matter.  There the interaction has an effect, as I mentioned above, because both causal SNPs can be in LD with the tag, and that affects the r2 rule of thumb.

Still, the paper is about the power of detecting marginal association, not interaction, and it is possible (and not even that hard) to construct models where there is a strong interaction association but very little marginal effect.  For such a setup, a marginal test will never be powerful, and a full interaction model must be used.

It is the latter problem we are currently working on in my group.  How do we find pairs that interact but have little marginal effect? (we have just submitted a paper on that), what is the power to detect such interaction? and how well do we tag such interaction?


Pickrell, J., Clerget-Darpoux, F., Bourgain, C. (2007). Power of Genome-Wide Association Studies in the Presence of Interacting Loci. Genetic Epidemiology, 31, 748-762.

Computing the all-pairs quartet distance on a set of evolutionary trees

Friday, March 14th, 2008

ResearchBlogging.orgToday we published a journal version of a computer science paper from APBC 2006 (in a special issue volume of the best papers selected from the conference). We (the same group of authors) had two papers at the conference, both related to calculating something called the quartet distance between trees.

The paper selected for the journal was:

Computing the All-Pairs Quartet Distance on a Set of Evolutionary Trees [pdf of preprint]

M. Stissing, T. Mailund, C.N.S. Pedersen, G.S. Brodal, and R. Fagerberg

Journal of Bioinformatics and Computational Biology 6(1) 37–50 2008.

Abstract

We present two algorithms for calculating the quartet distance between all pairs of trees in a set of binary evolutionary trees on a common set of species. The algorithms exploit common substructure among the trees to speed up the pairwise distance calculations thus performing significantly better on large sets of trees compared to performing distinct pairwise distance calculations, as we illustrate experimentally, where we see a speedup factor of around 130 in the best case.

The other paper was

Computing the Quartet Distance between Evolutionary Trees of Bounded Degree

M. Stissing, C.N.S. Pedersen, T. Mailund, G.S. Brodal, and R. Fagerberg
Proceedings of the Asia-Pacific Bioinformatics Conference (APBC) 2007, pp. 101–110, Series on Advances in Bioinformatics and Computational Biology, vol. 5, Imperial College Press.

Abstract

We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded degree on a common set of n species. The previous best algorithm has running time O(d2n2) when considering trees, where no node is of more than degree d. The algorithm developed herein has running time O(d9n log n) which makes it the first algorithm for computing the quartet distance between non-binary trees which has a sub-quadratic worst case running time.

a much more technical paper that I won’t describe in details here, but you can see my presentation from the conference (of both papers) here:

What is the quartet distance?

The quartet distance is a way of measuring the similarity (or distance) between two trees. There are various different measures, with different properties, and this is one of them.

All possible quartet topologies.The idea is that whenever you select a subset of the leaves of the tree, a subset of four leaves (a quartet) you can consider the tree you would get if you removed every other leaf from the tree. The topology of this sub-tree is the quartet topology, and the four possible quartet topologies are shown on the left. For two trees, and a given quartet, you can consider the quartet topologies in the two trees and compare them to see if they are the same or if they are different. If you do this for all quartets, and count the number of topologies that differ, what you get is the quartet distance between the two trees.

Calculating the quartet distance

There are n choose four quartets in a tree with n leaves, so if you enumerate them all from both trees and (somehow) figure out their topology in constant time, you can sort the list (using radix sort) and compare them in time proportional to the number of quartets, all in all giving you a O(n4) algorithm. Much too slow for any reasonably large n.

If we only consider binary trees, we can do better. (The case of non-binary trees is a bit tricky, but we have a couple of papers about that problem that I might come back to in a later post).

Table countingThe trick is to build a table with an entry for each pair of sub-trees, one from either of the two input trees. The table will have n2 entries (there are n sub-trees in each of the two input trees) and will contain the number of leaves shared by the two sub-trees. Using a dynamic programming algorithm (and using that the trees are binary) it possible to fill out this table in time O(n2) time, and from this table we can count the number of shared quartets between the two input trees (see the expression in the figure on the left). We can then get the quartet distance by subtracting this number from the total number of quartets — n choose four. Viola, an O(n2) algorithm.

Comparison of the two algorithmsUsing a colouring scheme, a lots of tricks and clever data structures, Brodal et al. improved on this to get a O(n log n) algorithm, but there is a large overhead to each comparison in that, so the simpler algorithm is usually faster. Both algorithms are implemented in my software QDist, and a comparison of them, for growing n can be seen on the left. (Actually, the implementation in QDist runs in O(n log2 n), but the extra factor of log n doesn’t pay off in any way in the real running time).

As you can see on the plot, you need n above 2000 to get any benefit of the more complex algorithm.

Calculating the quartet distance between all pairs of a set of trees

Now, if you have a set of k trees and you want to know the quartet distance between each pair, you would need O(k2n2) using the O(n2) algorithm.

This, however, ignores that you will know something about the trees you are comparing when you have seen them in comparisons with other trees from the set. Whenever they share sub-trees, the quartets from those sub-trees will always be shared, so running through the trees and counting them again is really a waste of effort.

To exploit this, all you need is a data structure that captures shared trees.

The trick we used was to root the trees (in an arbitrary leaf) after which we could merge them into a directed acyclic graph (DAG). It was then a small matter of modifying the two algorithms mentioned above to work on the DAG instead of trees.

The speedup depends a lot on how much the DAG compresses the original trees, but for trees that are reasonably similar (the case where calculating distances between them is most interesting), the compression turns out to be pretty good, and we could get very significant improvements. Improvements of more than 100 in running time, for the cases where we ran our experiments.


Stissing, M., Mailund, T., Pedersen, C.N., Brodal, G.S., Fagerberg, R. (2008). Computing the all-pairs quartet distance on a set of evolutionary trees. Journal of Bioinformatics and Computational Biology, 6(1), 37-50.

Associate editor of BMC Research Notes

Tuesday, March 11th, 2008

I’ve just become associate editor of a new journal: BMC Research Notes. See the list of editors here. I haven’t received the first paper yet, but I look forward to doing this.

More on adapting to climate

Sunday, March 9th, 2008

In a previous post I mentioned this review of a recent PLoS Genetics paper. I still haven’t read the actual paper, I am ashamed to admit, but we will read it for a journal club at BiRC next week. Anyway, this morning I read another interesting review, again at Genetic Future: Climate genes: positive or balancing selection?

I should probably bring it to our journal club.

The points in this review is that a linear relationship between climate and gene frequency would only be linear if we ignored the history of the human diaspora out of Africa. The time where selection has affected the genes varies a lot from South East Asia, where humans got to early, to South America, where humans got to late.

Would this type of selection actually result in a neat linear trend, like that seen for the RAPTOR gene? Well, it might, if the timing was just right, but it’s by no means a necessary outcome. There are at least three variables in play here, each of which will have some effect on the current frequency of a positively selected allele: the strength of selection, the starting frequency of the allele in that population, and the amount of time the population has existed in its current environment. For positive selection to result in a clean linear correlation between allele frequency and a climate variable, the latter two factors would have to have had a negligible impact, so that most of the variation is determined by selection intensity.

I think that’s pretty unlikely given what we know about human population history: native Americans, for instance, are the descendants of a cold-adapted population living in Siberia that only relatively recently moved down into the warmer climates of central America; selection has not yet had much time to act in these populations. In contrast, humans in Southern Asia have been in their current climate much longer, giving selection more time to do its work. Thus for variants under positive selection, current frequency will be substantially affected by historical contingencies, and the correlation between allele frequency and selective strength will be rough at best.

There’s also a reference to Voight et al. 2007, a paper I reviewed a few weeks back, on signals of selection in humans, based on extended haplotypes around genes under positive selection. Apparently, these signals are missing for the climate genes.

There is an alternative to positive selection, that could also explain the association between climate and gene frequency:

You’ve probably already guessed my hypothesis: at least some of the genes pulled out from this study (and probably the ones with the tightest correlations) have been the targets of balancing selection. Balancing selection could be acting on climate genes in different ways, but in my mind the most likely mechanism is via heterozygote advantage.

This model would result in each population reaching a stable allele frequency that is correlated with the local temperature, regardless of its starting frequency and how long the population had been subjected to that particular environment – so long as there has been enough time for the population to reach equilibrium. This scenario is much more likely to result in a linear correlation between allele frequency and climate variables than a simple positive selection model.

Now, what I am wondering is, how strong does the selection have to be for the allele frequency to reach equilibrium in the late arrival populations (say South Americans), and what kind of signals would we look for in the genome to test if balancing, rather than positive, selection is going on?

I really don’t know — I am too new to genetics to even have a clue — but I bet that this is old stuff in the genetics literature. I’ll have to ask around…

Uncertainty in inferred alignments

Monday, March 3rd, 2008

ResearchBlogging.org
Here’s yet another paper addressing the uncertainty in inferred alignments that is typically ignored when doing comparative genomics. For two others, see my reviews: Alignment bias in genomics and Probabillistic whole-genome alignments reveal high indel rates in the human and mouse genomes.

Uncertainty in homology inferences: Assessing and improving genomic sequence alignment

Lunter et al.

Genome Res. 18:298-309, 2008

Abstract

Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.

The paper itself actually has a funny story that I was witness when I worked in Oxford, but I’ll keep that story out of here. Those who know it will nod — or shake their head, as the case might be — and those who do not probably should hear it from the authors rather than me ;-)

The problem with alignments

Most comparative genomic analysis rely on having an alignment between the genomes being compared. The problem is that we never have such an alignment, but need to infer it. For highly similar sequences, this is not much of a problem, but for even relatively closely related species — such as men and mice — we only have really closely related sequences for conserved bits of the genome. We are relatively good at aligning genes, but to really analyse genomes, we cannot rely on only the genes. Especially when we want to infer for example divergence times, where we want to look at neutrally evolving sites and where genes will give us a biased sample as there is usually selection against changes there.

If we align genomes anyway, we need to take into account the uncertainty there is in the alignment, but we typically don’t! Once we have inferred an alignment, we treat it as absolute truth. With any other parameter we infer we are expected to report the uncertainty of the estimate together with our estimate, but for alignments we do not.

Probably because this is a lot more difficult to do, but still, completely ignoring the problem just because it is difficult is probably not the way to go.

It might not be such a big problem if the errors in alignment were unbiased, and we based our further inference on large alignments (and thus a large number of alignment columns), but it seems like there is a certain bias in most alignment algorithms.

The source of this bias should be found in the the approach underlying most (if not all) alignment algorithms: optimising some alignment score (or minimising some alignment penalty). Searching for an “optimal” alignment typically means finding an alignment with as few changes as possible — with varying definitions of “few changes” — and this strategy will tend to infer alignments with fewer indels than in the true alignment.

Alignment biasesLunter et al. considers the case of pair-wise alignment, and identifies the typical alignment biases (essentially the same biases identified in Lunter 2007). These are shown on the left, where the left-hand side shows the true alignment and the right-hand side the alignment that will typically be inferred. In the two top-most cases, the inferred alignment places the indels incorrectly because (A) moving the indel aligns columns with a more consertation, or (B) two independent indels can be replaced by a single longer indel. In the two other cases, the indels are misplaced because the resulting alignment this way introduces fewer gaps.

Results of alignment biasThere are two expected consequences of these biases: Alignment accuracy decreases close to indels, and indels tend to be merged if near to each other. At the same time, the proportion of identity (columns with no substitutions) increases near indels. In a simulation study, Lunter et al. demonstrates that this is indeed the case. The figure on the left shows the accuracy and proportional sequence identity as a function of the distance to the nearest gap (A) and the distribution of inter-gap lengths (B).

Fixing the problem

Using statistical alignment methods (an application of hidden Markov models), it is possible to capture not only the optimal alignment — the maximum likelihood alignment, in this case — but also the uncertainty in the inferred alignment. Using a technique called “posterior decoding” it is possible to assign the probability that a given alignment column is correct to the individual columns. This way, problematic areas of an alignment can be identified.

Not only can posterior decoding annotate an existing alignment, posterior decoding can also tell us the probability that any particular pair of nucleotides should be aligned, implicitly considering the set of all possible alignments where that particular pair is considered homologue. It is possible to construct an alignment from this information, by selecting the alignment that maximises the product of the probabilities assigned to each column in the alignment. This approach differs from the maximum likelihood alignment by not considering the transition probabilities in the underlying hidden Markov model, but can produce better alignments, in the sense that they closer match the true alignment.

Lunter et al. expands on this idea by changing the posterior probability for aligning nucleotides to gaps. Instead of weighting a column with the probability that a given nucleotide matches a particular gap, they weight it with the probability that it matches any gaps. The alignment is then constructed the same way as the posterior decoding algorithm.

The intuition is that around gaps, any posterior is low (compared to nucleotides well away from gaps), but by re-weighting this way, a nucleotide is more likely to align up against a gap when it really should align to a gap.

Comparison of Viterby (maximum likelihood), posterior decoding, and marginal posterior decodingThey then show that this change improves the alignment by both increasing the sensitivity (S) — the ratio of correctly alignment columns to all homologous colums — and reducing the false-positive fraction (FPF) — wrongly aligned nucleotides over non-gapped column — and reducing the non-homologous fraction — the fraction of aligned columns that are not truly homologous. The figure on the left compares the maximum likelihood alignment (calculated by the Viterbi algorithm), with the posterior decoding algorithm and their new marginal posterior decoding algorithm.

Comparison with other algorithms.They also compare with other popular alignment algorithms and show improvements, especially measured by sensitivity and non-homologous fraction (figure on the left). The figure is slightly misleading, since the statistical model used in the Viterbi algorithm is simpler than the one in the marginal posterior decoding, but the paper shows that the real gain in accuracy is due to the algorithm and not to the underlying model:

We found that more accurate modelling resulted in only very marginal improvements of the alignment accuracy. Indeed, in our simulation study of sequences at human–mouse divergence, the modeling of indel lengths using a mixed geometric distribution resulted in the single largest improvement in sensitivity, from 85.3% to 85.6% using Viterbi decoding, and from 87.8% to 88.2% using MPD. The geometric mixture model helps to align sequences across large indels, which are relatively infrequent, explaining the relatively modest improvement. Modeling the variation in GC content reduces the false-positive fraction (from 15.2% to 13.6% using MPD), but has little effect on sensitivity. Surprisingly, accurate modeling of indel and substitution rate variation has little, if any, effect. This robustness to misparameterization is supported by our simulations under the Jukes–Cantor model, where substantial variations in the rate parameters resulted in very little difference.

So what?

What is the consequence of the biases introduced by trusting incorrect alignments?

It is not completely obvious to me.

If we move indels around to achieve higher sequence similarity, we end up underestimating the number of substitutions, of course, which means we will tend to underestimate divergence time. The effect depends, of course, on the number of indels between the sequences, since the bias only shows close to indels and if the alignment mainly consists of nucleotides well away from gaps. This means closely related sequences, though.

Improving the inferred alignment, using methods as those introduced here, is a help, of course, but we are still in the situation where we infer an alignment and then treat it as “truth” in the further analysis.

It seems to me that we would be better off carrying the uncertainty over to the further analysis, either by incorporating parameter estimation and such in the statistical alignment algorithms, or by weighing alignment columns by their posterior probability in the further analysis.

Details left to the reader, of course ;-)


Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A., Hein, J. (2008). Uncertainty in homology inferences: Assessing and improving genomic sequence alignment. Genome Research, 18(2), 298-309. DOI: 10.1101/gr.6725608