Today I’m working on the talk I’m giving next week. I was asked to talk about association mapping, which I should be able to do since it has been my main research area for a couple of years, and mainly about my own research.
The latter is a bit of a problem for me, actually.
Although my main research grant is for association mapping, I haven’t actually been doing much work on it for the last year or so. I got caught up in our work on coalescent HMMs and that has taken up most of my time, so all my association mapping papers report results that are at least a year or two old, and I feel they are kind of dated by now.
I’ll get around some of it by having a large introduction to the field. The basics haven’t changed much in the last couple of years, so that should be ok.
For that part, I think the main points are statistical, having to do with multiple testing correction, power and dealing with the empirical null model. See some of my previous posts on that:
For the last part – my own research – I really only have two things to talk about. Local genealogies and gene-gene interaction. Those are the only topics where I have developed some methods worth talking about, rather than just applied them.
Local genealogies
We’ve done some work in my group on haplotype (multi-marker) methods, where we try to infer local genealogies along the genome to try to get more information about local association with a phenotype out of the data than we could get from just analysing each marker independently.
This is not a new idea, really. There have been plenty of methods with this idea, but most of them are based on statistical sampling and are very time consuming, and therefore not all that useful for genome wide analysis.
What we did was take a very crude approach to inferring local trees – using the “perfect phylogeny” method – along the genome and then score each tree according to the clustering of cases and controls.
By taking this very simple approach, we get an efficient method that can scan a genome wide dataset of thousands of individuals in a couple of hours (compared to ~10 markers in ~100 individuals in a week, as was the case with the first method I worked on).
So it is a quick and dirty method compared to the more sophisticated sampling approaches – with emphasis on quick.
It also appears to be doing okay when it comes to finding disease markers. When we, in the first paper, compared it to other methods of similar speed we usually performed better or just as well. More importantly, we could find markers of lower frequency than we could if we only tested each tag marker individually. This is especially interesting since the low frequency disease markers are very hard to find with the single marker approach.
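To give a feel for why the perfect phylogeny approach is cheap, here is a minimal Python sketch of the standard four-gamete test, which decides whether two binary markers can sit on the same tree without recurrent mutation or recombination. It is only an illustration, not the code from the papers: the function names `compatible` and `compatible_region`, the toy haplotypes, and the greedy left-to-right extension are all assumptions made for this example, and the tree scoring step is left out.

```python
def compatible(haplotypes, i, j):
    """Four-gamete test for binary sites i and j.

    The two sites fit a perfect phylogeny iff at most three of the four
    possible gametes (00, 01, 10, 11) are observed.
    """
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4


def compatible_region(haplotypes, start):
    """Greedily extend a region to the right of `start` (hypothetical helper).

    Returns a half-open interval (start, end) in which every pair of sites
    passes the four-gamete test, so one tree can explain the whole region.
    """
    n_sites = len(haplotypes[0])
    end = start + 1
    while end < n_sites and all(
        compatible(haplotypes, k, end) for k in range(start, end)
    ):
        end += 1
    return start, end


# Toy phased haplotypes: one string per chromosome, one character per marker.
haps = ["000", "100", "110", "001", "101"]

print(compatible(haps, 0, 1))       # sites 0 and 1 show only 00, 10, 11
print(compatible(haps, 0, 2))       # sites 0 and 2 show all four gametes
print(compatible_region(haps, 0))   # region stops before site 2
```

Each pairwise test is a single pass over the haplotypes, with no sampling involved.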
You can read about the method in these papers:
Whole genome association mapping by incompatibilities and local perfect phylogenies. Mailund, Besenbacher, and Schierup. BMC Bioinformatics 2006 7(454).
Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genome Data. Ding, Mailund, and Song. Bioinformatics 2008 24(19):2215-2221.
Gene-gene interaction
The second method concerns epistasis, or gene-gene interaction.
When analysing a genome wide data set, we usually only consider each marker alone, but we would expect some gene-gene interaction to be behind the phenotype we analyse. We know that genes interact in various ways, and it seems unlikely that the only way they affect disease risk is by marginal effects.
The problem with searching for interactions is the combinatorial explosion. With 500k SNP chips, we get around 125 billion ($$1.25\cdot 10^{11}$$) pairs and $$2\cdot 10^{16}$$ triples of SNPs. In general, with $$n$$ SNPs there are $$\binom{n}{k}$$ combinations of $$k$$ SNPs. While it may be computationally feasible to test models for small $$k$$, the multiple testing correction is definitely going to kill any hope of finding anything.
It is essential to reduce the search space somehow to get anywhere with this.
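To put numbers on the explosion, the following Python sketch counts the tests for single SNPs, pairs and triples on a 500k chip and shows the per-test significance threshold each would need. The Bonferroni correction and the 0.05 family-wise error rate are assumptions chosen for illustration; other corrections would give different thresholds, but the scale of the problem is the same.

```python
from math import comb

n_snps = 500_000   # markers on a 500k SNP chip
alpha = 0.05       # assumed family-wise error rate


def bonferroni_threshold(n_snps, k, alpha=0.05):
    """Per-test threshold when testing every combination of k SNPs."""
    return alpha / comb(n_snps, k)


for k in (1, 2, 3):
    tests = comb(n_snps, k)
    threshold = bonferroni_threshold(n_snps, k, alpha)
    print(f"k={k}: {tests:.3g} tests, per-test threshold {threshold:.3g}")
```

For pairs the threshold is already on the order of $$10^{-13}$$, far below what a typical case/control study has the power to reach.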
We published a paper earlier this year about one such approach:
The idea here is to exploit our existing knowledge of gene-gene interactions. We have inferred networks of interactions from systems biology, so we have a good idea about which genes actually interact. Probably not all of them, and we don’t know if the only way genes can interact to cause a disease is through these known interactions or anything, but it is a good place to start.
So what we did was simply to restrict the markers we looked at to markers from genes known to interact. That brings the number of interactions to consider down from billions to a few million, and brings the corrected significance threshold to a level we actually have the power to detect.
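The restriction can be sketched in a few lines of Python. This is a toy illustration under assumed inputs, not the pipeline from the paper: `snps_by_gene`, `interactions`, the gene names and the rs identifiers are all made up, and a real analysis would also need a rule for assigning SNPs to genes.

```python
from itertools import combinations, product


def candidate_snp_pairs(snps_by_gene, interactions):
    """Return SNP pairs whose genes are linked in the interaction network."""
    pairs = set()
    for gene_a, gene_b in interactions:
        snps_a = snps_by_gene.get(gene_a, ())
        snps_b = snps_by_gene.get(gene_b, ())
        for a, b in product(snps_a, snps_b):
            if a != b:
                # Store each pair in a canonical order so it is counted once.
                pairs.add((min(a, b), max(a, b)))
    return pairs


# Hypothetical SNP-to-gene mapping and interaction network.
snps_by_gene = {
    "G1": ["rs1", "rs2"],
    "G2": ["rs3"],
    "G3": ["rs4", "rs5"],
}
interactions = [("G1", "G2"), ("G2", "G3")]

restricted = candidate_snp_pairs(snps_by_gene, interactions)
all_snps = sorted({s for snps in snps_by_gene.values() for s in snps})
exhaustive = list(combinations(all_snps, 2))

print(len(restricted), "pairs to test instead of", len(exhaustive))
```

Only the restricted pairs enter the multiple testing correction, which is where the gain in power comes from.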