Preparing my talk

Today I’m working on the talk I’m giving next week.  I was asked to talk about association mapping, which I should be able to do since it has been my main research area for a couple of years, and to focus mainly on my own research.

The latter is a bit of a problem for me, actually.

Although my main research grant is for association mapping, I haven’t actually been doing much work on it for the last year or so.  I got caught up in our work on coalescent HMMs, and that has taken up most of my time, so all my association mapping papers describe results that are at least a year or two old, and I feel they are kind of dated by now.

I’ll get around some of it by having a large introduction to the field.  The basics haven’t changed much over the last couple of years, so that should be ok.

For that part, I think the main points are statistical, having to do with multiple testing correction, power, and dealing with the empirical null model.  See some of my previous posts on that.

For the last part – my own research – I really only have two things to talk about: local genealogies and gene-gene interaction.  Those are the only topics where I have developed methods worth talking about, rather than just applied existing ones.

Local genealogies

We’ve done some work in my group on haplotype (multi-marker) methods, where we infer local genealogies along the genome to extract more information about local association with a phenotype than we could get from analysing each marker independently.

This is not a new idea, really.  There have been plenty of methods built on it, but most of them rely on statistical sampling and are very time consuming, and therefore not all that useful for genome wide analysis.

What we did was take a very crude approach to inferring local trees along the genome – using the “perfect phylogeny” method – and then score each tree according to the clustering of cases and controls.

By taking this very simple approach, we get an efficient method that can scan a genome wide dataset of thousands of individuals in a couple of hours (compared to ~10 markers in ~100 individuals in a week, as was the case with the first method I worked on).

So it is a quick and dirty method compared to the more sophisticated sampling approaches – with emphasis on quick.
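
To give a feel for the shape of the computation, here is a minimal Python sketch.  It is not our actual implementation, and it cheats on the hard part: instead of building a proper perfect phylogeny for each window, it simply clusters identical haplotypes within the window (the leaves a local tree would group together) and scores the clustering with a chi-square statistic over case/control counts.

```python
def window_score(haplotypes, is_case, start, width):
    """Score one window by how strongly its haplotype clusters
    separate cases from controls (a plain chi-square statistic)."""
    clusters = {}
    for hap, case in zip(haplotypes, is_case):
        key = tuple(hap[start:start + width])  # haplotype within the window
        clusters.setdefault(key, []).append(case)
    case_frac = sum(is_case) / len(is_case)
    score = 0.0
    for members in clusters.values():
        m = len(members)
        exp_case, exp_ctrl = case_frac * m, (1 - case_frac) * m
        obs_case = sum(members)
        if exp_case > 0 and exp_ctrl > 0:
            score += (obs_case - exp_case) ** 2 / exp_case
            score += ((m - obs_case) - exp_ctrl) ** 2 / exp_ctrl
    return score

def scan_genome(haplotypes, is_case, width=10):
    """Slide a window along the markers and score every position."""
    n_markers = len(haplotypes[0])
    return [window_score(haplotypes, is_case, start, width)
            for start in range(n_markers - width + 1)]

# Toy input: four 0/1 haplotypes, the first two from cases.
print(scan_genome([[0, 0, 1, 1], [0, 0, 1, 1],
                   [1, 1, 0, 0], [1, 1, 0, 1]], [1, 1, 0, 0], width=2))
```

The published method replaces the clustering step with trees built over compatible regions, but the scan-and-score structure is roughly the same, and the cheap per-window cost is what keeps the whole scan fast.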

It also appears to do okay when it comes to finding disease markers.  When we compared it, in the first paper, to other methods of similar speed, we usually performed better or just as well.  More importantly, we could find markers of lower frequency than we could by testing each tag marker individually.  This is especially interesting since low-frequency disease markers are very hard to find with the single-marker approach.

You can read about the method in these papers:

Whole genome association mapping by incompatibilities and local perfect phylogenies. Mailund, Besenbacher and Schierup. BMC Bioinformatics 2006, 7:454.
Efficient whole-genome association mapping using local phylogenies for unphased genotype data. Ding, Mailund and Song. Bioinformatics 2008, 24(19):2215-2221.

Gene-gene interaction

The second method concerns epistasis, or gene-gene interaction.

When analysing a genome wide data set, we usually only consider each marker alone, but we would expect some gene-gene interaction to be behind the phenotype we analyse.  We know that genes interact in various ways, and it seems unlikely that the only way they affect disease risk is by marginal effects.

The problem with searching for interactions is the combinatorial explosion.  With 500k SNP chips, we get around 125 billion ($$1.25\cdot 10^{11}$$) pairs and $$2\cdot 10^{16}$$ triples of SNPs.  In general, for combinations of $$k$$ out of $$n$$ SNPs we get $$\binom{n}{k}$$ tests.  While it may be computationally feasible to test models for small $$k$$, the multiple testing correction is definitely going to kill any hope of finding anything.
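
If you want to check the numbers, Python’s exact binomial coefficients will do it:

```python
# Sanity-checking the combinatorics with exact binomial coefficients.
from math import comb

n = 500_000                 # markers on a 500k SNP chip
print(comb(n, 2))           # 124999750000, about 1.25e11 pairs
print(f"{comb(n, 3):.2e}")  # about 2.08e16 triples
```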

It is essential to reduce the search space somehow to get anywhere with this.

We published a paper earlier this year about one such approach:

Using biological networks to search for interacting loci in genomewide association studies. Emily, Mailund, Hein, Schauser and Schierup. European Journal of Human Genetics 2009.

The idea here is to exploit our existing knowledge of gene-gene interactions.  We have inferred networks of interactions from systems biology, so we have a good idea about which genes actually interact.  Probably not all of them, and we don’t know that these known interactions are the only way genes can interact to cause a disease, but it is a good place to start.

So what we did was simply restrict the markers we looked at to markers from genes known to interact.  That brings the number of interactions to consider down from billions to a few million, and the corrected significance threshold down to something where we actually have the power to detect an effect.
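
In code, the restriction is just a filtered enumeration.  The sketch below is hypothetical (the function and data-structure names are mine, not the published pipeline), but it shows where the reduction comes from:

```python
from itertools import product

def interacting_snp_pairs(network, snps_in_gene):
    """Yield only SNP pairs whose genes are known to interact.

    network:      iterable of (gene_a, gene_b) interaction edges
    snps_in_gene: dict mapping each gene to the SNPs typed in or near it
    """
    for gene_a, gene_b in network:
        for snp_a, snp_b in product(snps_in_gene.get(gene_a, ()),
                                    snps_in_gene.get(gene_b, ())):
            yield snp_a, snp_b

# Toy example: two interaction edges instead of all-against-all.
network = [("BRCA1", "BARD1"), ("TP53", "MDM2")]
snps = {"BRCA1": ["rs1", "rs2"], "BARD1": ["rs3"],
        "TP53": ["rs4"], "MDM2": ["rs5"]}
pairs = list(interacting_snp_pairs(network, snps))
print(len(pairs), pairs)  # 3 pairs; Bonferroni divides by 3,
                          # not by comb(5, 2) = 10
```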


Are women getting more beautiful?

The story is all over the net these days (see e.g. here), but are women really getting more beautiful?

Walking through town in the summertime I wouldn’t necessarily say no.  They do look beautiful in their summer dresses and miniskirts, but still… the genetics would have to be a bit special for it to be true, and the statistics don’t really support it.

My bet would be on a classical multiple testing problem, as described here (PDF).  I’m not saying that multiple testing is all there is to it, but I would like to have that ruled out before I believe the story…
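
To see how far multiple testing alone can take you, here is a toy simulation of my own (not from the linked PDF): under a true null hypothesis, p-values are uniform, so screening many hypotheses at $$\alpha=0.05$$ will flag about 5% of them no matter what.

```python
import random

random.seed(1)
n_tests, alpha = 1000, 0.05

# Under the null, p-values are uniform on [0, 1].
p_values = [random.random() for _ in range(n_tests)]
hits = sum(p < alpha for p in p_values)
print(f"{hits} 'significant' results out of {n_tests} true nulls")  # around 50
```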


Next week in Copenhagen

Got this by mail today:

Seminar Series on Human Population Genetics

As part of the PhD summer course in Human Population Genetics Analyses from the 3rd of August to the 7th of August 2009 at the University of Copenhagen, the Department of Biology will host a seminar series with distinguished Danish and international researchers in the field of Human Population Genetics.  The lectures are open to the public.

Monday the 3rd of August 2009
Montgomery Slatkin
Department of Integrative Biology, UC-Berkeley
Population genetics of the Neanderthal genome project

Tuesday the 4th of August 2009
TBA

Wednesday the 5th of August 2009
Thomas Mailund
Bioinformatics Research Center, University of Aarhus
Open problems in association mapping

Thursday the 6th of August 2009
Anders Albrechtsen
Department of Biostatistics, University of Copenhagen
New methods for modeling large scale human genetic variation data

Friday the 7th of August 2009
Andrew G. Clark
Department of Development and Genetics, Cornell University
Population genetic attributes of rare alleles – a deep resequencing study

All lectures will be held at 4:15 PM in room 1.2.03 at the Biocenter, Ole Maaloes Vej 5, 2200 Copenhagen N, except for the lecture on Thursday, which will be held at 4:00 PM in the Chr. Hansen Auditorium at CSS, University of Copenhagen, Øster Farimagsgade 5, building 34.

If you are in Copenhagen next week, I recommend you go to some of these. I bet they will be quite interesting.

I will certainly be there Wednesday, for obvious reasons, but I’ll probably go Monday as well.  I would prefer to be there all week, but it might be a bit late for finding a place to stay and such, unless I can find a place to crash…


What will processors look like in 2020?

Gene Frantz asks this question at embedded.com:

I have challenged several of our senior technologists to think about what the state of the art will be in the year 2020. You might say that we need to have 20/20 vision for the year 2020. I have invited a number of technologists to provide their point of view (POV) of what the state of the art in IC technology will be in the year 2020, and I’m interested to hear what you have to say on the topic. But, since this is my blog, I will have the first and last word on what the year 2020 will hold for us.

My guess, which most people will probably agree with, is that 1) clock rate will not be much different from today, 2) the memory architecture (levels of cache, RAM, disk…) will still have orders of magnitude differences in access time, and 3) we are going to see parallelisation – and multiple cores – in a big way.

This means that the (computer science) theoretical RAM model is going to be increasingly bad at modeling real computers.  Access time is not constant and execution is not sequential.
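
A quick experiment of my own makes the access-time point concrete (assuming numpy is available).  The two sums below touch the same number of elements, so the RAM model prices them identically; on real hardware the strided version wastes most of every cache line it pulls in and comes out noticeably slower:

```python
import time
import numpy as np

a = np.arange(64_000_000, dtype=np.int64)  # ~512 MB of data

t0 = time.perf_counter()
a[:4_000_000].sum()   # 4M contiguous elements: full cache lines used
t1 = time.perf_counter()
a[::16].sum()         # 4M elements spaced 128 bytes apart: mostly cache misses
t2 = time.perf_counter()

print(f"contiguous: {t1 - t0:.4f}s, strided: {t2 - t1:.4f}s")
```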

The PRAM model will probably be pretty good at dealing with multiple cores (though it isn’t really that good for modeling distributed computing).

I’m not sure which models there are for dealing with memory hierarchies.  I know there are some, but there were no classes on this when I studied, and I haven’t kept up with it… I know there are cache-oblivious algorithms – I have friends at the CS department who work on this – but I don’t really know much about them.  I should probably start worrying about it before 2020…


Last week in the blogs

The last two weeks I’ve been busy writing a grant proposal, so I haven’t had much time to read (much less write) blogs, but here’s a list of the posts that I did have time to read and enjoy…

Biology

Human ancestry

Research Life

Software

Space exploration
