On non-genic disease polymorphism

Daniel MacArthur discusses genome-wide association studies, which have so far mainly found disease-associated polymorphisms outside of genes.

The claim in question is that the tendency of GWAS to find disease associations outside of protein-coding genes is somehow a problem; but, as p-ter notes, there are perfectly plausible reasons for disease risk variants to be found in non-coding regions.

Indeed, I think most of us working in genomics have seen the proliferation of non-coding hits in GWAS studies as a positive, in that it seems to be teaching us something new and unexpected about the underlying biology of human variation.

There is a problem with polymorphisms outside of genes: we generally have no idea how they function to increase or decrease disease risk.  If we do not know what a given polymorphism means in terms of function, we do not really know where to start figuring it out.

As far as I can see, though, that is the only problem.

If the polymorphism is statistically significantly associated with the disease, and we can replicate this in independent data, then that is what the data are saying.  It might be inconvenient, but tough luck!  No one promised us that this would be easy.

Quoting from Gene Expression:

Their answer to this rhetorical question is that common SNPs (used on current genotyping platforms) are generally nonfunctional. The alternative, the evidence for which I’ll present here, is that our ability to predict functional SNPs is poor. In the phrase “no known function”, the emphasis should be on the word “known”.

GWA studies have been a great success in locating polymorphisms associated with disease that we can actually replicate.

Sure, we are working with very large data sets here, and false positives are a major problem (see e.g. here and here), but this is a problem we can handle.

And sure, GWA only lets us find the common disease/common variant (CD/CV) type of disease associations, and not all diseases will follow this pattern, but given the success of GWA studies so far, I think it is fair to say that there is enough to be found here to make it worthwhile!

Schizophrenia: The Mystery of the Missing Genes

Neuroskeptic has a very interesting post on the (basically failed) search for schizophrenia genes through genome-wide association studies.  GWAS have in general been extremely successful, but studies of schizophrenia just do not find anything.  Even pooling data in meta-analyses doesn’t get you far.

(Hat tip to Gene Expression)


Preparing my talk

Today I’m working on the talk I’m giving next week.  I was asked to talk about association mapping, which I should be able to do since it has been my main research area for a couple of years, and to talk mainly about my own research.

The latter is a bit of a problem for me, actually.

Although my main research grant is for association mapping, I haven’t actually been doing much work on it in the last year or so.  I got caught up in our work on coalescent HMMs, and that has taken up most of my time, so all my association mapping papers present results that are at least a year or two old, and I feel they are kind of dated by now.

I’ll get around some of it by having a large introduction to the field.  The basics haven’t changed much in the last couple of years, so that should be ok.

For that part, I think the main points are statistical, having to do with multiple testing correction, power and dealing with the empirical null model.  See some of my previous posts on that:

For the last part – my own research – I really only have two things to talk about: local genealogies and gene-gene interaction.  Those are the only topics where I have developed methods worth talking about, rather than just applying existing ones.

Local genealogies

We’ve done some work in my group on haplotype (multi-marker) methods, where we infer local genealogies along the genome to extract more information about local association with a phenotype than we could get from analysing each marker independently.

This is not a new idea, really.  There have been plenty of methods based on it, but most rely on statistical sampling and are very time-consuming, and therefore not all that useful for genome-wide analysis.

What we did was take a very crude approach to inferring local trees – using the “perfect phylogeny” method – along the genome, and then score each tree according to the clustering of cases and controls.

By taking this very simple approach, we get an efficient method that can scan a genome-wide dataset of thousands of individuals in a couple of hours (compared to ~10 markers in ~100 individuals in a week, as was the case with the first method I worked on).

So it is a quick and dirty method compared to the more sophisticated sampling approaches – with emphasis on quick.

It also appears to do okay when it comes to finding disease markers.  When we compared it, in the first paper, to other methods of similar speed, it usually performed as well as or better than them.  More importantly, we could find markers of lower frequency than we could by testing each tag marker individually.  This is especially interesting since low-frequency disease markers are very hard to find with the single-marker approach.

You can read about the method in these papers:

Whole genome association mapping by incompatibilities and local perfect phylogenies. Mailund, Besenbacher and Schierup. BMC Bioinformatics 2006, 7:454.
Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genome Data. Ding, Mailund and Song. Bioinformatics 2008, 24(19):2215–2221.

Gene-gene interaction

The second method concerns epistasis, or gene-gene interaction.

When analysing a genome-wide data set, we usually consider each marker alone, but we would expect some gene-gene interactions to be behind the phenotypes we analyse.  We know that genes interact in various ways, and it seems unlikely that the only way they affect disease risk is through marginal effects.

The problem with searching for interactions is the combinatorial explosion.  With 500k SNP chips, we get around 125 billion ($$1.25\cdot 10^{11}$$) pairs and $$2\cdot 10^{16}$$ triples of SNPs.  In general, for sets of $$k$$ SNPs out of $$n$$, we get $$\binom{n}{k}$$ combinations.  While it may be computationally feasible to test models for small $$k$$, the multiple testing correction is definitely going to kill any hope of finding anything.
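To put numbers on this – a quick back-of-the-envelope calculation, not from the paper:

```python
# The combinatorial explosion for a 500k SNP chip, and the
# Bonferroni-corrected significance threshold for all pairwise tests.
from math import comb

n = 500_000
pairs = comb(n, 2)    # ~1.25e11 pairs
triples = comb(n, 3)  # ~2.1e16 triples

# A nominal 5% level, Bonferroni-corrected over all pairs:
alpha_pairs = 0.05 / pairs

print(f"{pairs:.3g} pairs, {triples:.3g} triples")
print(f"per-pair significance threshold: {alpha_pairs:.2g}")
```

A per-test threshold around $$4\cdot 10^{-13}$$ is far beyond what realistic sample sizes give us the power to reach.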

It is essential to reduce the search space somehow to get anywhere with this.

We published a paper earlier this year about one such approach:

Using biological networks to search for interacting loci in genomewide association studies. Emily, Mailund, Hein, Schauser and Schierup. European Journal of Human Genetics 2009.

The idea here is to exploit our existing knowledge of gene-gene interactions.  We have inferred networks of interactions from systems biology, so we have a good idea about which genes actually interact.  Probably not all of them, and we don’t know that these known interactions are the only way genes can interact to cause a disease, but it is a good place to start.

So what we did was simply restrict the markers we looked at to markers from genes known to interact.  That brings the number of interactions to consider down from billions to a few million, and the corrected significance threshold down to a level where we actually have the power to detect something.
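A toy sketch of the restriction step – the gene and SNP names here are made up for illustration, and the real pipeline is of course more involved:

```python
# Only test SNP pairs whose genes are known (from a hypothetical
# interaction network) to interact; all other pairs are skipped.

snps_in_gene = {
    "GENE_A": ["rs1", "rs2"],
    "GENE_B": ["rs3"],
    "GENE_C": ["rs4", "rs5"],
}
# Known gene-gene interactions from the (hypothetical) network.
interacting = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")]

candidate_pairs = [
    (s1, s2)
    for g1, g2 in interacting
    for s1 in snps_in_gene[g1]
    for s2 in snps_in_gene[g2]
]
print(candidate_pairs)

# The Bonferroni correction then becomes 0.05 / len(candidate_pairs)
# instead of 0.05 divided by all ~1.25e11 possible pairs.
```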


False positives and large sample sizes

This is a followup on a post I wrote a few days ago, so you might want to read that before you continue here…

False positives are observations that are really from the null distribution, but have a p-value below the significance threshold, and so are still considered significant.  As I wrote in the earlier post, if you increase the sample size you can increase your power – the probability that an observation from the alternative distribution is significant – but it will not decrease the chance of false positives.

When you increase the sample size, you decrease the sampling variance, so it becomes easier to distinguish between the null distribution and the alternative distribution; but the critical value of the test also shrinks to match the significance level.  If before you used a 5% significance level and would therefore consider 5% of the null observations significant, you will still accept 5% of them as significant.
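A small simulation illustrates the point; the effect size, sample sizes and test (a z-test with known variance) are all made up for illustration:

```python
# Doubling the sample size raises power but leaves the false positive
# rate at the nominal 5%, because the critical value shrinks along
# with the sampling variance.
import random
from statistics import NormalDist

random.seed(0)
z95 = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value for a z-score

def fraction_significant(mu, n, reps=2000):
    """Fraction of z-tests (known sd = 1) rejecting H0: mean = 0."""
    hits = 0
    for _ in range(reps):
        mean = sum(random.gauss(mu, 1) for _ in range(n)) / n
        if abs(mean) * n ** 0.5 > z95:
            hits += 1
    return hits / reps

results = {}
for n in (25, 100):
    fpr = fraction_significant(0.0, n)    # null: stays around 5%
    power = fraction_significant(0.3, n)  # real effect: grows with n
    results[n] = (fpr, power)
    print(n, round(fpr, 3), round(power, 3))
```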

That is how it works in theory.  In practice, there is a problem with false positives and sample size.

It is often the case that the mathematical null distribution is not really capturing the null hypothesis we are interested in.  There is noise in the data that has nothing to do with either null or alternative hypothesis, but noise that moves the data away from the theoretical null distribution we use for our statistical test.

This is by no means a problem for all statistical studies – there are lots of cases where the null distribution does exactly capture what we are testing for – but it does occur often enough that it is worth keeping in mind when you design a study.

Genome wide association mapping

As before, I will use genome wide association mapping as an example.

The goal here is to examine the genome-wide genetic variation in cases and controls for some disease, to identify the markers where there is an association between the genotype and the disease.  The natural null hypothesis is then of course no association, and the alternative some association.

What happens if our sample of cases and controls is not taken from a single, genetically homogeneous population, but sampled from two populations with some genetic difference?

The two populations would have different allele frequencies along the genome – large differences at some markers, small ones at others, but we would expect some difference.  If we are looking at a genetic disease, it is therefore natural to expect the disease frequency to vary between the two populations as well: if some of the at-risk alleles are at higher frequency in one population, then that population’s risk is higher.

If we sample cases and controls at random, one population will be over-represented in the cases and under-represented in the controls, compared to the joint population as a whole.

If we test the genetic markers against the null distribution of no genetic difference between cases and controls, we are testing the wrong hypothesis!  We want to test for association between marker and disease, but since we cannot do that directly, we test whether the allele frequencies differ between the cases and controls.  That is the best we can do, of course, but the problem is that the null distribution does not actually capture our null hypothesis.

With a population structure like this, we really do expect the cases and controls to have different allele frequencies at pretty much all markers.  For the markers that truly are associated with the disease, we expect the frequency differences to be more extreme – otherwise the exercise would be futile – but we do not really expect zero difference at the other markers.
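A toy simulation – with made-up frequencies and sampling fractions, not the WTCCC data – shows how this kind of stratification inflates test statistics at markers with no disease effect at all:

```python
# Two populations whose allele frequencies have drifted apart; cases
# oversample population B, controls oversample population A. Every
# marker is null, yet the chi-square statistics are inflated well
# above the chi-square(1 df) median of ~0.455.
import random
import statistics

random.seed(1)

def sample_alleles(n, freq):
    """Genotypes (0/1/2 alt alleles over two chromosome copies) for n people."""
    return [sum(random.random() < freq for _ in range(2)) for _ in range(n)]

def chi2_allele_test(cases, controls):
    """1-df chi-square on the 2x2 allele-count table."""
    a = sum(cases); b = 2 * len(cases) - a        # case: alt / ref alleles
    c = sum(controls); d = 2 * len(controls) - c  # control: alt / ref alleles
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

stats = []
for marker in range(500):
    fA = random.uniform(0.2, 0.8)
    fB = min(0.95, max(0.05, fA + random.gauss(0, 0.1)))  # drifted frequency
    cases = sample_alleles(700, fB) + sample_alleles(300, fA)
    controls = sample_alleles(300, fB) + sample_alleles(700, fA)
    stats.append(chi2_allele_test(cases, controls))

print(statistics.median(stats))  # well above the null median of ~0.455
```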

The situation I have described might sound a bit artificial.  After all, why not just sample all cases and controls from the same population? Or take the population structure into account when doing the test?

The problem is that we usually do not know that we are sampling from different populations.  There is genetic variation within what we would consider the same population if we sample from different geographic areas.  And since people tend to move around, we would have a problem even if we sampled all individuals from the same small town, because some would have ancestors from different regions, and that alone would give us genetic variation.

Consider the plot below:

The plots show the distribution of test statistics from the WTCCC study.  The gray areas show where we would expect the values to lie if they were sampled under the null distribution, and in general the actual observations are above that.

In this study, all samples are from the UK and a priori from the same population.  Unless a very large fraction of the tested markers are actually associated with the diseases – a very unlikely scenario – the null distribution simply does not match the null hypothesis.

They did actually test for population structure in that study, and clearly demonstrated that it was there, but you will have to read the paper for that.

Population structure is just one source of noise that can show up in genome wide association studies.  There are plenty of others.

The point is not so much that population structure is a problem for association mapping – it is, and it is a big problem – the point is that the obvious null distribution does not match the null hypothesis.

Increasing the sample size

What happens if we increase the sample size?

We make it easier for observations that are not from the null distribution to come out significant.  If the data are a mixture of two distributions – one that shows population differences and one that shows population differences and also disease association – neither of which is the null distribution we test against, then we will see more and more significant results.

Mathematically, this is as it should be.  We are rejecting the null hypothesis when it is not true, and really we should be rejecting the null hypothesis for all the observations, because none of them are actually from the null distribution.

This doesn’t mean that increasing the sample size is a bad idea.  Let me make that absolutely clear.  Increasing the sample size will make it easier to distinguish between markers that are associated with the disease and markers that are not.

A simple illustration is shown below, where I plot two normal distributions – red and blue – against a null distribution – black dashed line.  As the variance decreases, samples from the red and blue densities become easier to distinguish from each other, but both densities also become easier to distinguish from the null distribution.

The problem is just that if you use the theoretical null distribution to pick significant values, you will underestimate the number of false positives – relative to your actual null hypothesis, not the null hypothesis you actually test against – and this error will grow with the sample size.

When you need very large sample sizes to see the true signals in the data – as you do for genome-wide association tests – this becomes a real problem.  Even if you correct for the large number of tests – and thereby control the number of false positives you expect to see – you will probably still see lots of false positives; many more than you would expect if the false positives were really sampled from the null distribution you use in your tests.
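One common diagnostic for this kind of overdispersion is the genomic control inflation factor $$\lambda$$: the median observed 1-df chi-square statistic divided by the median expected under the null (~0.455).  A minimal sketch with simulated statistics:

```python
# Lambda close to 1 means the bulk of the markers behave like the
# theoretical null; lambda well above 1 means they are overdispersed,
# as with population structure.
import random
import statistics

random.seed(2)

NULL_MEDIAN = 0.4549  # median of a chi-square with 1 degree of freedom

def lambda_gc(chi2_stats):
    return statistics.median(chi2_stats) / NULL_MEDIAN

# Well-behaved data: z-scores from N(0, 1), so chi-squares match the null.
clean = [random.gauss(0, 1) ** 2 for _ in range(10_000)]
# Stratified data: z-scores overdispersed by a modest factor.
inflated = [random.gauss(0, 1.3) ** 2 for _ in range(10_000)]

print(round(lambda_gc(clean), 2))     # close to 1
print(round(lambda_gc(inflated), 2))  # clearly above 1
```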


Mixing problems

No, not mixing a lot of different problems, but problems with mixing.  In an MCMC.

So, I’ve implemented this quantitative trait model in HapCluster this week, and I am now testing how it works.  The first few tests went pretty well, but now I’m running into problems with some other data sets.

HapCluster is a haplotype-based fine-mapping tool that essentially has a single parameter of interest, $$x$$: the position of a trait marker (either a risk marker for a disease or, in the new model, a marker affecting a quantitative trait).  The model also has a lot of nuisance parameters, used for comparing haplotypes and clustering them locally around $$x$$.

Only three of the parameters are continuous real values: $$x$$, $$\delta$$ and $$\gamma$$.  These are easy to work with (compared to clusterings of haplotypes, inferred ancestral haplotypes and parameters like that), so I use these three to check for convergence and to get an idea of the effective sample size.

A run with HapCluster starts with two “burn in” runs, used to test convergence and estimate the effective sample size.  I do these two runs and then test whether the distribution of the three parameters is the same in both.  If it is, I conclude that the chain has converged; otherwise I do a third run, compare it with the second run, and so forth.
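One simple way to compare the two runs – not necessarily what HapCluster does internally – is a two-sample Kolmogorov–Smirnov statistic on the sampled values of a parameter:

```python
# Compare one parameter's samples across two runs: the KS statistic is
# the maximum distance between the two empirical CDFs. A small value
# suggests the runs sample the same distribution.
import random

def ks_statistic(a, b):
    """Max distance between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

random.seed(3)
run1 = [random.gauss(0.0, 1.0) for _ in range(2000)]
run2 = [random.gauss(0.0, 1.0) for _ in range(2000)]  # same distribution
run3 = [random.gauss(0.5, 1.0) for _ in range(2000)]  # chain still drifting

print(round(ks_statistic(run1, run2), 3))  # small: looks converged
print(round(ks_statistic(run1, run3), 3))  # large: needs another burn in
```

In practice one would turn the statistic into a p-value (or use a library routine), but the distance alone already makes the drifting chain obvious.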

After each run, I increase the thinning – that is, the number of parameter points I skip between each sample.  If the chain is mixing poorly, I need more data points to estimate the true distribution, and I can get that either by making longer runs or by increasing the thinning to reduce the auto-correlation.  It is the latter I use in the burn-in period.

If, after a number of burn-in iterations, I conclude that the chain has converged, I estimate the effective sample size (which is smaller than the actual sample size due to the auto-correlation), figure out how many samples I need to reach the desired effective sample size, and then run a final chain with that many iterations (using the thinning from the last burn-in iteration).
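The effective sample size calculation can be sketched like this – a rough version using the integrated autocorrelation time with a simple truncation rule, not HapCluster’s actual code:

```python
# ESS = N / tau, where tau = 1 + 2 * sum of autocorrelations, summed
# until the estimated autocorrelation first turns negative (a common
# crude cutoff; real diagnostics are more careful about this).
import random

def autocorr(xs, lag):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(xs):
    tau = 1.0
    for lag in range(1, len(xs) // 2):
        rho = autocorr(xs, lag)
        if rho <= 0:
            break
        tau += 2 * rho
    return len(xs) / tau

# An AR(1) chain with strong autocorrelation, standing in for an MCMC
# trace of one parameter.
random.seed(4)
x, chain = 0.0, []
for _ in range(5000):
    x = 0.9 * x + random.gauss(0, 1)
    chain.append(x)

# For AR(1) with phi = 0.9, tau = (1 + phi) / (1 - phi) = 19, so the
# ESS should come out roughly around 5000 / 19.
print(round(effective_sample_size(chain)))
```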

All this works pretty well for case/control data, except that the $$\delta$$ and $$\gamma$$ parameters sometimes mix poorly (have a high auto-correlation).  Not too badly, though, and usually the increased thinning over a burn-in iteration or three solves the problem.  In any case, they are nuisance parameters and do not seem to affect $$x$$ much anyway.

With the quantitative trait model, though, it looks like it is $$x$$ that has problems mixing, and worse, it gets stuck far from the trait position in some cases.

In the plot here, the dashed red line marks the position of the trait marker, and the different colours in the parameter estimates correspond to different burn-in iterations.

I’m not sure exactly how to deal with this.  Mixing is always causing me problems when I work with MCMCs…