Exploiting Hardy-Weinberg Equilibrium for association mapping

Testing single SNP markers for disease association is typically done by comparing the genotype frequencies of cases with those of controls, to see if they differ. The genotype frequencies, of course, must be estimated based on the sampled individuals, and there is some uncertainty in this estimate that might reduce the power. If the genotypes are in Hardy-Weinberg equilibrium (HWE), however, there’s a constraint on them that makes the estimate more accurate, so exploiting this in association mapping could increase the statistical power. This is the idea presented in this paper:

Exploiting Hardy-Weinberg Equilibrium for Efficient Screening of Single SNP Associations from Case-Control Studies

Chen and Chatterjee

Human Heredity 63: 196-204, 2007


In case-control studies, the assessment of the association between a binary disease outcome and a single nucleotide polymorphism (SNP) is often based on comparing the observed genotype distribution for the cases against that for the controls. In this article, we investigate an alternative analytic strategy in which the observed genotype frequencies of cases are compared against the expected genotype frequencies of controls assuming Hardy-Weinberg Equilibrium (HWE). Assuming HWE for controls, we derive closed-form expressions for maximum likelihood estimates of the genotype-specific disease odds ratio (OR) parameters and related variance-covariances. Based on these estimates and their variance-covariance structure, we then propose a two-degree-of-freedom test for disease-SNP association. We show that the proposed test can have substantially higher power than a variety of existing methods, especially when the true effect of the SNP is recessive. We also obtain analytic expressions for the bias of the OR estimates when the underlying HWE assumption is violated. We conclude that the novel test would be particularly useful for analyzing data from the initial ‘screening’ stages of contemporary multi-stage association studies.

It is actually something we have been playing with ourselves in my group, although for epistasis where the genotype frequencies are much harder to estimate because of very few observations of the rare genotypes. It was suggested to us by Patrick Sulem from DeCODE, but this paper is the first I’ve seen that describes the underlying statistics of it.

Hardy-Weinberg Equilibrium and exploiting it in association mapping

Hardy-Weinberg Equilibrium, or HWE, is a result from population genetics that says that in a randomly mating population, the proportions of the genotypes AA, Aa and aa are given by p², 2pq and q², where p is the allele frequency for A and q = 1 - p is the allele frequency for a. This equilibrium can, of course, be off in various ways, but in general these are the proportions in which we expect to observe the three genotypes. Now if the genotypes are in HWE, we need only estimate the allele frequency (one parameter) rather than the genotype frequencies (two parameters). As a rule of thumb, the fewer parameters we need to estimate before we perform our test, the better off we are. (This is of course something that must be checked from case to case, but in this case it is true…).
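To make the "one parameter instead of two" point concrete, here is a minimal Python sketch. The function names are my own, made up for illustration: one estimates the allele frequency from genotype counts, the other turns that single number into the three expected genotype proportions under HWE.

```python
def allele_frequency(n_AA, n_Aa, n_aa):
    """Estimate the frequency p of allele A from genotype counts.

    Each AA individual carries two copies of A, each Aa individual one.
    """
    total = n_AA + n_Aa + n_aa
    if total == 0:
        raise ValueError("need at least one genotyped individual")
    return (2 * n_AA + n_Aa) / (2 * total)


def hwe_genotype_frequencies(p):
    """Expected proportions of (AA, Aa, aa) under HWE for allele frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


if __name__ == "__main__":
    p = allele_frequency(10, 20, 70)
    print(p, hwe_genotype_frequencies(p))
```

The three proportions always sum to one, since p² + 2pq + q² = (p + q)².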

Now, if we assume that the population as such is in HWE, and that the genetic effect of the disease is not too severe, then we would expect the controls to be in HWE. So rather than estimating genotype frequencies for the controls, we can instead estimate allele frequencies and get the genotype frequencies from the allele frequencies and the HWE assumption. We can then use these expected genotype frequencies in the association test. For cases we probably cannot assume HWE, at least it is hard to see how cases can be in HWE if the locus has an effect on disease status…
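As a rough illustration of the idea (a naive plug-in sketch of my own, not the test or the variance calculations derived in the paper), one can estimate the allele frequency from the controls, derive the expected control genotype proportions under HWE, and compare the observed case genotypes against those. The function name and the choice of aa as the reference genotype are assumptions made for this example.

```python
def hwe_odds_ratios(cases, controls):
    """Genotype odds ratios for Aa and AA relative to aa.

    cases, controls: tuples of genotype counts (n_AA, n_Aa, n_aa).
    Cases use their observed genotype counts; controls are replaced by the
    genotype proportions expected under HWE, so only the control allele
    frequency is estimated from the control sample.

    Assumes the case aa count is non-zero and the control sample is
    polymorphic (0 < p < 1); otherwise the ratios are undefined.
    """
    n_AA, n_Aa, n_aa = controls
    p = (2 * n_AA + n_Aa) / (2 * (n_AA + n_Aa + n_aa))
    q = 1.0 - p
    exp_AA, exp_Aa, exp_aa = p * p, 2.0 * p * q, q * q

    c_AA, c_Aa, c_aa = cases
    or_Aa = (c_Aa / c_aa) / (exp_Aa / exp_aa)
    or_AA = (c_AA / c_aa) / (exp_AA / exp_aa)
    return or_Aa, or_AA


if __name__ == "__main__":
    print(hwe_odds_ratios(cases=(50, 50, 25), controls=(25, 50, 25)))
```

Note that this only gives point estimates. Turning them into a proper test requires the variance-covariance structure, which is what the paper works out.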

Anyway, in this paper they show that using the expected genotype frequencies — expected under the HWE assumption — the power of the test is improved. Quite dramatically for recessive disease effects and less so for dominant and multiplicative effects.

The HWE assumption might be violated, so to trust the test we must know how robust it is to violations of this assumption. The paper shows that deviations from HWE certainly do affect the test, and that they do so by increasing the number of false positives. The authors then suggest that the test can be used to screen GWA data in an initial stage, but that it probably shouldn’t be used in later stages.

Personally, I am a bit curious about how you could go about detecting the degree of Hardy-Weinberg disequilibrium and perhaps compensate for it in the test. Of course, that would give you another parameter to estimate, so you might end up losing the power gained by assuming HWE, so it might not be the way to go…
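For the detection part at least, there are standard tools. A minimal sketch (function name is mine, and this is not something taken from the paper): the usual one-degree-of-freedom chi-square statistic for departure from HWE, together with the inbreeding coefficient F, which measures the relative deficit of heterozygotes compared to the HWE expectation. How to feed such an estimate back into the association test is the open question.

```python
def hwe_deviation(n_AA, n_Aa, n_aa):
    """Return (F, chi2) measuring departure from HWE in one sample.

    F    = 1 - observed heterozygotes / expected heterozygotes
           (0 under HWE, positive when heterozygotes are missing).
    chi2 = Pearson goodness-of-fit statistic against the HWE expected
           counts; approximately chi-square with 1 degree of freedom.

    Assumes a polymorphic marker (0 < p < 1), otherwise the expected
    counts contain zeros and the statistics are undefined.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2.0 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    f = 1.0 - n_Aa / expected[1]
    return f, chi2


if __name__ == "__main__":
    print(hwe_deviation(25, 50, 25))   # sample exactly in HWE
    print(hwe_deviation(50, 0, 50))    # no heterozygotes at all
```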

Chen, J., Chatterjee, N. (2007). Exploiting Hardy-Weinberg Equilibrium for Efficient Screening of Single SNP Associations from Case-Control Studies. Human Heredity, 63, 196-204.

Googley UI design

What makes a user interface Googley? asks Sue Factor on Google’s blog.  I don’t really know.  Like porn, it is hard to define but I know it when I see it.

The main thing I notice with Google’s services is the simplicity.  Especially in the various “search like” applications (“real” Google, Google Maps, Google Scholar, etc.) you have a simple input bar and a list of results.  It couldn’t be simpler.

The simplicity is combined with a very powerful application, though.  The syntax for searches is extremely powerful, and I keep learning new tricks.  It is a wonderful user experience when the application combines a smooth learning curve with a learning curve that just doesn’t seem to stop climbing…

Another web application with a similar feeling to it as Google’s applications is Amir’s Todoist.  A simple web todo-list I have been using for a while.  Try it out and I think you will agree with me.

Machine Learning slides

In less than two hours, I give my last Machine Learning lecture before I head off to the UK (I’m giving a few more when I get back). The slides from the lectures so far can be seen below:

Course Introduction

This is just a brief overview of what’s to follow. Motivates the topic and such.

Introduction to probability and statistics

This is two lectures covering the basic probability theory and statistics needed for the course. This is probably the most theoretical part of the course, and in my experience the material that confuses the students the most.

Linear Regression

This is the basic linear regression that you learn in just about any statistics class, except that there’s also a bit of Bayesian statistics and some model-selection/over-fitting theory that I haven’t seen in pure statistics classes (the classes I have taken have focused more on hypothesis testing approaches).

To get our students activated at this point, we also give them a simple project where they need to fit data to a linear model and try to predict target values based on their model.
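To give a flavour of what such a project involves (this is just a toy sketch of my own, not the actual assignment), here is an ordinary least-squares fit of a straight line, followed by a prediction for a new input. The function names and the data are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Assumes at least two points with distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predict the target value for a new input x."""
    return intercept + slope * x


if __name__ == "__main__":
    intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print(intercept, slope, predict(intercept, slope, 4))
```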

Linear Classification

This, then, is today’s lecture. Classification, where the training consists of changing weights that split the predictor space into linear regions.
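The perceptron is probably the simplest example of this kind of training, so here is a small sketch of it. It is one instance of linear classification, picked by me for illustration, with labels coded as -1 and +1 and toy data of my own.

```python
def train_perceptron(points, labels, epochs=100, learning_rate=1.0):
    """Learn weights w and bias b so that sign(w . x + b) matches the labels.

    points: list of equal-length tuples; labels: list of -1 / +1.
    Stops early once a full pass makes no mistakes, which is only
    guaranteed to happen when the data are linearly separable.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                # Move the separating hyperplane towards the correct side.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


if __name__ == "__main__":
    # Logical AND, which is linearly separable.
    points = [(0, 0), (0, 1), (1, 0), (1, 1)]
    labels = [-1, -1, -1, 1]
    w, b = train_perceptron(points, labels)
    print([classify(w, b, x) for x in points])
```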


There seem to be some problems with some of the slideshare plugins above (some of the text is missing from time to time), but if you are interested you can find the slides (in PDF or OpenOffice format) here and you can see the course description and schedule here.

We need faster machines…

I missed this post earlier, but it is a short little story of things to come … well, it is already happening at Washington University where they are at the forefront of genome sequencing, but the rest of us will feel it soon enough.

With new sequencing technologies, we are creating massive data sets, and the IT infrastructure is becoming a bottleneck. In my own work (association mapping and genetics) we have already been feeling this for a while, ’cause we are using quite time-consuming analysis methods, but with the vast increase in data even the simplest analysis could be a problem.

Apparently, 1600 cores is not enough at WashU.

Computer scientists are becoming more important for biology, I guess, and it is time for those algorithmics guys to get cracking.

Second week of “Machine Learning”

Today is the second week of the “Machine Learning” class I teach with Christian Storm (I teach the first two and last three while Storm takes those in between when I am away in the UK for a trip).

The first week I cover basic probability theory and statistics, and this week I’ll cover linear models, both regression and classification. I get the feeling that this is a lot more mathematics and statistics than my students expected. I got the same impression last time I taught it. It is mainly computer science students, and at least in Aarhus, computer scientists really, really hate statistics.

Of course, it probably doesn’t help matters that we have a lot of basic stuff to cover the first week before we can get started on the proper machine learning material. It is not a statistics course, so I don’t want to spend more than a single week on the basic probability theory and statistics, but since this really is the necessary mathematics to understand what follows, and since probability and statistics is no longer a mandatory part of the introductory computer science program, it needs to be introduced.

Having to cover a lot of material in very little time, with students who in general do not really like the topic, is a bit of a problem for me.  I do not feel that my approach is working, but I do not really have any good ideas about what else to do…