My main research area at the moment is association mapping, and in this post I will try to explain what association mapping is, what the challenges are and which approaches are used. It is a bit of a long post, but I hope I can keep your interest all the way through.
The description is a bit simplified. I wanted to explain my research for a broad audience, so I am leaving out a lot of details. I guess what I am saying is that you shouldn’t try to get an academic degree from reading a blog ;-) Still, I hope you get something out of it, and if you want to know more, do not hesitate to contact me.
Genetic variation and association with disease risks
We are not all genetically identical. If we were, we would all look like identical twins. It is not that there is a large difference between our genes. After all, there is only a few percent difference between humans and chimpanzees, and, jokes aside, there is a huge difference between humans and chimps compared to the difference between individual humans. But there is some genetic difference between us.
Our genes are part of what determines our phenotype. The phenotype is the “observable quality of an organism” (see e.g. Wikipedia) so think of things such as physical appearance but also inherited “gifts”: a gift for music, a gift for long-distance running, etc. The phenotype is not completely determined by the genes (even with the right “genes for long-distance running” you will not win the Olympic marathon without some serious training), but a part of what determines the phenotype is in the genes. Some people, myself included, would never make it to the Olympic marathon regardless of the amount of training we put into it.
Part of our phenotype is the risk we have of getting various diseases. You probably know some people who never seem to catch a cold. When everyone else is sneezing and coughing, these people are as healthy as always. On the rare occasions when they do catch a cold, it is over in a day or two, while the rest of us are down for a week at least. Other people, and you probably also know a few of those, will always catch any virus going around.
The genetic component of a disease, that is, the part of the risk of getting the disease that is determined by the genes, varies a lot from disease to disease. We can get a rough idea of the magnitude of the genetic component by looking at the families of disease cases: if, when you consider the relatives of a disease case, you are more likely to find another case than you would if you considered a random individual and his relatives, then the disease probably has a non-negligible genetic component. This can be formalised through statistics, so we can actually measure the magnitude of the genetic component of a disease and compare the genetic contribution to disease risk between different diseases.
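To make the family comparison above a little more concrete, here is a minimal sketch of one common way to put a number on it: divide the risk among relatives of cases by the risk in the population at large. The function name and the counts are made up for illustration, and real studies use considerably more careful estimators.

```python
def recurrence_risk_ratio(affected_relatives, total_relatives, population_prevalence):
    """Risk among relatives of cases divided by the risk in the population at large."""
    if total_relatives <= 0:
        raise ValueError("need at least one relative")
    if not 0 < population_prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    risk_among_relatives = affected_relatives / total_relatives
    return risk_among_relatives / population_prevalence


# Made-up numbers, purely for illustration: 30 of 1000 relatives of cases
# are affected, against a population prevalence of 1%.
example_ratio = recurrence_risk_ratio(30, 1000, 0.01)
print(f"Relatives of cases are about {example_ratio:.1f} times as likely to be affected")
```

A ratio close to 1 means relatives of cases are no worse off than anyone else. A ratio well above 1 hints at a genetic component, although a shared environment can produce the same pattern.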
For the common cold or the flu, this isn’t quite so simple, of course, since we need to take into account if the relatives of our disease case have even been exposed to the virus in the first place, or if they have already had the cold but recovered, and so on. For more serious diseases, such as cancers, it is simpler to figure out if there are more cases among the relatives of cases than among relatives of non-cases, but there are still environmental factors to consider, like exposure to the virus in our common cold example. That being said, it is not a major challenge for statistics to figure out if a disease has a major or minor genetic component, and we now know that several serious diseases have major genetic components: several cancers, diabetes and schizophrenia are examples of such diseases.
Searching for disease genes
If we know that a given disease has a significant genetic component, the obvious follow-up question is: which genes are actually responsible for the disease risk?
It is a simple question, but coming up with the answer is more difficult than you might imagine.
The problem is, in essence, that the genome is very large, so there are lots of places the disease-affecting genes can hide. There are around three billion nucleotides in the human genome (letters A, C, G and T that write out all our genetic material), and any of these can, in theory, affect the disease risk.
Luckily, we do not need to examine all of these. The way genes are passed from parents to children through the generations causes the individual nucleotides to be statistically linked in the sense that knowing about one nucleotide gives you information about many others. Instead of having to look at three billion nucleotides, we can gain essentially the same information from five hundred thousand to a million nucleotides. We just have to be careful in selecting the right million nucleotides.
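The statistical link between two nucleotides can be measured. Below is a minimal sketch of one standard measure, the squared correlation (usually written r²) between two variable sites. It assumes each site has two variants coded 0 and 1, and that we know which variants sit together on the same chromosome copy. The function name and the toy data are my own illustration.

```python
def r_squared(chromosomes):
    """Squared correlation between two two-variant sites.

    chromosomes: list of (a, b) pairs, one per chromosome copy, where a and b
    are the variants (coded 0 or 1) carried at the first and second site.
    """
    n = len(chromosomes)
    freq_a = sum(a for a, _ in chromosomes) / n
    freq_b = sum(b for _, b in chromosomes) / n
    freq_ab = sum(1 for a, b in chromosomes if a == 1 and b == 1) / n
    # How much more often the two 1-variants travel together than chance predicts.
    excess = freq_ab - freq_a * freq_b
    denominator = freq_a * (1 - freq_a) * freq_b * (1 - freq_b)
    if denominator == 0:
        raise ValueError("both sites must actually vary in the sample")
    return excess * excess / denominator


# Toy data: in the first sample the two sites always agree, in the second
# they are unrelated.
perfectly_linked = [(0, 0)] * 5 + [(1, 1)] * 5
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(r_squared(perfectly_linked), r_squared(unlinked))
```

When the value is close to 1, looking at one of the two sites tells you nearly everything about the other, so only one of them needs to be examined.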
Over the last decade, several large projects (most notably the International HapMap Project) have focused on mapping the genetic variation in humans, and the knowledge gained through these projects enables us to pick the right nucleotides to examine.
Now all we need to do is look at those selected nucleotides. What we are looking for, then, is nucleotides where there is a difference between disease cases and people not affected by the disease. This, again, is a statistical problem, and not even a particularly difficult one at that.
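As a rough illustration of the kind of test meant here, the sketch below compares how often each of two variants at a single nucleotide occurs among cases and among controls, using a textbook chi-square test on a 2x2 table. This is only one simple choice of test, the counts are invented, and it uses nothing beyond the standard library.

```python
import math


def variant_chi_square(case_a, case_b, control_a, control_b):
    """Chi-square test (1 degree of freedom) on a 2x2 table of variant counts.

    Returns the test statistic and its p-value.
    """
    total = case_a + case_b + control_a + control_b
    row_cases = case_a + case_b
    row_controls = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    if 0 in (row_cases, row_controls, col_a, col_b):
        raise ValueError("every row and column needs at least one observation")
    cross = case_a * control_b - case_b * control_a
    statistic = total * cross * cross / (row_cases * row_controls * col_a * col_b)
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts: variant A is seen 60 times in cases and 40 times in controls.
statistic, p_value = variant_chi_square(60, 40, 40, 60)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

A small p-value says the difference between cases and controls would be surprising if the nucleotide had nothing to do with the disease.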
This works in theory, anyway. It is complicated, however, by the relatively small effect each nucleotide usually has on the disease risk. Even for diseases with a very high genetic component, the individual genes will not necessarily have a major effect on the disease risk. You need several unlucky genes to add up to a high risk, so if you look at them individually, you might be looking at very small effects.
This is a problem called statistical power, or rather the lack of power, and to overcome this you either need to consider a very large sample of cases or you need more sophisticated statistical analysis. Most likely, you need both.
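To get a feel for why small effects demand large samples, here is a back-of-the-envelope sketch of statistical power. It uses a textbook normal approximation for comparing a variant's frequency between cases and controls. The frequencies, the sample sizes and the very strict significance threshold (0.05 divided by a million tests, a simple correction I am assuming here because so many nucleotides are tested) are all made up for illustration.

```python
import math
from statistics import NormalDist


def approximate_power(freq_cases, freq_controls, chromosomes_per_group, alpha):
    """Approximate chance of detecting a frequency difference (normal approximation)."""
    standard_normal = NormalDist()
    standard_error = math.sqrt(
        freq_cases * (1 - freq_cases) / chromosomes_per_group
        + freq_controls * (1 - freq_controls) / chromosomes_per_group
    )
    effect_in_standard_errors = abs(freq_cases - freq_controls) / standard_error
    threshold = standard_normal.inv_cdf(1 - alpha / 2)
    return standard_normal.cdf(effect_in_standard_errors - threshold)


# Assumed threshold: 0.05 spread over a million tested nucleotides.
strict_alpha = 0.05 / 1_000_000

# A variant seen in 22% of case chromosomes against 20% of control chromosomes.
power_small_study = approximate_power(0.22, 0.20, 2_000, strict_alpha)
power_large_study = approximate_power(0.22, 0.20, 40_000, strict_alpha)
print(f"small study: {power_small_study:.5f}, large study: {power_large_study:.2f}")
```

With these invented numbers the small study has essentially no chance of finding the variant, while the study that is twenty times larger finds it most of the time.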
The number of cases is mainly a matter of cost. Each new case you add to a study costs about $1000. For rare diseases, of course, there might not be enough people with the disease, but mainly it is a problem of cost.
The problem with sophisticated statistical methods is mainly a matter of computational efficiency. A lot of very clever methods require hours or days of computer time. If you want to apply such methods to a million nucleotides, we are easily talking about years of computer time.
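The arithmetic behind that claim is easy to check. The sketch below assumes, purely for illustration, that the running time is paid once per nucleotide on a single computer with no parallelism. The function name and the per-nucleotide times are mine.

```python
HOURS_PER_YEAR = 24 * 365


def total_compute_years(hours_per_nucleotide, nucleotide_count):
    """Total running time in years if every nucleotide is analysed one after another."""
    return hours_per_nucleotide * nucleotide_count / HOURS_PER_YEAR


# Hypothetical costs: one hour per nucleotide, and one minute per nucleotide.
years_at_one_hour = total_compute_years(1.0, 1_000_000)
years_at_one_minute = total_compute_years(1 / 60, 1_000_000)
print(f"{years_at_one_hour:.0f} years at one hour each, "
      f"{years_at_one_minute:.1f} years at one minute each")
```

Even a method that needs only a minute per nucleotide adds up to roughly two years of computer time under these assumptions, which is why the efficiency of the method matters so much.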
The latter is the problem I am personally doing my research on: the development of computationally efficient, yet still statistically powerful, methods.
What use is it, really?
Okay, let’s assume we discover a set of genes that affect the risk of a disease. What then?
Despite the hype that usually surrounds association studies, there isn’t really that much of an initial benefit in knowing which genes affect a disease. If you have a gene that doubles your risk of lung cancer, it might be increasing your risk from 0.5% to 1%. I personally doubt that anyone will change their life over this knowledge. After all, consider how many people are still smoking, knowing full well that the increased risk for lung cancer is orders of magnitude higher than anything your genes might contribute.
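The difference between a relative and an absolute risk is worth spelling out with the numbers from the lung cancer example above. This is just the arithmetic, with names of my own choosing.

```python
def risk_with_variant(baseline_risk, relative_risk):
    """Absolute risk for a carrier, given the baseline risk and the risk multiplier."""
    risk = baseline_risk * relative_risk
    if not 0 <= risk <= 1:
        raise ValueError("resulting risk must be a probability")
    return risk


baseline = 0.005  # 0.5%, the baseline risk from the example in the text
carrier_risk = risk_with_variant(baseline, 2.0)  # a gene that doubles the risk
absolute_increase = carrier_risk - baseline
print(f"carrier risk: {carrier_risk:.1%}, absolute increase: {absolute_increase:.1%}")
```

Doubling sounds dramatic, but in absolute terms the carrier's risk has gone up by half a percentage point.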
The benefits of knowing which genes affect a disease are downstream in research. Just knowing which genes have an effect on a disease tells us something about the disease. Follow-up research on the genes will tell us how they affect the disease, which in turn might tell us how to prevent or cure the disease.
Locating the genes is the first step. The first of many. It is a very important step, but please don’t expect the discovery of a new “cancer gene” to be followed by a cure for cancer in a year or two. We are only at the very beginning of understanding how our genes affect our health.
We are taking the very first steps in a very important journey.