Approximate pattern matching

Tomorrow I’m teaching string algorithms covering approximate pattern matching and the Wu-Manber algorithm.

I’m actually also teaching genome analysis but Mikkel is giving the lecture tomorrow, so I don’t have to worry about that.

The good thing about string algorithms is that I have taught it several times before, so there is very little preparation time. I probably ought to spend some more time on it this time, ’cause I don’t find approximate pattern matching particularly interesting in this class (it is more interesting in algorithms in bioinformatics, the class that Storm is teaching). I wanted to replace it with something else this year, but didn’t find the time.

Just for the fun of it, I’ve started using Slideshare to publish my presentations. I also put the slides on the course homepage, of course, but with Slideshare I can put the presentations directly on the web like this:



Now isn’t that cool?

Whether the slides make sense without someone presenting them, I don’t know. In some sense I hope not, because then I am really wasting my time giving the lectures…

What is association mapping?

My main research area at the moment is association mapping, and in this post I will try to explain what association mapping is, what the challenges are and what the approaches used are. It is a bit of a long post, but I hope I can keep your interest all the way through.

The description is a bit simplified. I wanted to explain my research for a broad audience, so I am leaving out a lot of details. I guess what I am saying is that you shouldn’t try to get an academic degree from reading a blog ;-) Still, I hope you get something out of it, and if you want to know more, do not hesitate to contact me.

Genetic variation and association with disease risks

We are not all genetically identical. If we were, we would all look like identical twins. It is not that there is a large difference between our genes. After all, there is only a few percent difference between humans and chimpanzees, and jokes aside, there is a huge difference between humans and chimps compared to the difference between individual humans. But there is some genetic difference between us.

Our genes are part of what determines our phenotype. The phenotype is the “observable quality of an organism” (see e.g. Wikipedia), so think of things such as physical appearance but also inherited “gifts”: a gift for music, a gift for long-distance running, etc. The phenotype is not completely determined by the genes (even with the right “genes for long-distance running” you will not win the Olympic marathon without some serious training), but a part of what determines the phenotype is in the genes. Some people, myself included, would never make it to the Olympic marathon regardless of the amount of training we put into it.

Part of our phenotype is the risk we have of getting various diseases. You probably know some people who never seem to catch a cold. When everyone else is sneezing and coughing, these people are as healthy as always. On the rare occasions when they do catch a cold, it is over in a day or two, while the rest of us are down for a week at least. Other people, and you probably also know a few of those, will always catch any virus going around.

The genetic component of a disease, the part of the risk of getting the disease that is determined by the genes, varies a lot from disease to disease. We can get a rough idea of the magnitude of the genetic component by looking at the families of disease cases: if you are more likely to find another case among the relatives of a disease case than among the relatives of a random individual, then the disease probably has a non-negligible genetic component. This can be formalised through statistics, so we can actually measure the magnitude of the genetic component of a disease and compare the genetic contribution to disease risk between different diseases.
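One common way to put a number on this idea is a recurrence risk ratio: the risk among relatives of cases divided by the risk in the general population. The sketch below is only an illustration of that ratio, not the full statistical machinery; the function name and all the numbers are made up for the example.

```python
def recurrence_risk_ratio(affected_relatives, total_relatives, population_prevalence):
    """Risk among relatives of cases divided by the population risk.

    A ratio well above 1 suggests familial clustering, which may reflect
    a genetic component (or a shared environment).
    """
    if total_relatives <= 0:
        raise ValueError("total_relatives must be positive")
    if not 0 < population_prevalence <= 1:
        raise ValueError("population_prevalence must be in (0, 1]")
    risk_in_relatives = affected_relatives / total_relatives
    return risk_in_relatives / population_prevalence


# Hypothetical numbers: 30 of 1,000 siblings of cases are affected,
# while 1% of the general population has the disease.
ratio = recurrence_risk_ratio(30, 1000, 0.01)
print(f"Siblings of cases are about {ratio:.1f} times as likely to be affected")
```

A ratio close to 1 would mean relatives of cases are at no more risk than anyone else. Note that a high ratio on its own cannot separate genes from shared environment.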

For the common cold or the flu, this isn’t quite so simple, of course, since we need to take into account whether the relatives of our disease case have even been exposed to the virus in the first place, or if they have already had the cold but recovered, and so on. For more serious diseases, such as cancers, it is simpler to figure out if there are more cases among the relatives of cases than among relatives of non-cases, but there are still environmental factors to consider, like exposure to the virus in our common cold example. That being said, it is not a major challenge for statistics to figure out if a disease has a major or minor genetic component, and we now know that several serious diseases have major genetic components: several cancers, diabetes and schizophrenia are examples of such diseases.

Searching for disease genes

If we know that a given disease has a significant genetic component, the obvious follow-up question is: which genes are actually responsible for the disease risk?

It is a simple question, but coming up with the answer is more difficult than you might imagine.

The problem is, in essence, that the genome is very large, so there are lots of places the disease-affecting genes can hide. There are around three billion nucleotides in the human genome (the letters A, C, G and T that write out all our genetic material), and any of these can, in theory, affect the disease risk.

Luckily, we do not need to examine all of these. The way genes are passed from parents to children through the generations causes the individual nucleotides to be statistically linked, in the sense that knowing about one nucleotide gives you information about many others. Instead of having to look at three billion nucleotides, we can gain essentially the same information from five hundred thousand to a million nucleotides. We just have to be careful in selecting the right million nucleotides.
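To make “statistically linked” a bit more concrete, here is a small sketch of one standard measure of this kind of linkage, the squared correlation (r²) between two positions. The input format (a list of 0/1 pairs, one pair per chromosome) and the toy data are assumptions made for the illustration.

```python
def r_squared(haplotypes):
    """Squared correlation between alleles at two positions.

    haplotypes: list of (a, b) pairs, one per chromosome, where a and b
    are 0 or 1 depending on which allele is seen at each position.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at position A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at position B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both positions must be variable in the sample")
    return d * d / denom


# Toy data: in `linked`, knowing position A tells you position B exactly;
# in `unlinked`, it tells you nothing.
linked = [(0, 0), (1, 1), (0, 0), (1, 1)]
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(r_squared(linked), r_squared(unlinked))
```

When r² is close to 1, examining one of the two positions is nearly as informative as examining both, which is why a well-chosen subset of nucleotides can stand in for the rest.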

Over the last decade, several large projects (most notably the Human HapMap Project) have focused on mapping the genetic variation in humans, and the knowledge gained through these projects enables us to pick the right nucleotides to examine.

Now all we need to do is look at those selected nucleotides. What we are looking for, then, are nucleotides where there is a difference between disease cases and people not affected by the disease. This, again, is a statistical problem, and not even a particularly difficult one at that.
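As a rough illustration of what such a test can look like for a single nucleotide, the sketch below runs a plain chi-square test on a 2×2 table of allele counts in cases and controls. It is one simple choice among many, with no continuity correction, and the counts are invented for the example.

```python
import math


def allele_chi_square(case_counts, control_counts):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    case_counts, control_counts: (count of allele 1, count of allele 2).
    Returns (chi-square statistic, p-value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom the upper tail of the chi-square
    # distribution can be written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: allele 1 is seen 60 times in cases, 40 times in controls.
chi2, p = allele_chi_square((60, 40), (40, 60))
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

In a real study this would be repeated for each of the selected nucleotides, so the p-values would also have to be corrected for the number of tests.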

This works in theory, anyway. It is complicated, however, by the relatively small effect each nucleotide usually has on the disease risk. Even for diseases with a very high genetic component, the individual genes will not necessarily have a major effect on the disease risk. You need several unlucky genes to add up to a high risk, so if you look at them individually, you might be looking at very small effects.

This is a problem called statistical power, or rather the lack of power, and to overcome this you either need to consider a very large sample of cases or you need more sophisticated statistical analysis. Most likely, you need both.

The number of cases is mainly a matter of cost. Each new case you add to a study costs about $1000. For rare diseases, of course, there might not be enough people with the disease, but mainly it is a problem of cost.

The problem with sophisticated statistical methods is mainly a matter of computational efficiency. A lot of very clever methods require hours or days of computer time. If you want to apply such methods to a million nucleotides, we are easily talking about years of computer time.
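The arithmetic behind that claim is easy to check. The sketch below assumes, purely for illustration, that a method takes one hour per nucleotide (the low end of “hours or days”) and that the work splits perfectly across machines; both figures and the function names are assumptions.

```python
HOURS_PER_YEAR = 24 * 365


def total_cpu_years(hours_per_marker, n_markers):
    """Total computer time if every marker is analysed one after another."""
    return hours_per_marker * n_markers / HOURS_PER_YEAR


def wall_clock_years(hours_per_marker, n_markers, n_cores):
    """Elapsed time if the markers are spread evenly over n_cores machines."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return total_cpu_years(hours_per_marker, n_markers) / n_cores


# Assumed: one hour per nucleotide, one million nucleotides.
print(f"{total_cpu_years(1, 1_000_000):.0f} CPU years on a single machine")
print(f"{wall_clock_years(1, 1_000_000, 1000):.2f} years on 1,000 machines")
```

Even with a large cluster the analysis takes weeks under these assumptions, and a method that needs days rather than hours per nucleotide is out of reach altogether.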

The latter is the problem I am personally doing my research on: development of computationally efficient, yet still statistically powerful, methods.

What use is it, really?

Okay, let’s assume we discover a set of genes that affect the risk of a disease. What then?

Despite the hype that usually surrounds association studies, there isn’t really that much of an initial benefit in knowing which genes affect a disease. If you have a gene that doubles your risk of lung cancer, it might be increasing your risk from 0.5% to 1%. I personally doubt that anyone will change their life over this knowledge. After all, consider how many people are still smoking, knowing full well that the increased risk of lung cancer from smoking is orders of magnitude higher than anything your genes might contribute.

The benefits of knowing which genes affect a disease are downstream in research. Just knowing which genes have an effect on a disease tells us something about the disease. Follow-up research on the genes will tell us how each gene affects the disease, which in turn might tell us how to prevent or cure the disease.

Locating the genes is the first step. The first of many. It is a very important step, but please don’t expect the discovery of a new “cancer gene” to be followed by a cure for cancer in a year or two. We are only at the very beginning of understanding how our genes affect our health.

We are taking the very first steps in a very important journey.

Money, money, money…

I’ve just found out that I have DKK 100,000 left on the grant I am funded by until Feb 1st next year (when I move to a different grant but roughly the same research project). That’s a lot of money left to spend in a month and a half.

The reason I have that much left is twofold: 1) in the budget I added a salary increase one year into the project that I didn’t get until half a year later (apparently I was too young after one year to move from assistant professor to associate professor level) and 2) I never used the travel budget because most of my travelling was instead paid for by the PolyGene project.

Now I’m looking for ways to spend that money. I’ve ordered a new iMac with as high a spec as I can reasonably get, but that only costs about 20,000. For the rest I am thinking about either adding machines to our Linux cluster at BiRC or, if possible through some clever bookkeeping, finding a way to use the money over the next year on a student programmer.

The latter is the nicer solution, ’cause I really have all the computer power I need through Brian Vinter’s grid if only I finish the framework for accessing it, so I will get more use out of a programmer to help with that (and all the other software development we need in the association mapping group) than I would get out of buying more computers. Not that getting more computers for our cluster would be wasted (we are pushing the limit of our system on a weekly basis, and using our own cluster has some benefits that are missing on Brian’s grid); I just think I can get more out of a programmer.

Of course, getting a programmer requires permission to spend the money over the next year rather than before the end of the grant, and I am not sure how to go about achieving that, but I’ll talk to our accountants to see if it is possible.

Teaching a course the nth time is so much easier than teaching it the first time

I’ve just finished preparing slides for my lecture in string algorithms tomorrow. It took about 15 minutes, and that was spent on reformatting the slides to the new version of Impress (for some reason the text always jumps about a bit when the version of OOo changes).

In comparison, I spent all Sunday preparing for my lectures in genome analysis, a class I am teaching for the first time this term. Having a clear idea about what to cover in the lecture, and of course having old slides to pick from when preparing the presentation, really speeds up the preparation.

I guess one of the reasons I have managed so little research this year is that I have taken on three completely new courses (in addition to two old ones) to teach. Next year I only have one new class to teach and four that I have already taught before, so that isn’t so bad.

It’s alive, it’s aliiiiiiive!

That’s right, my blog is up and running again.

About a year ago I took down my blog to save the web server. My home-grown blogging software, with the very inefficient backend data representation I used for the postings and archives, couldn’t cope with the traffic, so I decided to take it down for a bit while I found some software elsewhere to run the blog on. That took a while.

Now I’m running WordPress and I am looking forward to trying it out. So far I am quite happy with it, but then so far I have really only played with it and not done any “serious” blogging, so we will see how it goes. Anyway, with WordPress I have the option of hacking the underlying code, so I’ll get flexibility similar to my home-grown code, but without the inefficiency. That isn’t bad in my book.

For the layout I’ve just picked a theme from WordPress and tweaked it a little bit. I guess I am going to play with the themes a bit more before I worry about writing my own theme. For one thing, I am getting the impression that I need to understand PHP before I can do much on the blog, so I’ll have to look into that, but my time is very limited these days so I don’t know when I’ll get around to it.