When did humans split from the apes anyway?

During some random surfing I stumbled upon these two blog posts:

both by John Hawks.

I found these interesting not least because he refers to a paper that we published earlier this year:

Hobolth A, Christensen OF, Mailund T, Schierup MH. 2007. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet 3:e7. doi:10.1371/journal.pgen.0030007

That paper was mainly about a new statistical method for analysing speciation, a method that combines comparative genomics with population genetics by joining hidden Markov models with coalescence theory. Of course, that is not really what caught people’s attention. What we did in the paper was to apply our new method to data from human, gorilla, chimp and orangutan, and one result that came out of that was a very recent split between human and chimp; a split only 4.1 million years old.

We get a very recent speciation split between human and chimp exactly because of the combination of population genetics and genomics. If we only look at the genomic sequences, the distance between these will necessarily be larger than the distance between the species — it takes a while from the time a piece of DNA sits in a single ancestral individual until its descendants are found in two different individuals in separate species — and our method is able to estimate the speciation split from the genome split.
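To give a feel for the sizes involved, here is a back-of-the-envelope sketch. Under the coalescent, two lineages sampled from an ancestral population with effective size Ne find a common ancestor, on average, 2·Ne generations before the split, so the average genomic divergence is the speciation time plus that extra bit. The numbers below are purely illustrative, not the estimates from our paper:

```python
# Sketch of why genomic divergence predates the speciation split.
# Two lineages from an ancestral population of effective size Ne
# coalesce, on average, 2*Ne generations before the split.
# (All numbers below are made up for illustration.)

def expected_divergence_years(split_years, ancestral_ne, gen_time_years):
    """Average time back to a common ancestor for a random locus."""
    return split_years + 2 * ancestral_ne * gen_time_years

split = 4_100_000   # assumed speciation time in years
ne = 50_000         # assumed ancestral effective population size
gen = 20            # assumed generation time in years

print(expected_divergence_years(split, ne, gen))  # -> 6100000, i.e. 6.1 million years
```

With these made-up numbers, a 4.1 million year speciation would look like a 6.1 million year divergence in the sequences, which is roughly the size of the gap the method has to account for.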

I’m not sure how well I am explaining this here. I gave a (not too technical) talk in the computer science department some months ago, maybe that explains it better:

(sorry about the quality of the slides here, it looks like slideshare messed up the fonts)

A few other studies of genomic data before our own also reported more recent speciation times for human and chimp than previously believed — moving the time from about 6-8 million years ago down to maybe 4-5 million years ago — so a recent divergence between human and chimp might not be too far-fetched after all. Still, I think our estimate is a bit too recent.

This is also what John Hawks writes.

Why do we get such a recent divergence, then?

It is hard to say. The 4.1 million years is what comes out of applying our method to the (admittedly small) data we had. It is a very new method, however. There is a lot it does not take into account, and there might be biases in it we haven’t fully understood yet.

We are currently working on improving the method, and once we get more data — the orangutan genome has already been sequenced and is now being assembled, and the gorilla genome is in the process of being sequenced — we will redo our analysis. It will be interesting to see how that turns out.

Approximate pattern matching

Tomorrow I’m teaching string algorithms, covering approximate pattern matching and the Wu-Manber algorithm.
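For the curious: the heart of the Wu-Manber algorithm is bit-parallel dynamic programming, where k+1 bit vectors track which pattern prefixes match the text read so far with up to k errors. Here is a rough Python sketch of the shift-and formulation (my own illustration, not the lecture code):

```python
def approx_search(pattern, text, k):
    """Report end positions in text where pattern matches with at
    most k errors (substitutions, insertions or deletions), using
    Wu-Manber style bit-parallelism (shift-and formulation)."""
    m = len(pattern)
    # B[c] has bit i set iff pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    # R[j]: bit i set iff pattern[:i+1] matches a suffix of the text
    # read so far with at most j errors; initially j deletions allowed
    R = [(1 << j) - 1 for j in range(k + 1)]
    matches = []
    for pos, c in enumerate(text):
        Bc = B.get(c, 0)
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & Bc
        for j in range(1, k + 1):
            prev = R[j]
            # match | insertion | substitution and deletion
            R[j] = (((R[j] << 1) & Bc) | old | ((old | R[j - 1]) << 1)) | 1
            old = prev
        if R[k] & (1 << (m - 1)):  # full pattern matched with <= k errors
            matches.append(pos)
    return matches

print(approx_search("abc", "xxaxcxx", 1))  # [4]: "axc" matches with one substitution
```

Each text character costs only a handful of bitwise operations per error level, which is what makes the method fast for short patterns.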

I’m actually also teaching genome analysis but Mikkel is giving the lecture tomorrow, so I don’t have to worry about that.

The good thing about string algorithms is that I have taught it several times before, so there is very little preparation time. I probably ought to spend some more time on it this time, ’cause I don’t find approximate pattern matching that interesting in this class (it is more interesting in algorithms in bioinformatics, the class that Storm is teaching). I wanted to replace it with something else this year, but didn’t find the time.

Just for the fun of it, I’ve started using Slideshare to publish my presentations. I also put the slides on the course homepage, of course, but with Slideshare I can put the presentations directly on the web like this:


Now isn’t that cool?

Whether the slides make sense without someone presenting them, I don’t know. In some sense I hope not, because then I am really wasting my time giving the lectures…

What is association mapping?

My main research area at the moment is association mapping, and in this post I will try to explain what association mapping is, what the challenges are, and what approaches are used. It is a bit of a long post, but I hope I can keep your interest all the way through.

The description is a bit simplified. I wanted to explain my research for a broad audience, so I am leaving out a lot of details. I guess what I am saying is that you shouldn’t try to get an academic degree from reading a blog ;-) Still, I hope you get something out of it, and if you want to know more, do not hesitate to contact me.

Genetic variation and association with disease risks

We are not all genetically identical. If we were, we would all look like identical twins. It is not that there is a large difference between our genes. After all, there is only a few percent difference between humans and chimpanzees, and jokes aside there is a huge difference between humans and chimps compared to the difference between individual humans. But there is some genetic difference between us.

Our genes are part of what determines our phenotype. The phenotype is the “observable quality of an organism” (see e.g. Wikipedia), so think of things such as physical appearance but also inherited “gifts”: a gift for music, a gift for long-distance running, etc. The phenotype is not completely determined by the genes — even with the right “genes for long-distance running” you will not win the Olympic marathon without some serious training — but part of what determines the phenotype is in the genes. Some people, myself included, would never make it to the Olympic marathon regardless of the amount of training we put into it.

Part of our phenotype is the risk we have of getting various diseases. You probably know some people who never seem to catch a cold. When everyone else is sneezing and coughing, these people are as healthy as always. On the rare occasions when they do catch a cold, it is over in a day or two, while the rest of us are down for a week at least. Other people, and you probably also know a few of those, will always catch any virus going around.

The genetic component of a disease — the part of the risk of getting the disease that is determined by the genes — varies a lot from disease to disease. We can get a rough idea of the magnitude of the genetic component by looking at the families of disease cases: if, when you consider the relatives of a disease case, you are more likely to find another case than you would if you considered a random individual and his relatives, then the disease probably has a non-negligible genetic component. This can be formalised through statistics, so we can actually measure the magnitude of the genetic component of a disease and compare the genetic contribution to disease risk between different diseases.
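As a toy illustration of the idea, with invented numbers, one common summary is the sibling relative risk: the disease prevalence among siblings of cases divided by the population prevalence. A value well above 1 suggests a genetic component, although shared environment can inflate it too:

```python
# Toy illustration (made-up numbers) of the "more cases among
# relatives" idea. A sibling relative risk well above 1 points to
# a genetic component -- or a shared environment, which is why the
# two must be disentangled statistically.

def sibling_relative_risk(prevalence_in_siblings, population_prevalence):
    return prevalence_in_siblings / population_prevalence

# hypothetical disease: 1% population prevalence, 8% among siblings of cases
print(sibling_relative_risk(0.08, 0.01))  # roughly 8
```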

For the common cold or the flu, this isn’t quite so simple, of course, since we need to take into account whether the relatives of our disease case have even been exposed to the virus in the first place, or if they have already had the cold but recovered, and so on. For more serious diseases, such as cancers, it is simpler to figure out if there are more cases among the relatives of cases than among relatives of non-cases, but there are still environmental factors to consider, like exposure to the virus in our common cold example. That being said, it is not a major challenge for statistics to figure out if a disease has a major or minor genetic component, and we now know that several serious diseases have major genetic components: several cancers, diabetes and schizophrenia are examples of such diseases.

Searching for disease genes

If we know that a given disease has a significant genetic component, the obvious follow-up question is: which genes are actually responsible for the disease risk?

It is a simple question, but coming up with the answer is more difficult than you might imagine.

The problem is, in essence, that the genome is very large, so there are lots of places where the disease-affecting genes can hide. There are around three billion nucleotides in the human genome (the letters A, C, G and T that spell out all our genetic material), and any of these can, in theory, affect the disease risk.

Luckily, we do not need to examine all of these. The way genes are passed from parents to children through the generations causes the individual nucleotides to be statistically linked, in the sense that knowing about one nucleotide gives you information about many others. Instead of having to look at three billion nucleotides, we can gain essentially the same information from five hundred thousand to a million nucleotides. We just have to be careful to select the right million.
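The statistical linkage in question is called linkage disequilibrium, and a common way to quantify it between two nucleotide sites is the r² measure. A small sketch with made-up haplotype frequencies:

```python
# Sketch of linkage disequilibrium (LD). With haplotype frequency
# p_ab for carrying allele 'a' at one site and 'b' at another, and
# marginal allele frequencies p_a and p_b, the standard measure is
# r^2 = D^2 / (p_a(1-p_a) p_b(1-p_b)) where D = p_ab - p_a * p_b.
# r^2 = 1 means one site fully predicts the other, so only one of
# the two needs to be genotyped.

def r_squared(p_ab, p_a, p_b):
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# made-up frequencies: the two alleles always co-occur
print(r_squared(p_ab=0.3, p_a=0.3, p_b=0.3))  # close to 1.0
```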

Over the last decade, several large projects (most notably the HapMap Project) have focused on mapping the genetic variation in humans, and the knowledge gained through these projects enables us to pick the right nucleotides to examine.

Now all we need to do is look at those selected nucleotides. What we are looking for, then, is nucleotides where there is a difference between disease cases and people not affected by the disease. This, again, is a statistical problem, and not even a particularly difficult one at that.
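As a minimal sketch of such a test (with invented counts), one can compare allele counts between cases and controls at a single nucleotide with a 2x2 chi-square test:

```python
# Minimal sketch of the basic association test: compare allele
# counts in cases and controls at one nucleotide with a 2x2
# chi-square test. The counts below are invented for illustration.

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# (risk allele count, other allele count) in cases and in controls
cases = (300, 700)
controls = (200, 800)
stat = chi_square_2x2(cases[0], cases[1], controls[0], controls[1])
print(stat)  # compare against the chi-square distribution with 1 df
```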

This works in theory, anyway. In practice it is complicated by the relatively small effect each nucleotide usually has on the disease risk. Even for diseases with a very high genetic component, the individual genes will not necessarily have a major effect on the disease risk. You need several unlucky genes to add up to a high risk, so if you look at them individually, you might be looking at very small effects.

This is a problem called statistical power, or rather the lack of power, and to overcome this you either need to consider a very large sample of cases or you need more sophisticated statistical analysis. Most likely, you need both.
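Part of the reason power is so scarce is the sheer number of tests. With a simple Bonferroni correction over a million nucleotides, a genome-wide significance level of 0.05 translates into a very stringent per-test threshold:

```python
# Illustrative: the multiple-testing burden in a genome-wide study.
# A Bonferroni correction divides the genome-wide significance level
# by the number of tests performed.

tests = 1_000_000
genome_wide_alpha = 0.05
per_test_alpha = genome_wide_alpha / tests
print(per_test_alpha)  # on the order of 5e-08
```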

The number of cases is mainly a matter of cost. Each new case you add to a study costs about $1000. For rare diseases, of course, there might not be enough people with the disease, but mostly it is a problem of cost.

The problem with sophisticated statistical methods is mainly a matter of computational efficiency. A lot of very clever methods require hours or days of computer time. If you want to apply such methods to a million nucleotides, we are easily talking years of computer time.

The latter is the problem I am personally doing my research on: the development of computationally efficient, yet statistically powerful, methods.

What use is it, really?

Okay, let’s assume we discover a set of genes that affect the risk of a disease. What then?

Despite the hype that usually surrounds association studies, there isn’t really that much of an initial benefit in knowing which genes affect a disease. If you have a gene that doubles your risk of lung cancer, it might be increasing your risk from 0.5% to 1%. I personally doubt that anyone will change their life over this knowledge. After all, consider how many people are still smoking, knowing full well that the increased risk of lung cancer is orders of magnitude higher than anything your genes might contribute.

The benefits in knowing which genes affect a disease lie downstream in research. Just knowing which genes have an effect on a disease tells us something about the disease. Follow-up research on the genes will tell us how the gene affects the disease, which in turn might tell us how to prevent or cure the disease.

Locating the genes is the first step. The first of many. It is a very important step, but please don’t expect the discovery of a new “cancer gene” to be followed by a cure for cancer in a year or two. We are only at the very beginning of understanding how our genes affect our health.

We are taking the very first steps on a very important journey.

Money, money, money…

I’ve just found out that I have DKK 100,000 left on the grant I am funded by until Feb 1st next year (when I move to a different grant but roughly the same research project). That’s a lot of money left to spend in a month and a half.

The reason I have that much left is twofold: 1) in the budget I added a salary increase one year into the project that I didn’t get until half a year later (apparently I was too young after one year to move from assistant professor to associate professor level) and 2) I never used the travel budget, because most of my travelling was instead paid for by the PolyGene project.

Now I’m looking for ways to spend that money. I’ve ordered a new iMac with as high a spec as I can reasonably get, but that only costs about 20,000. For the rest I am thinking about either adding machines to our Linux cluster at BiRC or — if possible through some clever bookkeeping — finding a way to use the money over the next year on a student programmer.

The latter is the nicer solution, ’cause I really have all the computer power I need through Brian Vinter’s grid if only I finish the framework for accessing it, so I will get more use out of a programmer to help with that (and all the other software development we need in the association mapping group) than I would get out of buying more computers. Not that getting more computers for our cluster would be wasted — we are pushing the limits of our system on a weekly basis, and using our own cluster has some benefits that are missing on Brian’s grid — I just think I can get more out of a programmer.

Of course, getting a programmer requires permission to spend the money over the next year rather than before the end of the grant, so I am not sure how to go about achieving that, but I’ll talk to our accountants to see if it is possible.

Teaching a course the nth time is so much easier than teaching it the first time

I’ve just finished preparing slides for my lecture in string algorithms tomorrow. It took about 15 minutes, and that was spent on reformatting the slides for the new version of OpenOffice.org Impress (for some reason the text always jumps about a bit when the version of OOo changes).

In comparison, I spent all Sunday preparing for my lectures in genome analysis, a class I am teaching for the first time this term. Having a clear idea about what to cover in the lecture, and of course having old slides to pick from when preparing the presentation, really speeds up the preparation.

I guess one of the reasons I have managed so little research this year is that I have taken on three completely new courses (in addition to two old ones) to teach. Next year I only have one new class to teach and four that I have already taught before, so that isn’t so bad.