TED: Teaching math

February 3rd, 2011

A little while ago, I voiced my desperation about teaching stats to computer scientists. Or rather the problems of framing this so they would find it interesting.

I went with some problem descriptions and trying to reduce stats to “figuring out what is going on” rather than arithmetics. I don’t think I nailed it that well, but at least I tried.

This TED talk is quite relevant to this:

Math is not arithmetic, but ways of reasoning about the world in a quantitative way.

The introduction to Modern Heuristics, that describes that without a context getting you thinking about the right models, most people cannot solve rather elementary math problems. Not even math professors. Math is about formulating the models, identifying the problems, it is not about doing equations.

I wish I was better at teaching this.

The arithmetic (or in general most of the equation manipulation) is something computers do better. Let’s focus on modeling the world.

You give science a bad name!

February 3rd, 2011

I wasn’t going to comment on this, but it is getting ridiculous now.

There’s a case of scientific fraud going on at Copenhagen University, that isn’t just your “typical” fraud case, but is reaching the absurd.

Recent coverage (in Danish) here, here and here.

It is one of the poster scientists from Copenhagen U that is not only on trial for scientific fraud, but it has reach the level of legal actions against her for falsifying documents. She was reported to the police for this recently. It has even reached the ministerial office. This is just crazy!

This is someone who got the “elite scientist” award a few years back. I got the junior version in 2008, but I am feeling a bit ashamed of this now.

In an independent case — and for legal reasons, due to closed doors policies at court we have to assume it is an independent case although everything points towards the same scientist — a research at Copenhagen U tried framing a student for the fraud. Not just scientific but financial.

I find it hard to grasp how someone who has been known to have dodgy results — and this is an old case — was used as the poster child by the university, but even more I find it hard to grasp how scientific fraud can be taken this far.

I can sort of see how the cut-throat competition in a field where there are so few tenure positions per scientist can lead to desperation. I can see how someone who really believes in a hypothesis can doctor some experiments by avoiding the evidence against and promote the evidence for the desired result.

Not accept, but partly understand. Assuming that they believe that further studies will eventually prove the hypothesis.

But once you start sliding down that slippery slope, eventually you must know that you will be caught. We are in a field where everything is tested again and again.  You will eventually be confronted with dodgy results.

How can anyone believe that they will get away with it? That, I don’t get.

What I consider the strongest point of science is that the whole system is based on figuring out the truth. We have a system designed for that specific purpose, and it is amazingly powerful. If a theory is wrong, we will eventually find out. Then we will start to question why we thought otherwise earlier, and if it is because of dodgy science, we will figure that out.

Trying to cheat in science is silly. The whole system is designed to figure that out, and I feel angry that someone will even try!

Not only that. I get angry thinking about the time and money spend trying to validate results that are just based on bogus.

There is a reason we consider scientific fraud the highest sin a researcher can commit.

I feel insulted and really angry about this case.

CoalHMM analysis of the human/chimpanzee ancestor, based on the orangutan genome

February 3rd, 2011

I’ve been wanting to write about our paper on the orangutan genome for a while, but I’ve just been too busy so far, so a little late I finally get to it.

Besides the Nature paper, where we contributed to the analysis of the two sub-species of orangutans, we have two companion papers. One is already out in “early access” at Genome Research and the other will be out later in PLoS Genetics. Since the latter paper is not out yet, this post will be about the Genome Research paper.

Coalescent in an isolation model

Since all our work is based on coalescent theory and in particular CoalHMMs, I’ll start there.

Imagine we have two species, and we sample a gene in each. We can then ask, what is the divergence between the two genes? This divergence will be determined by 1) the divergence of the two species, let’s call that T, and 2) the coalescence time between the two genes within the ancestral species, let’s call that C.

The species divergence we assume is fixed for all genes, so while it is unknown it is not a stochastic variable. The coalescence time, however, is stochastic, and from coalescence theory we expect it to be exponentially distributed with a rate determined by the effective population size in the ancestral species.

We call this setup an isolation model, and we will use the distribution of divergence times to make inference about the speciation time and the effective population size in the ancestral species.

The figure below illustrates the setup.

If C is exponentially distributed, and the divergence is given by D=C+T, then we can make inference about both parameters as follows: We sample a number of independent genes and get their divergence time. For the exponential distribution, the mean is equal to the standard deviation, so looking at the standard deviation of the divergences we can get the parameter for the exponential distribution. That gives us the mean value of C, and if we then look at D-E[C] we get an estimate for T.

Below is an example of this, where I’ve estimated the coalescence rate and divergence time from 50 divergence samples.

Complications

This is all very simple, but there are a few problems.

First, you don’t really get independent samples of the divergence time between two species. If you sample n individuals from the first species and m from the second, the n in the first species will all have found a common ancestor before that lineage reach the ancestral species, and the same goes for the m samples in the other species. So no matter how many individuals you look at, you end up with a sample of two in the ancestral species. I’ve written about this before here.

It is not a show-stopper, though, since genes in different parts of the genome are close enough to independent. So if you sample different loci instead of different individuals, you get your independent samples. So while adding more individuals won’t help, having an entire genome to look at gives you plenty of samples.

The second problem is that we cannot actually get samples of the divergence time. You cannot look at two pieces of DNA and from that get their divergence. You need to estimate it. It isn’t really that hard, since you can get a good estimate from the number of differences between the two sequences. That is, if the entire alignment of sequences have the same divergence time.

If there is a recombination somewhere in the sequences, they do not have the same divergence time, and you cannot estimate the divergence.

You can get around this by looking at short DNA segments, where you expect few if any recombinations. You won’t get a good estimate of the divergence then, but you can maybe alleviate this by having a lot of genes (but estimating the coalescence rate based on a standard deviation that have contributions from both the coalescence process and the estimation problems is, well, problematic).

You’d also have to throw most of your data away if you are looking at short segments scattered along the genome (and you cannot have them too close to each other, because then they will no longer be independent).

The CoalHMM approach

The models we develop to deal with this are based on hidden Markov models.

Using these models, we can estimate the divergence time for single nucleotides. Normally you cannot, since they are either equal or difference, and that doesn’t tell you much about their divergence (is it zero for equal and infinity for different?). We can do this, because the flanking DNA contains information about this, whether recombinations have occurred or not, and we can capture this information through the Markov model.

It is a rough approximation to the coalescence process, but as far as we can tell, it works reasonably well.

We are getting pretty close to being able to estimate the distribution of divergence times using hidden Markov models, but the model we use is the one that will be published in PLoS Genetics soon and not the model we used in the Genome Research paper, so I’ll wait a bit with describing how that works.

The model we used in the Genome Research paper is the one described in this paper.

In this model, we do not attempt to estimate the actual divergence times, but instead use something called incomplete lineage sorting. The idea here is, that if we have a third species closely related to the other two, then sometimes the two sister species have such deep divergence times, that one of them can end up being closer related to the third species than its sister species.

This leaves a stronger signal in the DNA and is thus easier to model and make inference about.

The model based on this needs only four states: one state where the two sister species coalesce early, and three states with deep divergence. If the divergence is deep, the topology of relationships between the species should be uniform — each topology is seen with one third probability — and how often we see deep divergences is given by the two speciation times together with the effective population size of the ancestor of the sister species.

As we scan along a genome alignment, we can infer how often we see recent divergences and how often we see deep divergences, and how the deep divergences are distributed along the three topologies.

Below is a figure that Julien made for illustrating this.

With this model, you don’t extract as much information from the genomes as you would if you could estimate the divergence times, but with full genomes to work with, you have plenty of information to get good estimates.

You need three closely related species to work with, though.

Incomplete lineage sorting patterns among human, chimpanzee and orangutan suggest recent orangutan speciation and widespread selection

And now, finally, we get to the paper.

Incomplete lineage sorting patterns among human, chimpanzee and orangutan suggest recent orangutan speciation and widespread selection
Asger Hobolth, Julien Y. Dutheil, John Hawks, Mikkel H. Schierup and Thomas Mailund

Abstract

We search the complete orangutan genome for regions where humans are more closely related to orangutans than to chimpanzees due to incomplete lineage sorting (ILS) in the ancestor of human and chimpanzees. The search uses our recently developed coalescent HMM framework. We find ILS present in ~1% of the genome, and that the ancestral species of human and chimpanzees never experienced a severe population bottleneck. The existence of ILS is validated with simulations, site pattern analysis, and analysis of rare genomic events. The existence of ILS allows us to disentangle the time of isolation of humans and orangutans (the speciation time) from the genetic divergence time, and we find speciation to be as recent as 9-13 mya (contingent on the calibration point). The analyses provide further support for a recent speciation of human and chimpanzee at ~4 mya and a diverse ancestor of human and chimpanzee with an effective population size of ~50,000 individuals. Posterior decoding infers ILS for each nucleotide in the genome and we use this to deduce patterns of selection in the ancestral species. We demonstrate the effect of background selection in the common ancestor of humans and chimpanzees. In agreement with predictions from population genetics, ILS found to be reduced in exons and gene dense regions when we control for confounding factors such as GC content and recombination rate. Finally, we find the broad scale recombination rate to be conserved through the complete ape phylogeny.

In this paper we used humans, chimpanzees and orangutans.

The first question to ask is then, are these three species close enough that we see incomplete lineage sorting?

Without it, we don’t have the signal in the data that we need for the model.

Based on previous estimates of the species divergence times and ancestral effective population size of humans and chimpanzees we could work out that some was expected. So that is a good start. To make sure, though, we used some simpler approaches. We looked at indels to check if there would be signals in these supporting clustering of human and orangutan or chimp and orangutan and found that. We also looked at the distribution of alignment columns and again found some signals for alternative topologies of the three species. So with that checked, we applied the model.

From the model we estimate three things: 1) The speciation times for humans and chimps, and from the African apes and orangutan, 2) the effective population size of the ancestral species, and 3) in which regions of the genome humans and chimps, humans and orangutan, and chimp and orangutan are closest related.

I won’t say much about number two. The effective population size is a weird parameter that can be affected by so many things, that it is really hard to interpret, and right now we just don’t know what really is important, so I’d rather not make any claims (but I’ll say a few things about local effective population sizes towards the end of the post).

Number one is interesting because it tells us something about when humans diverged from the other two apes. Our estimates are measured in the number of substitutions since the divergence, but assuming a molecular clock and assuming we have a good estimate of the rate we can get an estimate in years.

Assuming a rate of around 1 substitution per nucleotide per billion years — an estimate based on several earlier papers that get this number from calibrations with the fossil record — we get a human/chimp speciation around 4-4.5 million years ago, and a human/orangutan speciation around 11-13 million years ago.

I really don’t know how reasonable this is, in relation to the fossil record, so this is when we got John Hawks involved. I have my fingers crossed that he will blog about this at some point.

There are good reasons to be a bit skeptical, though. From recent studies, we know that the substitution rate is lower in humans today, and if that is also true in the past, the estimates should be moved further back in time. We cannot get too far back, though, without running into inconsistencies in the deeper past, but how this will all play out once we do more analysis I cannot say yet. It is something we look into for the gorilla genome (and I’ll just leave that as a cliff hanger for now, I’ll get back to it when we have published that genome).

For number three, I don’t really know. You might be surprised that we are sometimes closer related to the orangutan than the chimpanzee, or you might not. It depends on your prior assumptions, I guess.

We didn’t really find anything cool correlated to the patterns of relatedness, so we don’t have much of a story to tell about this.

Ancestral selection

The final thing we looked at in the paper was correlations between incomplete lineage sorting and gene density.

Why this is interesting gets a bit technical but has to do with the effective population size.  As I mentioned above, it is a bit of a weird parameter, but one that is affected by selection. If you have a selective sweep the genetic diversity is reduced, and you see this as a reduction in the effective population size. The same effect is seen with purifying selection, where again the genetic diversity is reduced and so is the effective population size.

Incomplete lineage sorting is positively correlated with the effective population size, so if you observe a correlation between incomplete lineage sorting and gene density, it is a signal for selection.

We observe this, and take it as a signal that selection rather than just drift has been a major player in the evolution of our genome.

How much of a surprise this is depends on your prior assumptions again, I guess, but it does indicate that neutrality may not always be the obvious null model for genome analysis.

It is a pretty weak signal for this, though, in this analysis. We see so little incomplete lineage sorting for these three species that it is really hard to analyse it in detail.

When we get human, chimp and gorilla, there is a lot more incomplete lineage sorting, and we can do a lot more. We are seeing some cool signals there, but I’ll let that be the second cliff hanger for the gorilla genome paper.


Hobolth, A., Dutheil, J., Hawks, J., Schierup, M., & Mailund, T. (2011). Incomplete lineage sorting patterns among human, chimpanzee and orangutan suggest recent orangutan speciation and widespread selection Genome Research DOI: 10.1101/gr.114751.110

Orang genome video

January 29th, 2011

Interview with Devin Locke, the first author on the genome paper.

Some places, not everywhere…

January 26th, 2011

Just to avoid confusion, if you read this, it doesn’t imply this.

We are suggesting that humans and orangutans are closer related than either to the chimpanzees in ~0.5% of the genome (and chimpanzees and orangutans are closer related than either to humans in another ~0.5%). That has to do with incomplete lineage sorting, and does not, in any way, imply that we as a species are closer related to orangutans than to chimpanzees.

Oh, and it doesn’t really mean that orangutan is our closest living relative in ~0.5% of the genome either. It is just closer than chimpanzee, but the gorilla could be closer related to us in those positions, so…