Archive for the ‘Research’ Category

On segment lengths, going back in time, in the coalescence process. Part 3: Segment lengths of MRCAs of two species

Sunday, December 27th, 2009

A while ago I wrote the first two posts in this “series” on the coalescent process and how genomic segments behave as a dynamic system back in time.  Read Part 1 (about a single sequence) here and Part 2 (about two sequences) here.  I wanted to write quite a bit more about it, but what happened was that I managed to translate it into a method for population genomics and I was busy working on that and writing a paper about it so I left the blogging for a while.  The paper is now under review, and I have time to write about some of the ideas here again.

As you may recall, we can think of the ancestry of a genomic sequence as a dynamic system that, when run back in time, splits up into different lineages through recombination and then merge again through coalescence events.  This is just the classical coalescence process with recombination, and the result of running it is the so-called Ancestral Recombination Graph, or ARG.

We can use this model both at the population level, but also at the inter-species level (where the ancestries just go back a lot longer in time).

The mathematics is fairly easy to deal with when simulating — it is a relatively simple continuous time Markov chain (CMTC) — but because the state space is rather large (infinite if you consider a continuous segment as the genomic segment), inference is rather difficult as is working out some of the properties of the system.  For example the different properties I wrote about in Part 2.

Markov approximations

In the work we are doing here at BiRC on “ancestral population genomics” we work around this problem by making a simplifying assumption.  We assume that the process is not only a Markov chain in time, but also along the genomic segment.  What this means is that instead of dealing with the enormous — or infinite — state space of the coalescent with recombination, we can consider only neighbouring pairs of nucleotides, work out their dynamics, and extrapolate from there using the Markov assumption.

In Part 1, I described a simple example of this, when considering the dynamics of a single genome going back in time, and described how the Markov assumption is incorrect, but still reasonable in approximating the coalescent process.

In this post I am going to approximate the ancestry of two (haploid) genomes through a CTMC modeling two neighbouring nucleotides from each of the genomes.

The states of the system is shown below:

CTMC statesEssentially the states captures all combinations of a left nucleotide being linked with a right nucleotide (a black line or no line) between the two sequences (top black dots and bottom black dots) and the nucleotides having found a most common recent ancestor (MRCA, the white dots) or not (the black dots).

You can move between the states through recombination events — that remove the links between right and left nucleotides — or coalescence events — that either adds a link between a left and right nucleotide or joins two left of two right nucleotides in their MRCA.

The rate matrix for the system is shown below, where the blank cells are zero entries and the “-” cells contain -1 times the row sums, as usual.

Rate matrix

Here C is the coalescent rate and R the recombination rate.  Typically, we use a time scale in coalescent theory of 2 N_e generations per time unit in which case C would be 1.  If N_e is 10,000 as for humans, then a recombination rate of 1 cM/Mb would be a rate of R=0.004 on that time scale.

When considering two genomes back in time, we start in one of the states in \Omega_B — the “beginning states”.  If we have two haploid genomes and we start the system at time zero, we would probably start in state 4 where both the genomes are linked, but if we consider an inter-species system we would get some probability distribution over the states in \Omega_B from running the single sequence system from Part 1.

The states in \Omega_L (or \Omega_R) corresponds to states where only the left (or only the right) nucleotides have found their MRCA while the right (or left) have not.  The “end states”, \Omega_E contains the two states where both left and right nucleotides have found their MRCA, and is essentially just the single sequence system from Part 1.

Running the system back in time, the probability of being in one of these four classes of states evolves as this:

Dynamics of the CTMCBeing in either \Omega_L or \Omega_R is always somewhat unlikely, since it requires that a recombination has decoupled a left and right nucleotide before finding a MRCA, and it is more likely that the left and right nucleotides find their MRCA in the same coalescent event, moving directly from \Omega_B to \Omega_E.

Deep and shallow coalescent segments

With a finite state CTMC as the one above, we can compute pretty much everything of interest, and there is an extensive theory for how to manipulate these systems.  If we want to say anything about the ancestry of genomes, though, we have to remember that the system only deals with two nucleotides, which in anyone’s book is quite a bit shorter than a genome.

To get from a pair of nucleotides to a full genome, we use the Markov assumption.  When modeling n nucleotides, we assume that we can model the ancestry of number n conditional on the previous n-1 by only conditioning on number n-1, and then we can compute the probabilities of those ancestries using our two-nucleotide CTMC.

For the rest of this post, I’ll give an example of this, and then give other examples in following posts.  When the paper I have under review now is out, I’ll also explain how this CTMC can be used to construct a CoalHMM for population genomics.

Anyway, the example: Let us consider the human-chimp-gorilla trio.  We know that these three species are very closely related, to the point where there is extensive incomplete lineage sorting.  This happens when we have deep coalescences, compared to the speciation times.  If the human and chimp lineages have not found their MRCA before the gorilla speciation (when considering the process back in time), then it is possible that the human coalesces with the gorilla before the chimp, giving us a ((human,gorilla),chimp) genealogy, and similarly for chimp and gorilla coalescing before reaching the human.

Some things about this system are easy to work out from the coalescence process. Working out the probabilities for deep coalescences and incomplete lineage sorting is pretty simple.  The coalescence process has a simple exponential distribution, so figuring out the probability that the human and chimp have not coalesced before the gorilla speciation (as a function of effective population sizes and speciation times) is straightforward.

If the lineages have not coalesced before the gorilla speciation, then there are three possible topologies for the genealogy, ((human,chimp),gorilla), ((human,gorilla),chimp) and ((chimp,gorilla),human), each equally likely, and only one of which, ((human,chimp),gorilla), is congruent with the species tree.

Congruence graphSo, e.g. to work out the probability of seeing a gene tree equal to the species tree, you just need to work out the probability that human and chimp coalesce before or after the speciation time.  The gene tree is congruent if they do, or 1/3 of the times when they do not.

Congruence equationAll in all pretty simple, but it is only telling us how many nucleotides we expect to see congruent with the species tree, not anything about how they are distributed along the genome.

Here, I am going to consider the slightly simpler question of segments above or below the speciation time.  It is straightforward to work out the probability of seeing a nucleotide above \Pr(T>\tau) or below \Pr(T\leq\tau) the speciation time, but what is the pattern of segments above or below along the genome?

Let’s call a contiguous segment of nucleotides that all coalesce further back than the speciation gorilla time a “deep coalescence segment” and a contiguous segment of nucleotides that all coalesce before the gorilla speciation a “shallow coalescence segment”.

Oh, by the way, before above means closer to the present … it is confusing, I know, but time runs backwards in coalescence theory.

We can use the Markov approximation to work out the mean length of deep and shallow coalescence segments as a function of the speciation time.

Under the Markov approximation, we can argue as follows: For a deep coalescence, consider a left nucleotide in a deep state.  The nucleotide to the right of it is either also deep, or it is shallow.  If it is deep, then the next one can be deep or shallow, and so on.  Under the Markov approximation, the probability of moving from deep to shallow is the same in each such step, and the length of a deep segment is given by the waiting time for moving from deep to shallow, a waiting time that is geometric distributed.

For the CTMC states, we have the left nucleotide in a deep state if we are in one of the states in the classes \Omega_L or \Omega_B at the speciation time, while we have the left nucleotide in a shallow state if we are in one of the states int he classes \Omega_R or \Omega_E.

If the left nucleotide is in a deep state, then the right is also in a deep state with probability \frac{\Pr(\Omega_E)}{\Pr(\Omega_R)+\Pr(\Omega_E)} while the right is in a shallow state with probability \frac{\Pr(\Omega_R)}{\Pr(\Omega_R)+\Pr(\Omega_E)}.  Similarly for when the left nucleotide is in a shallow state.

The mean length of a deep coalescence segment is then 1 / \frac{\Pr(\Omega_R)}{\Pr(\Omega_R)+\Pr(\Omega_E)} while the mean length of a shallow coalescence segment is  1 / \frac{\Pr(\Omega_L)}{\Pr(\Omega_L)+\Pr(\Omega_B)}.

Very easy to compute with the CTMC, but pretty hard from the coalescence process.

Of course, it is only an approximation, so we should worry a bit about how accurate it is.  I’ve simulated some data sets using the coalescence process, using a human-chimp divergence of 4.5 mya, a present day effective population size of humans and chimps of 10,000 or 50,000 and an effective population size of the human-chimp ancestor of 50,000 or 100,000.  The plots below shows the distribution of deep and shallow segments as the boxplots, the mean length of these as the green bullet, and the expected mean lengths as computed above as the blue line.

10K-50K10K-100K50K-50K50K-100KAll in all, it is a pretty good fit, so the approximation looks okay.

If we assume that the present day Ne of humans and chimps is 20,000 (a bit high for humans and a bit low for chimps, but on average okay) and that the human-chimp ancestral Ne is 50,000 (as often estimated) then with a gorilla branching off 1 milion years before the human-chimp speciation, we expect that segments that coalesce deeper than the speciation are ~116bp while segments that coalesce before are ~76bp.

Of course, this is only on average, and the true picture is more complex, but it does tell us that the genomic relationship between these three species is somewhat complex with a lot of short genomic fragments with MRCA in the human-chimp ancestral species or with MRCA in the shared African ape ancestor.

Draft genomes and finished genomes

Tuesday, December 8th, 2009

Finch Talk has a nice post about draft and finished genomes, slightly related to my post from a few days ago.

342-340=+2

Some feedback would be nice…

Tuesday, December 8th, 2009

I just got two research grant applications rejected this week.

Nothing wrong with that, really, the success rate where I applied is less than 10% so I couldn’t really expect to get the grants.  Still, it annoys me a little bit that the letter I got back is just boiler plate.  “We receive more applications than we can fund, and unfortunately we could not fund yours.”

That doesn’t tell me if I was close to getting funded or lightyears from it.  It certainly doesn’t help me improve on the application if I want to try again.

Some feedback on the applications would really be nice.  If not a full reviewer report, then at least a score or something…

Oh well, I still have one grant application under review, this one for the EU Research Council, and here at least I’ve already gotten a review report back, with scores and everything, and it looks really nice, so I have my fingers crossed…

342-339=+3

Ardi and chimps

Tuesday, October 6th, 2009

As I have said before, I don’t know enough paleontology, so I’m just speculating now and would love to be corrected by someone with more knowledge in the field…

One thing that bothers me with the whole Ardipithecus brouhaha is its age and its relationship with chimps.  Simply because I cannot completely reconcile it with the genetic estimates of speciation with chimps.

Estimating speciation times

First some observations on species divergence, genetic distance and fossils… If we consider even the simplest scenario for speciation — an allopatric speciation (and more complex scenarios will just exaggerate the problem — a population is split into two that eventually evolve into separate species.

Due to the population genomics in the population before the split, individuals in the two resulting populations will have different degrees of relatedness within and between the populations.  On an individuals level, eventually everyone within a population will be more related than between populations, but because of recombination the relatedness between the populations — and eventually species — will vary along the genome and the average sequence divergence will be greater than the divergence between the populations / species.

If we date the speciation by considering genomic differences, we are thus overestimating the time back to the ancestor.

If we consider morphological changes between the two populations — as we do if we consider fossils — those will necessarily have evolved after the split.  At least if we are looking at morphological traits that clearly separates the two populations and not just variation within a single species.

So for fossil evidence we will get an underestimate of the speciation time.

Speciation, fossil evidence and sequence divergence(Please ignore that the skull in the illustration is a gorilla, it was what I had lying around when I made the figure, but my focus is on chimps so that is the speciation I am illustrating)

Now, human-chimp sequence divergence looks like it is about 6 million years ago, but remember that this is sequence divergence and not species divergence.  The speciation event must be more recent.

Through statistical models of population genetics we can estimate the speciation time from the sequence divergence time and if we do, we get estimates of the human chimp species divergence closer to 4-5 million years ago (Dutheil et al. 2009; Hobolth et al. 2007; Patterson et al. 2006).

This puts Ardi, at 4.4 million years ago, smack on top of the speciation event!

Humans, chimps and Ardipithecus

Now I’m not saying that I trust the genetics more than the paleontology.  I understand it a whole lot better, but that is a different issue.

One thing that is clear, though, is that if the genetic estimates are true, then Ardipithecus cannot have evolved a whole lot away from the common ancestor of man and chimp.  It actually could be the last common ancestor, just judging from the time estimates.

If we believe that Ardi shows us that many of the traits we believed were derived traits on the human line are actually ancestral traits that changed on the chimp line instead, then we could put Ardi close to the human-chimp split and reconcile the genetic and paleontological estimates for the time of the split.

If it has evolved significantly towards humans it is back to the drawing board for the genetic models.

Which is it?

I really look forward to see what John Hawks has to say about it when he gets to it in his Ardipethicus FAQ

279-311=-32

The problem with confidence intervals

Friday, August 7th, 2009

I’ve ranted plenty about the problem with p-values and the common misunderstanding that the p-value is the probability of the null model being true.  But what about confidence intervals?

It is a common misunderstanding that if you have a 95% confidence interval, then the parameter you are estimating is within the interval with 95% probability.  This turns out to be just slightly wrong, and where the misunderstanding about p-values simply doesn’t make sense at all considering how p-values are used, you cannot really run into much trouble with this misunderstanding.

Inference uncertainty

Okay, so what is a confidence interval, and why is the above a misunderstanding?

A confidence interval is something we use when estimating parameters that is supposed to capture the uncertainty there is in the inference.

Let’s say I want to know the weight of the water in a small lake close to where I live.  Don’t ask why, what I do in my free time is none of your business, and anyway it is just an example so play along!  So I go down to the lake and get a liter of the water and weigh it. Yeah, it is probably going to be close to 1kg, since that would be the weight of pure water and the lake water presumably is mostly water, but it will not be exactly that because of all kinds of impurities in it.

So the parameter I’m interested in is the weight of the water.  The water is probably reasonably homogeneous so it doesn’t matter where I sample the water.

If I just do this once, I will have a single measurement and that will be my estimate of the parameter, but homogeneous or not there might be a bit more mud or something in the sample I take than usual, or a bit less, so there could be some measurement error so my one measure is not the true value.

You all know how to get around this: you make several samples and use the average as the estimate.

All well and good, but once you start making several samples to get the parameter, you have already admitted that you have some uncertainty in your measurements, and that leaks into uncertainty in your estimate.  The degree of uncertainty, however, cannot be seen from the final estimate.

This uncertainty depends on the variation in your measures, of course – if they vary a lot you are less certain than if they are all roughly the same – but also on the number of samples – with just the first sample there is no variation in measures but that doesn’t mean that you will trust that one measure more than the average of ten measures.

Confidence intervals

It is this uncertainty that is captured by the confidence interval.

Before we can quantify the uncertainty at all we need to set up a model of the data, of course.  There are no “one size fits all” with that, but lots of simple models that often work.

For the water weight example a good first attempt would be to assume that the measurements were stochastic variables X distributed as X\sim N(w,\sigma^2) where w is the true weight – the parameter we wish to infer – and \sigma^2 the variance in measures caused by the measurement uncertainty, a nuisance parameter.

When estimating w we sample n observations and use the average value, so from the model of individual measurements we also get a model of the estimator.  In this particular case from the central limit theorem.

In general, a confidence interval for parameter \theta is two statistics l(X) and u(X) – functions of the stochastic variable X – satisfying that P\left(l\left(X\right) \leq \theta \leq u\left(X\right)\right)=1-\alpha where 1-\alpha is the confidence level, typically 95%.

This might look a little complicated, but it isn’t really.  Just like the mean of a set of observations – we could call it m(X)=\frac{1}{n}\sum_{i=1}^nX_i – is a way of summarising an aspect of the data, so are l(X) and u(X), and just as the mean of n observations is a stochastic variable (until you actually make the observations) so will l(X) and u(X) be.

The trick, of course, is defining them so they satisfy P\left(l\left(X\right) \leq \theta \leq u\left(X\right)\right)=1-\alpha.  In general this is not easy to do, but for our example here it is.

We know that our estimate – the mean of observations – is distributed as \hat{w} \sim N(w,\sigma^2/n) when \hat{w} is the mean of n observations.  That’s what we got from the central limit theorem.

So given w and \sigma we can easily define an interval that \hat{w} will fall into with probability 95%, and the obvious choice here would be the symmetric one around w.  We could call this the 95% interval (but not confidence interval quite yet since it doesn’t refer to any given estimate or observations yet).

It has the properties we would expect from a confidence interval, though, it depends on the underlying uncertainty in the model, \sigma^2, and it gets smaller as we get more data and reduce our uncertainty.

The problem is, of course, that this is completely useless since we don’t know w and \sigma and we need those guys to get this interval.

Well, not quite useless.

Because we know the distribution of \hat{w} we know how much larger or smaller than w we expect it to be (the 95% interval from above), and since \hat{w} is symmetrically distributed around w it is the same distances when \hat{w} is above and when it is below w.

So we can get the 95% confidence interval for \hat{w} by taking the width of the 95% interval (for w) but centering it at \hat{w}.

For the confidence interval defined this way, for w not to be contained in the confidence interval, \hat{w} would have to be either smaller than the lower limit of the 95% interval or larger than the upper limit of the 95% interval, in other words for w not to be contained in the 95% confidence interval, \hat{w} would have to not be contained in the 95% interval which it is with 95% probability, so the confidence interval defined this way satisfy the requirement we made for it.

(Yes, I know that I also need to know \sigma for this to work, and there is some added uncertainty from using an estimate rather than the true value.  The width of the interval, for example, will be a function of the estimate of the variance if you do as above, and if you underestimate \sigma your interval will be too narrow.  Go read a statistics text book if this bothers you, for the example here I’m just going to ignore it.)

What’s with the misunderstanding about confidence intervals?

Ok, so that really does sound like if we have a confidence interval, then the true parameter is within the interval with 95% probability, right?

Yes, almost, which is why I wrote that this misunderstanding doesn’t matter much and won’t get you into any problems (unlike the p-value misunderstanding that leads to craziness and probably cases cancer and global warming as well).

The distinction is almost philosophical and has to do with what we consider stochastic and what we consider fixed.

Here, we consider the parameter w unknown but fixed.  It doesn’t vary and there is no stochasticity to it.  So it simply doesn’t make sense to talk about the probability of it being in any given interval.  It is either in the interval or not in the interval and that is all there is to it.

The confidence interval, on the other hand, is stochastic.  If we did the experiment 10 times, we would get 10 different confidence intervals, all which would contain the true parameter with probability 95%.

The confidence interval will contain the true parameter with 95% probability, but the true parameter does not fall within the interval with 95% probability.

The difference is so subtle that it borders the ridiculous to even talk about it.

Credible intervals

In Bayesian statistics it is the other way around.  There you do get fixed intervals and random parameters.

Here parameters we are uncertain about are not considered fixed but unknown, but are assumed to be stochastic. Based on observations and a prior probability distribution you would infer a credible interval that is then considered fixed, while the parameter is stochastic (but falls within the interval with 95% probability).

Anyway, if you have a confidence interval, don’t say “with 95% probability the parameter lies in the interval [a,b]…” since strictly speaking that is incorrect.  For a credible interval it would be correct.

In practise, it makes no difference what so ever, so don’t worry too much about it.

219-219=0