Estimating Divergence Times
Below is the introduction text to some lecture notes I’m working on. I’m putting them up here to get some feedback, since this is the part in my lecture notes I am the least sure about. The rest of the notes will be on mathematical models, and I am pretty confident that I understand those, but my paleontology knowledge is shaky at best, so any corrections, comments or suggestions for papers I should read will be most welcome!
Fossil and genetic evidence: Lower and upper bounds
When estimating the evolutionary relationship between species, we have two sources of data we can use to date when species diverged: fossil evidence and genetic evidence. The latter relies on the molecular clock assumption, which lets us estimate divergence times from the observed differences between genomic sequences. Both are by their very nature biased, but in opposite directions. Dates based on fossil evidence give us lower bounds on the speciation time, while genetic evidence gives us upper bounds on the speciation time [1].
Fossils can be dated reasonably accurately through physical or geological methods, but placing them relies on morphological differences between species. Morphological characteristics unique to one group of species, when found in a fossil, tell us that the group diverged from other species before the time when the fossil species existed. Deciding which morphological features are unique to a given group of species is, of course, somewhat subjective, but ignoring this, the fossil date is only a lower bound on the age of the species split, since the morphological features must have evolved after the split. How long it took for these features to evolve, and how close the fossil is to the emergence of the features given possible gaps in the fossil record, determine how tight the lower bound is.
For genetic data, on the other hand, population genetics in the ancestral species influences the dating of species splits. The coalescence process [2] in population genetics means that when we consider two genomes in the same population, they have a most recent common ancestor (MRCA) some distance back in time. When considering two genomes from different species, the time back to the MRCA is the sum of two parts: the time since the species diverged, plus the time it takes the two lineages to find a common ancestor within the ancestral species.
The divergence of genomes within a species depends on the effective population size, a technical term referring to the size of the reproducing population. The larger the effective population size, the further back in time the MRCA will be found. On average, the number of generations back in time to the MRCA of two genomes equals the effective number of genomes in the population. So for two genomes in the same species, we expect their MRCA to be found 2Ne generations back in time, where Ne is the number of reproducing diploid individuals; the factor of two arises because a diploid population of (effective) size Ne contains 2Ne genomes.
The genetic divergence time between two species is therefore, on average, the time since the species split plus 2Ne generations, with Ne the effective size of the ancestral species, and the genetic divergence is thus an upper bound on the species divergence.
For humans, the effective population size is ~10,000, so two random human genomes are expected to have diverged around 20,000 generations ago, or 400 thousand years ago (kya), assuming a generation time of 20 years along the lineages back to the MRCA. This puts the sequence divergence of humans, whose species divergence is of course zero, back to a point before the emergence of modern humans and before the split between modern humans and Neanderthals.
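The arithmetic behind the 400 kya figure can be written out as a small Python sketch. The function name is my own, and it only encodes the 2Ne expectation quoted above, using the human values from the text.

```python
def expected_pairwise_coalescence_years(effective_size, generation_time_years):
    """Expected time back to the MRCA of two genomes from the same species.

    Uses the expectation of 2 * Ne generations for a diploid population
    and converts generations to years with a fixed generation time.
    """
    generations = 2 * effective_size
    return generations * generation_time_years


# Values taken from the text: Ne of about 10,000 and 20 years per generation.
human_years = expected_pairwise_coalescence_years(10_000, 20)
print(human_years)  # 400000
```

Note that this is an expectation only; the coalescence time for any particular pair of genomes, or any particular stretch of the genome, varies widely around it.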
The molecular clock
We cannot directly observe the divergence between genomes, so genetic dating of speciation relies on the observed differences between genomic sequences. An underlying assumption when doing this is that mutations to genomic sequences occur at a constant rate through time, so the number of mutations is proportional to the time separating the genomes, which is two times the divergence time, since mutations occur on both branches of the split.
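As a minimal sketch of what the clock assumption says, the expected number of mutations per site separating two genomes is twice the rate times the divergence time. The function name and the rate used below are purely illustrative assumptions, not estimates from the literature.

```python
def expected_mutations_per_site(rate_per_site_per_year, divergence_time_years):
    """Expected mutations per site separating two genomes under a strict clock.

    The factor of two accounts for mutations accumulating independently
    on both branches leading away from the common ancestor.
    """
    return 2 * rate_per_site_per_year * divergence_time_years


# Hypothetical rate, chosen only to illustrate the proportionality.
illustrative_rate = 1e-9

for time_years in (3e6, 6e6, 12e6):
    mutations = expected_mutations_per_site(illustrative_rate, time_years)
    print(f"{time_years:.0e} years -> {mutations:.4f} mutations per site")
```

Doubling the divergence time doubles the expected number of mutations, which is all the molecular clock assumption amounts to.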
We cannot directly observe the number of mutations that occurred between species either. We can only observe the differences between the sequences we have. Mutations that occur on lineages that are eventually lost from a population, because they leave no present-day offspring, cannot be observed; only those that survive in the genomic sequences we sample can. Mutations that spread to the entire species are said to be fixed in the species, and we call such mutations substitutions. When comparing genomic sequences from different species, we mainly observe such fixed mutations, unless the species are very closely related and polymorphism in the ancestral species has not yet been fixed within the descendant species.
For neutrally evolving sequences, that is, sequences not under selection, the rate of substitution is equal to the rate of mutation [2]. That is, the rate at which new mutations become fixed within a species through the population genetic process equals the rate at which mutations occur in a single genome, independently of the population size. For species such as primates, we expect most of the genome to be evolving neutrally, since the genomes of these species consist mainly of “junk” DNA that is unlikely to be under selection.
Assuming that the sequences are mainly evolving neutrally, and that mutations occur at a regular rate, we can estimate the number of mutations that occurred between two species using so-called substitution models. These compensate for recurrent mutations, that is, multiple mutations at the same genomic site, and translate the number of observed differences between two sequences into the expected number of mutations that occurred.
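To make the idea of a correction for recurrent mutations concrete, here is a sketch using the Jukes-Cantor model, the simplest substitution model. The notes above do not commit to a particular model, so the choice of Jukes-Cantor is my assumption for illustration; in practice richer models are typically used.

```python
import math


def jukes_cantor_distance(proportion_different):
    """Expected mutations per site given the observed proportion of differing sites.

    Jukes-Cantor assumes all four nucleotides are equally frequent and all
    changes equally likely. Under the model the observed proportion of
    differences saturates at 3/4, so larger values cannot be corrected.
    """
    if not 0 <= proportion_different < 0.75:
        raise ValueError("proportion of differing sites must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4 / 3) * proportion_different)


# With 10% of sites differing, the corrected distance is slightly larger
# than 0.1, because some sites have been hit more than once.
print(jukes_cantor_distance(0.10))
```

The correction is negligible for very similar sequences and grows as the sequences become more divergent, which is why it matters more for deep calibration points than for recent splits.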
Since mutations enter the sequences through a chemical/physical process, the assumption of a regular rate is not far-fetched, and in general there is a close correlation between the divergence of species inferred from fossil evidence and the number of mutations estimated from substitution models. The rate of substitution does seem to vary somewhat between divergent species groups, with a slow-down in apes compared to old world monkeys and with slight variations even within different primate groups [3]. Within a group of closely related species, however, such as the great apes, the evidence generally seems to justify the molecular clock assumption [3].
There is one important caveat, however: we might be able to estimate the number of mutations that occurred, but if we do not know the rate at which new mutations occur, we cannot translate the number of mutations into years of divergence.
Calibrating the molecular clock
To translate the number of mutations that occurred in a time interval into the length of that interval in years, we need to know either the rate at which mutations occur or how long the time interval was. We need to calibrate the molecular clock.
The approach typically taken is to have a calibration point, a point in time where we are reasonably sure we know the divergence time of two sequences in years, and use the number of mutations between the two sequences to give us the number of mutations per year.
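The calibration step can be sketched in a few lines of Python. The function names and all numbers below are hypothetical and chosen only to show the mechanics: a known split time turns a number of mutations into a rate, and that rate then turns another number of mutations into a date.

```python
def calibrate_rate(mutations_per_site, calibration_time_years):
    """Mutations per site per year along a single lineage.

    The mutations separating two sequences accumulated on both branches,
    so the total elapsed time is twice the divergence time.
    """
    return mutations_per_site / (2 * calibration_time_years)


def date_split(mutations_per_site, rate_per_site_per_year):
    """Divergence time in years implied by a calibrated rate."""
    return mutations_per_site / (2 * rate_per_site_per_year)


# Hypothetical inputs for illustration only.
calibration_divergence = 0.06   # mutations per site between the calibration pair
calibration_time_years = 25e6   # assumed age of the calibration split
target_divergence = 0.015       # mutations per site between the pair to be dated

rate = calibrate_rate(calibration_divergence, calibration_time_years)
estimate_years = date_split(target_divergence, rate)
print(f"rate: {rate:.2e} per site per year")
print(f"estimated split: {estimate_years / 1e6:.2f} million years ago")
```

Everything downstream inherits the uncertainty in the assumed calibration time, since the rate is only as good as the date it was calibrated against.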
If we pick a point far enough back in time, the relative difference between the sequence divergence and the species divergence will be small. The difference between the two will on average be 2Ne generations, which might be a difference of hundreds of thousands of years; that is relatively little if the species divergence is measured in millions of years.
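A quick calculation shows how the relative overestimate shrinks for older splits. In this sketch I borrow the human values from earlier in the notes (Ne of 10,000 and a 20-year generation time) as a stand-in for the ancestral population, which is an assumption; real ancestral populations may well have been larger.

```python
def relative_overestimate(effective_size, generation_time_years, species_divergence_years):
    """Expected excess of sequence divergence over species divergence, as a fraction.

    The excess is the expected coalescence time in the ancestral species,
    2 * Ne generations, converted to years.
    """
    coalescence_years = 2 * effective_size * generation_time_years
    return coalescence_years / species_divergence_years


# Ancestral Ne and generation time are assumptions borrowed from the human values.
for divergence_years in (1_000_000, 6_000_000, 25_000_000):
    fraction = relative_overestimate(10_000, 20, divergence_years)
    print(f"{divergence_years:>10} years: {fraction:.1%}")
```

With these numbers the excess is a large fraction of a one-million-year split but only a few percent of a split tens of millions of years old.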
Of course, we cannot go so far back in time that the mutation rate has changed, so there is a trade-off between the relative difference between sequence divergence and species divergence on the one hand and how conserved the mutation rate is on the other.
For dating the evolution of great apes, one calibration point is the divergence between old world monkeys and apes (lesser apes and great apes) within the catarrhines. Based on fossil evidence we expect the split to be between ~20 million years ago (mya) and ~30 mya [1]. That is, we have fossils indicating that the split had occurred by ~20 mya, and fossils at ~30 mya that are believed to be older than the split.
Only the younger of these dates, the lower bound on the age of the split, is firm evidence about the split time, however. The lack of older fossil evidence is not evidence that the split occurred more recently than ~30 mya. Absence of evidence, after all, is not evidence of absence.
Still, it gives us a tentative calibration point, with a relative uncertainty of ~30% of the divergence of the two groups of species.
Consequences of incorrect calibration
The genetic (sequence) divergence between two genomes is an upper bound on the species divergence, but as a consequence of the calibration problem, genetic estimates of divergence can turn out to underestimate the speciation times.
If the calibration point underestimates the number of years since the species split, the number of mutations per year will be overestimated. Consequently, the genetic estimates, while overestimating the species split in number of mutations, will underestimate the years separating genomes [1].
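The effect is easy to see numerically: under a strict clock, every date obtained from a calibration point scales in direct proportion to the age assumed for that point. In the sketch below the function name and the ratio of sequence divergences are hypothetical; the three calibration ages span the fossil-based interval for the old world monkey / ape split.

```python
def estimated_split_mya(relative_divergence, calibration_age_mya):
    """Date a split from its sequence divergence relative to the calibration pair.

    relative_divergence is the estimated number of mutations separating the
    pair of interest divided by the number separating the calibration pair.
    Under a strict clock, time is proportional to the number of mutations.
    """
    return relative_divergence * calibration_age_mya


# Hypothetical ratio of divergences, chosen only for illustration.
relative_divergence = 0.25

for calibration_age_mya in (20.0, 25.0, 30.0):
    split = estimated_split_mya(relative_divergence, calibration_age_mya)
    print(f"calibration at {calibration_age_mya} mya -> split at {split} mya")
```

If the true calibration age is at the old end of the interval but a younger age is assumed, every estimated date comes out too recent by the same factor, regardless of how carefully the number of mutations was estimated.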
This has consequences for our inference about the evolution of great apes and the relationship between humans and our ancestors. Calibrating the molecular clock based on an old world monkey / ape divergence of ~25 mya, a time point in the middle of the expected divergence interval, will put fossils such as Ardipithecus, Orrorin and Sahelanthropus before the split between human and chimpanzee, while a calibration point based on a ~30 mya divergence of old world monkeys and apes would put the same fossils after the split between human and chimpanzee, potentially on the lineage leading to humans [1,4].
Conversely, assuming that Sahelanthropus is on the human-specific lineage puts the human-chimpanzee split in the range of 6-7 mya. Using this as a calibration point, the ape / old world monkey divergence is estimated at ~27 mya for the lower end of the calibration interval and ~36 mya for the upper end [3].
Incorrect calibration of the molecular clock can thus turn what should be an upper bound into an underestimate of the sequence divergence when measured in years rather than in number of mutations. An underestimate of an upper bound tells us next to nothing about the true value unless we have some grasp of how tight the bounds are, but unfortunately this is the best knowledge we currently have about the divergence times of species.
Our best approach to alleviating this problem is working out the uncertainties in the upper and lower bounds, and that way discarding extreme consequences of the calibration.
From population genetics theory we can make inferences about the relative overestimation caused by the sequence divergence within the coalescence process and disentangle the species divergence from the sequence divergence [5,6]. From this we can tighten the intervals consistent with the fossil record.
References
- Steiper, M.E. & Young, N.M. Timing primate evolution: Lessons from the discordance between molecular and paleontological estimates. Evol Anthropol 17, 179-188 (2008).
- Hein, J., Schierup, M.H. & Wiuf, C. Gene genealogies, variation and evolution: A primer in coalescent theory. Oxford University Press (2005).
- Steiper, M.E. & Young, N.M. Primate molecular divergence dates. Mol Phylogenet Evol 41, 384-394 (2006).
- Stauffer, R.L., Walker, A., Ryder, O.A., Lyons-Weiler, M. & Hedges, S.B. Human and ape molecular clocks and constraints on paleontological hypotheses. J Hered 92, 469-474 (2001).
- Dutheil, J.Y. et al. Ancestral population genomics: The coalescence hidden Markov model approach. Genetics 183, 259-274 (2009).
- Hobolth, A., Christensen, O.F., Mailund, T. & Schierup, M.H. Genomic relationship and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov Model. PLoS Genet 3 (2007).

