On gene trees and species trees

Last week I reviewed a paper on inferring species trees based on gene trees, and I so wanted to write about it here, but of course I have to patiently wait until the paper is published.

However, an application note on the topic appeared today in Bioinformatics (advance access), and there was another one a few months back, so this gives me an excuse to write a few words about species trees and gene trees.

The relationship between gene trees and species trees is one of my own research interests, although not the inference of the trees.  In our CoalHMM work (Hobolth et al. 2007), we use the relationship between gene trees to infer information about the speciation events.  Much more on that another day, though.

Species trees and gene trees

When you think about phylogenetic inference, you typically think about the relationship between species in a tree.  So, for instance, the relationship between human, chimp, and gorilla would group human and chimp together and have gorilla as an outgroup.

This is the relationship between the species, but it is not the whole story.  There is population genetics going on within the branches of this tree, which we can model as a coalescence process.  This is a continuous-time approximation of the Wright-Fisher process, viewed backwards in time, that is mathematically easier to work with, but for the points I will make here it might be easier to think of the Wright-Fisher process.

The Wright-Fisher process is a very simple mathematical model of the evolution of a population.  It says that we have a set of discrete non-overlapping generations, where each new generation is sampled from the previous one at random with replacement.  So you start out with a set of N individuals in the first generation, and then you create the next generation by selecting a parent from the first generation at random, N times, and copying it to the next generation.

For the next generation you do the same, but this time you sample from the second generation (the one you just created)…

…and you continue this process for as many generations as you need.
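The sampling scheme just described can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the population size of 10 and the 20 generations are made up for the example, and each individual simply carries the label of the first-generation founder it descends from.

```python
import random


def next_generation(parents, rng):
    """Create a new generation of the same size by sampling parents with replacement."""
    return [rng.choice(parents) for _ in parents]


def simulate(n_individuals, n_generations, seed=0):
    """Run a Wright-Fisher process and return every generation that was created."""
    rng = random.Random(seed)
    # Label each founder by its index; descendants inherit the label of their parent.
    population = list(range(n_individuals))
    history = [population]
    for _ in range(n_generations):
        population = next_generation(population, rng)
        history.append(population)
    return history


# Illustrative values only: 10 individuals followed for 20 generations.
history = simulate(n_individuals=10, n_generations=20)
surviving_founders = set(history[-1])
print(len(surviving_founders), "of 10 founders still have descendants")
```

Because sampling is with replacement, some founders leave no descendants while others leave several, so the number of surviving founder labels can only shrink over time. That loss of lineages, seen backwards in time, is what the coalescence process describes.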

This is how the process runs within a population.

When you have a speciation event, part of the population branches off from the rest, for some reason or other, and from then on individuals in each of the two species can only sample parents from their own species.

An example with two speciation events is shown below:

This process, running inside the species tree, has two consequences: DNA divergence times do not correspond to speciation times, and the topologies for the “individuals” do not necessarily correspond to the species topology.

The first is obvious when you think about it.  The speciation event is the most recent time after which no individuals in two separate species can sample from the same individuals in the previous generation, but that does not mean that the most recent common ancestor of two individuals in separate species is found exactly at the speciation event.  It can be much more ancient than that.

If you know the speciation time, say from the fossil record, you do not necessarily know the divergence time of the DNA.  Conversely, if you use the molecular clock to date the split between two species, you are not dating the actual speciation time but the DNA divergence time; the speciation time is likely to be more recent.
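To get a feeling for the size of the gap, here is a rough sketch. It assumes the simple Wright-Fisher model from above with N (haploid) individuals in the ancestral population, so two lineages that enter the ancestral species pick the same parent with probability 1/N in each generation. The speciation time of 500 generations and N = 100 are made-up numbers, not estimates for any real species.

```python
import random


def divergence_time(speciation_time, n_ancestral, rng):
    """Generations back to the common ancestor of two lineages from different species."""
    generations = speciation_time  # the lineages cannot meet before the speciation event
    while True:
        generations += 1
        # In the ancestral population, both lineages pick the same parent with probability 1/N.
        if rng.random() < 1.0 / n_ancestral:
            return generations


rng = random.Random(1)
# Illustrative values only: speciation 500 generations ago, 100 ancestral individuals.
times = [divergence_time(500, 100, rng) for _ in range(2000)]
mean_time = sum(times) / len(times)
print("speciation time: 500, mean DNA divergence time:", round(mean_time))
```

Under these assumptions the waiting time in the ancestral population is geometric with mean N, so on average the DNA divergence is about N generations older than the speciation event, and for some segments it is much older.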

That the topology can differ from the species tree can be seen if you consider two speciation events close in time.  Consider two “individuals”, one from each of the two most closely related species.  These can have a most recent common ancestor in their shared ancestral species, in the time between the first and the second speciation event,

or they can have a most recent common ancestor further back in time than the first speciation event, in which case an “individual” from the third species might share a more recent common ancestor with one of them.
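The two cases above can be simulated directly. The sketch below is my own illustration and makes a few assumptions that are not in the text: species A and B are the closest relatives and C is the outgroup, the ancestral population of A and B has N haploid individuals, and the waiting time until two lineages coalesce is approximated by an exponential distribution with mean N generations.

```python
import random

SPECIES_TOPOLOGY = "((A,B),C)"


def gene_topology(internal_branch, n_ancestral, rng):
    """Sample the topology of one gene tree for the species tree ((A,B),C)."""
    # Waiting time, in generations, for the A and B lineages to coalesce.
    wait = rng.expovariate(1.0 / n_ancestral)
    if wait < internal_branch:
        # They found a common ancestor between the two speciation events.
        return SPECIES_TOPOLOGY
    # Otherwise all three lineages enter the oldest population, where each of the
    # three pairs is equally likely to be the first to coalesce.
    return rng.choice(["((A,B),C)", "((A,C),B)", "((B,C),A)"])


def discordance_fraction(internal_branch, n_ancestral, replicates, seed=0):
    """Fraction of sampled gene trees whose topology differs from the species tree."""
    rng = random.Random(seed)
    mismatches = sum(
        gene_topology(internal_branch, n_ancestral, rng) != SPECIES_TOPOLOGY
        for _ in range(replicates)
    )
    return mismatches / replicates


# Illustrative values only: a short and a long interval between the speciation events.
print(discordance_fraction(internal_branch=500, n_ancestral=1000, replicates=20000))
print(discordance_fraction(internal_branch=10000, n_ancestral=1000, replicates=20000))
```

Under these assumptions the expected fraction of mismatching gene trees is (2/3)·exp(-T/N), where T is the number of generations between the two speciation events. It approaches two thirds when the events are very close in time and essentially vanishes when they are far apart.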

Just to avoid confusion, when I say “individual” I don’t actually mean individual (which is why I put the word in quotes).  There are no present day humans more related to chimps than others, although you sometimes get that impression.

The time since the speciation event is such that all humans (or chimps or gorillas) will share common ancestors much more recent than the speciation events.

The process involves recombinations, however, so if we trace a single individual’s genealogy back in time, the nucleotides will split apart and join up again in a stochastic process,

and at the time of the speciation event they will be distributed on a number of different chromosomes (“individuals”)

and it is these DNA chunks that can end up having different topologies than the species topology.

Different segments of the genome will have different divergence times and possibly different topologies.

When we talk about gene trees (in contrast to species trees), we are talking about the trees for the individual segments of our genome, and when they differ significantly from the species tree (in either branch lengths or topology) inferring the species tree can be problematic.

Inferring species trees and gene trees

The two applications that I used as an excuse for writing this post concern inferring species trees from gene trees, or jointly with gene trees.  Both take statistical approaches; one is Bayesian, the other Maximum Likelihood.

The first method, BEST (Liu 2008), jointly estimates gene trees and the species tree from alignments.  The idea is that the species tree puts constraints on the coalescence times of the gene trees: they must be compatible with the species tree, so lineages from two species cannot join up in a gene tree more recently than the speciation event, and the distribution of each gene tree is given by the underlying coalescence process.  Conversely, the gene trees put constraints on the species tree (the same constraint about coalescence times).  So you can sample one tree while keeping the other fixed, and then use an MCMC framework to sample over trees.
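The constraint works in both directions, which a small sketch can make concrete. This is not code from BEST; the data structure (a dictionary from species pairs to times), the function names and all the times are hypothetical and only serve to illustrate the compatibility rule described above.

```python
def pair(a, b):
    """An unordered pair of species names, usable as a dictionary key."""
    return frozenset((a, b))


def is_compatible(gene_coalescence, species_divergence):
    """A gene tree is compatible if no pair coalesces more recently than its species split."""
    return all(gene_coalescence[p] >= t for p, t in species_divergence.items())


def divergence_upper_bounds(gene_trees):
    """Each species split can be no older than the youngest coalescence seen for that pair."""
    pairs = gene_trees[0].keys()
    return {p: min(tree[p] for tree in gene_trees) for p in pairs}


HC, HG, CG = pair("human", "chimp"), pair("human", "gorilla"), pair("chimp", "gorilla")

# Hypothetical times in arbitrary units, made up for the example.
species = {HC: 5.0, HG: 7.0, CG: 7.0}
concordant_gene = {HC: 6.0, HG: 9.0, CG: 9.0}   # topology ((human,chimp),gorilla)
discordant_gene = {HC: 8.5, HG: 7.5, CG: 8.5}   # topology ((human,gorilla),chimp)
impossible_gene = {HC: 4.0, HG: 9.0, CG: 9.0}   # human and chimp join before the split

print(is_compatible(concordant_gene, species))
print(is_compatible(discordant_gene, species))
print(is_compatible(impossible_gene, species))
print(divergence_upper_bounds([concordant_gene, discordant_gene]))
```

Note that the discordant gene tree is perfectly compatible with the species tree, since all of its coalescence events are older than the corresponding speciation events. Only a gene tree in which two species join up more recently than their split is ruled out.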

This way you can sample from the posterior distribution of both species trees and gene trees.  The process is somewhat time consuming, so probably not practical for genome-wide analysis, but nice in its (relative) simplicity nonetheless.

The other tool, STEM (Kubatko et al. 2009), takes a set of gene trees as input and estimates the species tree using a Maximum Likelihood approach.  Again this is done by considering the constraints that the gene trees put on the species tree (together with the underlying coalescence process, of course).

One weakness in both methods is the assumption that the gene trees correspond to true underlying coalescence trees.  This is unlikely to be true for real gene trees for two main reasons:  First, the gene trees are inferred and therefore can be incorrect, and second, in a coalescence process with recombination (the process where incomplete lineage sorting occurs) it is unlikely that recombination events only occur between and not within the regions used to infer the gene trees.

The first problem, that the gene trees can be incorrectly inferred, is less of a problem for BEST, since it jointly infers the trees, so sampling an incorrect tree from time to time can be corrected through the MCMC run.  I could imagine it being more of a problem for STEM.

The second problem, I think, is a major problem for both.  There are two “sub-issues” here.  First, they assume that there is no recombination within a gene, and second, that different genes are independent (essentially, that there is enough recombination between them that they are in linkage equilibrium).

If you only consider genes far apart, the second assumption is probably not much of a problem, but it does mean that the methods cannot scale to whole genome analysis, even if it were computationally feasible, since you cannot have genes close to each other without them being at least slightly correlated.

The first issue is more serious, I think.  If you consider a DNA segment long enough that you can reliably infer its genealogy, it is unlikely that there are no recombinations within that segment, and those are as likely to give you different coalescence times and different topologies as the recombinations between the genes.

The problem is that if you infer a single topology for a region that really has several, you are unlikely to recover any meaningful genealogy.

I did some simulations of this a while back, and the inferred genealogy can be really far from any of the true genealogies in the segment.  Those were simulations with lots of recombinations, though, so how serious it is for the cases they consider, I wouldn’t know.

I plan to look into it, though, when I get the time… which won’t be any time soon, unfortunately, since I am pretty swamped in other projects right now.


  1. L. Liu (2008). BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics, 24 (21), 2542-2543. DOI: 10.1093/bioinformatics/btn484
  2. L. S. Kubatko, B. C. Carstens, L. L. Knowles (2009). STEM: Species Tree Estimation using Maximum likelihood for gene trees under coalescence. Bioinformatics. DOI: 10.1093/bioinformatics/btp079


Author: Thomas Mailund

My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.

9 thoughts on “On gene trees and species trees”

  1. Well…
    I’m not sure that you have any relations to THAT ape, but I will give you that there must be someone in your family who mated with a primate… (at some point)

  2. I did some simulations of this a while back, and the inferred genealogy can be really far from any of the true genealogies in the segment. That were simulations with lots of recombinations, though, so how serious it is for the cases they consider, I wouldn’t know.

    I plan to look into it, though, when I get the time… which won’t be any time soon, unfortunately, since I am pretty swamped in other projects right now.

    I’d sure like to see you get back to it. I know of one published comparison of tree inference methods on something close to molecular data using Avida, an evolutionary simulator where one knows the actual genealogy of the opcode programs that are evolved. It’s here, unfortunately behind a pay wall. The abstract:

    Phylogenetic trees group organisms by their ancestral relationships. There are a number of distinct algorithms used to reconstruct these trees from molecular sequence data, but different methods sometimes give conflicting results. Since there are few precisely known phylogenies, simulations are typically used to test the quality of reconstruction algorithms. These simulations randomly evolve strings of symbols to produce a tree, and then the algorithms are run with the tree leaves as inputs. Here we use Avida to test two widely used reconstruction methods, which gives us the chance to observe the effect of natural selection on tree reconstruction. We find that if the organisms undergo natural selection between branch points, the methods will be successful even on very large time scales. However, these algorithms often falter when selection is absent.

  3. Thanks for the link.

    I hadn’t thought about the interference between tree inference and selection, to be honest. I know that recombination messes up the tree inference, and that changing topologies can look a lot like selection if you test for selection afterwards, but I am surprised that selection itself can mess up the tree inference.

    I’ll read the paper and see what happens. Thanks again!

  4. Pingback: The Panda's Thumb
  5. Thank you for your post. Most instructive. I have a rather basic question regarding precise vocabulary. You seem to use the term «topology» to refer merely to the pattern of association among OTUs, e.g., in Newick format (a,(b,c)), excluding branch lengths. In which case, the terms genealogy (with relation to genes) or phylogeny (with relation to taxa) would be the proper terms to describe a tree that includes both the association between OTUs (the topology) and branch lengths. Is this the case, or does the term «topology» already imply branch lengths? Thanks

  6. André: I’m not sure I’m entirely consistent in my writing, but in general I try to use topology to only mean the clustering (so ignoring branch lengths).

    There is a lot of information in the branch lengths, both on gene trees (genealogies) and species trees (phylogenies) – don’t get me wrong – but by topology I only mean the grouping.

    In the example in the post, the branch lengths tell you a lot about the ancestral effective population size, but only if you also see different topologies can you be in any confusion about the phylogenetic relationship.

    Varying branch lengths alone cannot question which species are more closely related to others. Varying topologies can.

    Short branch lengths in the phylogeny, combined with large effective population sizes, are likely to result in different topologies.

    Long branch lengths in the phylogeny, even with large effective population sizes, will still give you gene topologies that match the species topology.

    The genealogies – gene trees including branch lengths – will never match the species tree (the phylogeny) regardless of the phylogeny branch lengths.

    I hope that makes my terminology clearer. If not, let me know.

    Also, let me know if I am being inconsistent with the terminology. I am not as careful with it on the blog as I would normally be :)
