On gene trees and species trees
Last week I reviewed a paper on inferring species trees based on gene trees, and I so wanted to write about it here, but of course I have to patiently wait until the paper is published.
However, today there appeared an application note in Bioinformatics (advance access) on the topic -- and there was another application note a few months back -- so this gives me an excuse to write a few words about species trees and gene trees.
The relationship between gene trees and species trees is one of my own research interests, although not the inference of the trees. In our CoalHMM work (Hobolth et al. 2007), we use the relationship between gene trees to infer information about the speciation events. Much more on that on a later day, though.
Species trees and gene trees
When you think about phylogenetic inference, you typically think about the relationship between species in a tree. So, for instance, the relationship between human, chimp, and gorilla would group human and chimp together and have gorilla as an outgroup.
This is the relationship between the species, but it is not the whole story. There is population genetics going on within the branches of this tree, which we can model as a coalescence process. This is a generalisation of the Wright-Fisher process that is mathematically easier to work with, but for the points I will make here it might be easier to think of the Wright-Fisher process.
The Wright-Fisher process is a very simple mathematical model of the evolution of a population. It says that we have a set of discrete non-overlapping generations, where each new generation is sampled from the previous one at random with replacement. So you start out with a set of N individuals in the first generation, and then you create the next generation by selecting, N times, a parent at random from the first generation and copying it to the next generation.
For the next generation you do the same, but this time you sample from the second generation (the one you just created)...
...and you continue this process for as many generations as you need.
This is how the process runs within a population.
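The sampling scheme just described can be sketched in a few lines of Python. This is only a minimal illustration, assuming a haploid population of constant size; the function names, the population size, and the seed are made up for the example.

```python
import random

def wright_fisher(n_individuals, n_generations, seed=None):
    """Simulate parent choices under the Wright-Fisher model.

    Returns a list `parents` where parents[g][i] is the index of the
    parent (in generation g) of individual i in generation g + 1.
    """
    rng = random.Random(seed)
    parents = []
    for _ in range(n_generations):
        # N times: pick a parent uniformly at random, with replacement.
        parents.append([rng.randrange(n_individuals)
                        for _ in range(n_individuals)])
    return parents

def generations_to_common_ancestor(parents, i, j):
    """Trace individuals i and j of the last generation back in time.

    Returns the number of generations until the two lineages meet in the
    same parent, or None if they do not meet within the simulated span.
    """
    for steps, generation in enumerate(reversed(parents), start=1):
        i, j = generation[i], generation[j]
        if i == j:
            return steps
    return None

if __name__ == "__main__":
    pedigree = wright_fisher(n_individuals=10, n_generations=200, seed=1)
    print(generations_to_common_ancestor(pedigree, 0, 1))
```

Tracing two individuals backwards through the sampled parents is exactly the "looking back in time" view that the coalescence process formalises.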
When you have a speciation event, part of the population branches off from the rest -- for some reason or other -- and from then on individuals in each of the two species can only be sampled from parents in the same species.
An example with two speciation events is shown below:
This process, running inside the species tree, has two consequences: DNA divergence times do not correspond to speciation times, and the topologies for the "individuals" do not necessarily correspond to the species topology.
The first is obvious when you think about it. The speciation event is the most recent time after which no individuals in two separate species can sample from the same individuals in the previous generation, but that does not mean that the most recent common ancestor of two individuals in separate species is found exactly at the speciation event. It can be much more ancient than that.
If you know the speciation time, say from the fossil record, you do not necessarily know the divergence time of the DNA. Conversely, if you use the molecular clock to date the split between two species, you are not dating the actual speciation time but the DNA divergence time; the speciation time is likely to be more recent.
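To get a feel for how much older the DNA divergence can be, here is a small simulation sketch. It assumes a haploid Wright-Fisher ancestral population of constant size, and the function names and parameter values are purely illustrative. Under these assumptions two lineages pick the same parent with probability 1/N per generation, so the divergence time should average roughly the speciation time plus N generations.

```python
import random

def sample_divergence_time(speciation_time, ancestral_size, rng):
    """Generations back to the common ancestor of two gene copies,
    one sampled from each of two species."""
    # The lineages cannot meet more recently than the speciation event.
    t = speciation_time
    while True:
        t += 1
        # In the shared ancestral population each lineage picks a parent
        # among `ancestral_size` copies; they coalesce on the same pick.
        if rng.randrange(ancestral_size) == rng.randrange(ancestral_size):
            return t

def mean_divergence_time(speciation_time, ancestral_size,
                         replicates=5000, seed=0):
    """Average simulated divergence time over many replicates."""
    rng = random.Random(seed)
    total = sum(sample_divergence_time(speciation_time, ancestral_size, rng)
                for _ in range(replicates))
    return total / replicates

if __name__ == "__main__":
    # Hypothetical numbers: speciation 100 generations ago, N = 20.
    print(mean_divergence_time(speciation_time=100, ancestral_size=20))
```

The gap between the two times grows with the size of the ancestral population, which is why a molecular-clock date tends to overshoot the speciation time.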
That the topology can be different from the species tree can be seen if you consider two speciation events close in time. Consider two "individuals", one from each of the two most closely related species. These can have a most recent common ancestor in their shared ancestral species in the time between the first and the second speciation event
or they can have a most recent common ancestor further back in time than the first speciation event, in which case an "individual" from the third species might share a more recent common ancestor with one of them.
Just to avoid confusion, when I say "individual" I don't actually mean individual (which is why I put the word in quotes). There are no present day humans more related to chimps than others -- although you sometimes get that impression.
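How often the two outcomes happen can be quantified. The sketch below uses the standard three-species coalescent result, which is not derived in this post: with one lineage per species, the gene tree matches the species tree with probability 1 - (2/3)exp(-t), where t is the time between the two speciation events in coalescent units. The scaling used here (generations divided by N for a haploid population of constant size) and all names and numbers are assumptions made for illustration.

```python
import math
import random

def concordance_probability(generations_between, population_size):
    """P(gene tree topology matches the species tree) for three species,
    one sampled lineage per species."""
    tau = generations_between / population_size  # coalescent units
    return 1.0 - (2.0 / 3.0) * math.exp(-tau)

def simulate_concordance(generations_between, population_size,
                         replicates=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    concordant = 0
    for _ in range(replicates):
        # Waiting time for the two sister lineages to coalesce:
        # exponential with mean `population_size` generations.
        wait = rng.expovariate(1.0 / population_size)
        if wait < generations_between:
            # They meet between the two speciation events.
            concordant += 1
        elif rng.randrange(3) == 0:
            # Three lineages reach the deeper ancestor; each of the three
            # pairs is equally likely to coalesce first.
            concordant += 1
    return concordant / replicates

if __name__ == "__main__":
    print(concordance_probability(50, 100))
    print(simulate_concordance(50, 100))
```

When the two speciation events are very close together the probability drops towards 1/3, meaning all three topologies become about equally likely.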
The time since the speciation event is such that all humans (or chimps or gorillas) will share common ancestors much more recent than the speciation events.
The process involves recombinations, however, so if we trace a single individual's genealogy back in time, the nucleotides will split apart and join up again in a stochastic process,
and at the time of the speciation event they will be distributed on a number of different chromosomes ("individuals")
and it is these DNA chunks that can end up having different topologies than the species topology.
Different segments of the genome will have different divergence times and possibly different topologies.
When we talk about gene trees (in contrast to species trees), we are talking about the trees for the individual segments of our genome, and when they differ significantly from the species tree (in either branch lengths or topology) inferring the species tree can be problematic.
Inferring species trees and gene trees
The two applications that I used as an excuse for writing this post concern inferring species trees from gene trees, or jointly with gene trees. Both take statistical approaches; one is Bayesian, the other Maximum Likelihood.
The first method, BEST (Liu 2008), jointly estimates gene trees and the species tree from alignments. The idea is that the species tree puts constraints on the coalescence times of the gene trees: they must be compatible with the species tree, so two species in a gene tree do not join up more recently than the speciation event, and the distribution of the gene tree is given by the underlying coalescence process. Conversely, the gene trees put constraints on the species tree (the same constraint about coalescence times). So you can sample one tree while keeping the other fixed, and then use an MCMC framework to sample over trees.
This way you can sample from the posterior distribution of both species trees and gene trees. The process is somewhat time-consuming, so probably not practical for genome-wide analysis, but nice in its (relative) simplicity nonetheless.
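The compatibility constraint is easy to state in code. The following is only a sketch of the constraint itself, not of how BEST works internally: trees are reduced to tables of pairwise times, and the `pair` helper, the function name, and the data layout are all invented for the example.

```python
def pair(a, b):
    """Order-independent key for a pair of species."""
    return frozenset((a, b))

def is_compatible(gene_coalescence, speciation_times):
    """True if no two species coalesce in the gene tree more recently
    than they split in the species tree.

    Both arguments map pair(a, b) -> time before present, in the same
    (arbitrary) units.
    """
    return all(gene_coalescence[p] >= split
               for p, split in speciation_times.items())

if __name__ == "__main__":
    # Hypothetical times, for illustration only.
    species = {pair("human", "chimp"): 5.0,
               pair("human", "gorilla"): 7.0,
               pair("chimp", "gorilla"): 7.0}
    gene = {pair("human", "chimp"): 6.0,
            pair("human", "gorilla"): 9.0,
            pair("chimp", "gorilla"): 9.0}
    print(is_compatible(gene, species))
```

A sampler that alternates between the two kinds of trees would only accept proposals for which a check like this holds.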
The other tool, STEM (Kubatko et al. 2009), takes a set of gene trees as input and estimates the species tree in a Maximum Likelihood approach. Again this is done by considering the constraints that the gene trees put on the species tree (together with the underlying coalescence process, of course).
One weakness in both methods is the assumption that the gene trees correspond to true underlying coalescence trees. This is unlikely to be true for real gene trees for two main reasons: First, the gene trees are inferred and therefore can be incorrect, and second, in a coalescence process with recombination (the process where incomplete lineage sorting occurs) it is unlikely that recombination events only occur between and not within the regions used to infer the gene trees.
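To make the constraint from a whole set of gene trees concrete: since no pair of species can coalesce more recently than it split, the smallest coalescence time seen for a pair across all gene trees is an upper bound on that pair's speciation time. The sketch below computes only these bounds. It is not a description of STEM's actual likelihood calculation, and the data layout and names are assumptions for the example.

```python
def speciation_time_upper_bounds(gene_trees):
    """Upper bound on each pairwise speciation time.

    `gene_trees` is a list of dicts mapping frozenset({a, b}) to the
    coalescence time of species a and b in that gene tree. The bound for
    a pair is its minimum coalescence time over all gene trees.
    """
    bounds = {}
    for tree in gene_trees:
        for species_pair, time in tree.items():
            current = bounds.get(species_pair, float("inf"))
            bounds[species_pair] = min(current, time)
    return bounds

if __name__ == "__main__":
    hc = frozenset(("human", "chimp"))
    hg = frozenset(("human", "gorilla"))
    cg = frozenset(("chimp", "gorilla"))
    # Two hypothetical gene trees with different topologies.
    trees = [{hc: 6.0, hg: 9.0, cg: 9.0},
             {hc: 8.5, hg: 8.5, cg: 7.5}]
    print(speciation_time_upper_bounds(trees))
```

Note how a single wrongly inferred gene tree with a too-recent coalescence would tighten the bound for everything else, which hints at why errors in the input trees matter.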
The first problem, that the gene trees can be incorrectly inferred, is less of a problem for BEST, since it jointly infers the trees, so sampling an incorrect tree from time to time can be corrected through the MCMC run. I could imagine it being more of a problem for STEM.
The second problem, I think, is a major problem for both. There are two "sub-issues" here. First, they assume that there is no recombination within a gene, and second, that different genes are independent (essentially, that there is enough recombination between them that they are in linkage equilibrium).
If you only consider genes far apart, the second assumption is probably not much of a problem, but it does mean that the method cannot scale to whole genome analysis, even if it were computationally feasible, since you cannot have genes close to each other without them being at least slightly correlated.
The first issue is more serious, I think. If you consider a DNA segment long enough that you can reliably infer its genealogy, it is unlikely that there are no recombinations within that segment, and those are as likely to give you different coalescence times and different topologies as the recombinations between the genes.
The problem is that if you infer a single topology for a region that really has more than one, you are unlikely to recover any meaningful genealogy.
I did some simulations of this a while back, and the inferred genealogy can be really far from any of the true genealogies in the segment. Those were simulations with lots of recombinations, though, so how serious it is for the cases they consider, I wouldn't know.
I plan to look into it, though, when I get the time... which won't be any time soon, unfortunately, since I am pretty swamped in other projects right now.
- L. Liu (2008). BEST: Bayesian estimation of species trees under the coalescent model Bioinformatics, 24 (21), 2542-2543 DOI: 10.1093/bioinformatics/btn484
- L. S. Kubatko, B. C. Carstens, L. L. Knowles (2009). STEM: Species Tree Estimation using Maximum likelihood for gene trees under coalescence Bioinformatics DOI: 10.1093/bioinformatics/btp079