Posts Tagged ‘phylogenetic inference’

On gene trees and species trees

Thursday, February 12th, 2009

Last week I reviewed a paper on inferring species trees based on gene trees, and I so wanted to write about it here, but of course I have to patiently wait until the paper is published.

However, today there appeared an application note in Bioinformatics (advanced access) on the topic — and there was another application note a few months back — so this gives me an excuse to write a few words about speciation trees and gene trees.

The relationship between gene trees and species trees is one of my own research interests, although not the inferrence of the trees.  In our CoalHMM work (Hobolth et al 2007), we use the relationship between gene trees to infer information about the speciation events.  Much more on that on a later day, though.

Species trees and gene trees

When you think about phylogenetic inference, you typically think about the relationship between species in a tree.  So, for instance, the relationship between human, chimp, and gorilla would group human and chimp together and have gorilla as an outgroup.

This is the relationship between the species, but it is not the whole story.  There is population genetics going on within the branches of this tree, which we can model as a coalescence process.  This is a generalisation of the Wright-Fisher process that is mathematically easier to work with, but for the points I will make here it might be easier to think of the Wright-Fisher process.

The Wright-Fisher process is a very simple mathematical model of the evolution of a population.  It says that we have a set of discrete non-overlapping generations, where each new generation is sampled from the previous by sampling at random with replacement.  So you start out with a set of of N individuals in the first generation and then you create the next generation by N times selecting a parent from the first population at random, and copy him to the next generation.

For the next generation you do the same, but this time you sample from the second generation (the one you just created)…

…and you continue this process for as many generations as you need.

This is how the process runs within a population.

When you have a speciation event, parts of the population branches off the other part — for some reason or other — and you can sample individuals in the two separate species only from individuals in the same species.

An example with two speciation events is shown below:

This process, running inside the species tree, has two consequences: DNA divergence times do not correspond to speciation times, and the toplogies for the “individuals” do not necessarily correspond to the species topology.

The first is obvious when you think about it.  The speciation even is the most recent time after which no individuals in two separate species can sample from the same individuals in the previous generation, so but that does not mean that when you consider the most recent ancestor of two individuals in separate species, that that ancestor is found exactly at the speciation event.  It can be much more ancient than that.

If you know the speciation time, say from the fossil record, you do not necessarily know the divergence time of the DNA.  Conversely, if you use the molecular clock to date the split between two species, you are not dating the actual speciation time but the DNA divergence time; the speciation time is likely to be more recent.

That the toplogy can be different than the species tree can be seen if you consider two speciation events close in time.  Consider two “individuals”, one from each of the two closest related species.  These can have a most recent common ancestor in their shared common ancestor in the time between the first and the second speciation event

or they can have a most recent common ancestor further back in time than the first speciation event, in which case an “indivdual” from the third species might share a common ancestor with one of them more recent.

Just to avoid confusion, when I say “individual” I don’t actually mean individual (which is why I quote the first).  There are no present day humans more related to chimps than others — although you sometimes get that impression.

The time since the speciation event is such that all humans (or chimps or gorillas) will share common ancestors much more recent than the speciation events.

The process involves recombinations, however, so if we trace a single individual’s genealogy back in time, the nucleotides will split apart and join up again in a stocastic process,

and at the time of the speciation event they will be distributed on a number of different chromosomes (“individuals”)

and it is these DNA chunks that can end up having different topologies than the species topology.

Different segments of the genome will have different divergence times and possibly different toplogies.

When we talk about gene trees (in contrast to species trees), we are talking about the trees for the individual segments of our genome, and when they differ significantly from the species tree (in either branch lengths or topology) inferring the species tree can be problematic.

Inferring species trees and gene trees

The two applications that I used as an excuse for writing this post concerns inferring species trees from gene trees, or jointly with gene trees.  Both takes statistical approaches; one Bayesian the other Maximum Likelihood.

The first method, BEST (Liu 2008) jointly estimates gene trees and the species tree from alignments.  The idea is that the species tree puts constraints on the coalescence times of the gene trees (they must be compatible with the species tree, so two species in a gene tree do not join up more recent than the speciation event, and the distribution of the tree is given by the underlying coalescence process) and conversely the gene trees put constraints on the species tree (the same constraint about coalescence times) so you can sample one tree when keeping the other fixed, and then use an MCMC framework to sample over trees.

This way you can sample over the posterior probability of both species trees and gene trees.  The process is somewhat time consuming, so probably not practical for genome wide analysis, but nice in its (relative) simplicity nonetheless.

The other tool, STEM (Kubatko et al. 2009) takes a set of gene trees as input and estimates the species tree in a Maximul Likelihood approach.  Again this is done by considering the constraints that the gene trees put on the species tree (together with the underlying coalescence process, of course).

One weakness in both method is the assumption that the gene trees correspond to true underlying coalescence trees.  This is unlikely to be true for real gene trees for two main reasons:  First, the gene trees are inferred and therefore can be incorrect, and second, in a coalescence process with recombination (the process where incomplete lineage sorting occur) it is unlikely that recombination events only occur between and not within the regions used to infer the gene trees.

The first problem, that the gene trees can be incorrectly inferred, is less of a problem for BEST, since it jointly infers the trees, so sampling an incorrect tree from time to time can be corrected through the MCMC run.  I could imagine it being more of a problem for STEM.

The second problem, I think, is a major problem for both.  There are two “sub-issues” here.  One, they assume that there is no recombination within a gene, and second, that different genes are independent (essentially have enough recombination between them that they are in linkage equilibrium).

If you only consider genes far apart, the second assumption is probably not much of a problem, but it does mean that the method cannot scale to whole genome analysis, even if it was computationally feasible, since you cannot have genes close to each other without them being at least slightly correlated.

The first issue is more serious, I think.  If you consider a DNA segment long enough that you can reliably infer its genealogy, it is unlikely that there are no recombinations within that segment, and those are as likely to give you different coalescence times and different topologies as the recombinations between the genes.

The problem with that is, that if you infer a single topology for a region that really have more, you are unlikely to recover any meaningful genealogy.

I did some simulations of this a while back, and the inferred genealogy can be really far from any of the true genealogies in the segment.  That were simulations with lots of recombinations, though, so how serious it is for the cases they consider, I wouldn’t know.

I plan to look into it, though, when I get the time… which won’t be any time soon, unfortunately, since I am pretty swamped in other projects right now.

Citations

  1. L. Liu (2008). BEST: Bayesian estimation of species trees under the coalescent model Bioinformatics, 24 (21), 2542-2543 DOI: 10.1093/bioinformatics/btn484
  2. L. S. Kubatko, B. C. Carstens, L. L. Knowles (2009). STEM: Species Tree Estimation using Maximum likelihood for gene trees under coalescence Bioinformatics DOI: 10.1093/bioinformatics/btp079

43-65=-22

Phylogenetic inference under recombination using Bayesian stochastic topology selection

Friday, January 16th, 2009

There’s an interesting paper in the current issue of Bioinformatics that I’ve just finished reading:

Phylogenetic inference under recombination using Bayesian stochastic topology selection

Webb et al. Bioinformatics 25(2) 197-203

Abstract

Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths.

Results: We demonstrate our approach on simulated data and also to the genome of a suspected HIV recombinant strain as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.

The paper presents a new method for analysing sequences that have undergone recombination.

When sequences have not undergone recombination, a nice methodology for analysing them is the PhyloHMM (PDF). With this method, you have a hidden Markov model where the emission probability is determined by a phylogeny, and usually computed using Felsenstein’s pruning algorithm.

When there is recombination, the problem is that there are more than one topology for the underlying phylogeny, and if you do not know the topologies you cannot immediately calculate the emission probabilities.

You can instead model the unknown topologies as hidden states. This approach was taken by Husmeier and McGuire (2003) and is also the approach we take in our CoalHMM method (Hobolth et al 2007).

This approach doesn’t scale, however, since the number of possible toplogies grows super-exponential with the number of species.

In this paper the solve the problem by using only a few topologies as states in the HMM, but sampling over all possible topologies to be used, in an MCMC approach.  Ideally the number of topologies should be variable, but that requires a reversible jump MCMC and they haven’t implemented that.  Still, it seems to work very well.

I remember discussing the problem with both Alex and Chris when I was last in Oxford, but back then it didn’t work so well, so I am happy to read that they’ve solved the problems. Properly handling recombination and changing topologies is important for accurate sequence analysis.


A. Webb, J. M. Hancock, C. C. Holmes (2008). Phylogenetic inference under recombination using Bayesian stochastic topology selection Bioinformatics, 25 (2), 197-203 DOI: 10.1093/bioinformatics/btn607
16-29=-13

How deterministic is neighbour-joining anyway?

Wednesday, July 23rd, 2008

Whenever we try to publish an algorithm for speeding up neighbour-joining, at least one reviewer will ask us if we have checked that our new algorithm generates the same tree as the “canonical” neighbour-joining algorithm does.

This sounds reasonable, but it isn’t really.  The neighbour-joining method is not as deterministic as you’d think.

Well, it is a deterministic algorithm in the sense that the same input will always produce the same output, but equivalent input will not necessarily produce the same output.

The neighbour-joining algorithm

The neighbour-joining algorithm is phylogenetic inference method that takes as input a matrix of pairwise distances between taxa and outputs a tree connecting the taxa. It is a heuristic for producing “minimum evolution” trees, that is trees where the total branch length is as small as possible (while still, of course, reflecting the distances between the input taxa).  A heuristic is needed, since optimising the minimal evolution criteria is computationally intractable, and the neighbour-joining algorithm is a heuristic that strikes a good balance between computational efficiency and inference accuracy.

The heuristic is not a “heuristic optimisation algorithm” in the sense that simulated annealing or evolutionary algorithms is.  It is not randomised. It is a greedy algorithm, that in each step makes an optimal choice, which will hopefully eventually leads to a good solution (although locally optimial choices are not guaranteed to lead to a global optimal choice).

A problem with a greedy algorithm is that local choices can drastically change the final result.  Not all greedy algorithms, of course.  Some are guaranteed to reach a global optimum and if there is only a single global optimum then the local choices will not affect the final outcome (but if there is more than one global optimum the local choices will also influence the final result).

In the neighbour-joining algorithm, the greedy step consists of joining two clusters that minimise a certain criteria.  If two or more pairs minimise the criteria, a choice must be made, and that choice will affect the following steps.

Equivalent distance matrices  does not mean identical input

Unless the choice is resolved randomly, and I’m assuming that it isn’t, then the algorithm is deterministic.  It could, for example, always choose the first or last pair minimising the criteria.  How this choice should be resolved is not specified in the algorithm, so anything goes, really.

If the choice is resolved by picking the first or last pair, then exact form of the input matters.

The distance matrix neighbour-joining uses as input gives us all the pair-wise distances between our taxa, but to represent it we need to order the taxa to get row and column numbers.  This ordering is completely arbitrary, and any ordering gives us an equivalent distance matrix. Not an identical one, though.

If we resolve the local choice in a way that depend on the row and column order, then two equivalent input matrices might produce very different output trees.

An experiment

To see how much this matters in practice, I made a small experiment.  I took 37 distance matrices based on Pfam sequences, with between 100 and 200 taxa.  You can see the matrices here.  There are 39 of them, but I had to throw two of them away –  AIRC and Acyl_CoA_thio — since they had taxons with the same name, and with that I cannot compare the resulting trees.  It’s matrices I created for this paper, but I don’t remember the details… it is not important for the experiment, though.

RF distance for “identical” NJ treesFor each matrix I constructed ten equivalent distance matrices by randomly permuting the order of taxa. Then I used these ten matrices to construct ten neighbour-joining trees and finally computed the Robinson-Foulds between all pairs of these trees.

To see a plot of the result, click on the figure on the left.

As you can see, you can get quite different trees out of equivalent distance matrices and on realistic data too.

I didn’t check if the differences between the trees are correlated with how much support the splits have in the data.  Looking at bootstrap values for the trees might tell us something about that.  Anyway, that is not the take home message I want to give here.  That is that it doesn’t necessarily make sense to talk the neighbour-joining tree for a distance matrix.

Neighbour-joining is less deterministic than you might think!