Archive for February 12th, 2009

On gene trees and species trees

Thursday, February 12th, 2009

Last week I reviewed a paper on inferring species trees based on gene trees, and I so wanted to write about it here, but of course I have to patiently wait until the paper is published.

However, today an application note on the topic appeared in Bioinformatics (advance access) -- and there was another application note a few months back -- so this gives me an excuse to write a few words about species trees and gene trees.

The relationship between gene trees and species trees is one of my own research interests, although not the inference of the trees.  In our CoalHMM work (Hobolth et al. 2007), we use the relationship between gene trees to infer information about the speciation events.  Much more on that another day, though.

Species trees and gene trees

When you think about phylogenetic inference, you typically think about the relationship between species in a tree.  So, for instance, the relationship between human, chimp, and gorilla would group human and chimp together and have gorilla as an outgroup.

This is the relationship between the species, but it is not the whole story.  There is population genetics going on within the branches of this tree, which we can model as a coalescence process.  This is a continuous-time approximation of the Wright-Fisher process, viewed backwards in time, that is mathematically easier to work with, but for the points I will make here it might be easier to think of the Wright-Fisher process.

The Wright-Fisher process is a very simple mathematical model of the evolution of a population.  It says that we have a set of discrete non-overlapping generations, where each new generation is sampled from the previous one at random with replacement.  So you start out with a set of N individuals in the first generation, and then you create the next generation by selecting, N times, a parent at random from the first generation and copying it to the next generation.

For the next generation you do the same, but this time you sample from the second generation (the one you just created)...

...and you continue this process for as many generations as you need.

This is how the process runs within a population.
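As a rough illustration, the sampling scheme above can be sketched in a few lines of Python.  The function names and the choice to store each generation as a list of parent indices are assumptions made for this example only; they are not taken from any particular library.

```python
import random


def wright_fisher(n_individuals, n_generations, seed=None):
    """Simulate parent choices in a Wright-Fisher population.

    Returns a list `parents` where parents[g][i] is the index (in
    generation g) of the parent of individual i in generation g + 1.
    """
    rng = random.Random(seed)
    parents = []
    for _ in range(n_generations):
        # Each of the N new individuals picks a parent uniformly at
        # random, with replacement, from the previous generation.
        parents.append([rng.randrange(n_individuals)
                        for _ in range(n_individuals)])
    return parents


def generations_to_common_ancestor(parents, i, j):
    """Trace individuals i and j of the last generation backwards.

    Returns the number of generations back at which their lineages
    first meet, or None if they do not meet within the simulation.
    """
    for back, generation in enumerate(reversed(parents), start=1):
        i, j = generation[i], generation[j]
        if i == j:
            return back
    return None


if __name__ == "__main__":
    sim = wright_fisher(n_individuals=20, n_generations=200, seed=42)
    print(generations_to_common_ancestor(sim, 0, 1))
```

Tracing lineages backwards like this is exactly the view the coalescence process takes: you only follow the ancestors of the individuals you have sampled.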

When you have a speciation event, part of the population branches off from the rest -- for some reason or other -- and from then on individuals in each of the two species can only sample their parents from within their own species.

An example with two speciation events is shown below:

This process, running inside the species tree, has two consequences: DNA divergence times do not correspond to speciation times, and the topologies for the "individuals" do not necessarily correspond to the species topology.

The first is obvious when you think about it.  The speciation event is the point in time after which no individuals in two separate species can sample from the same individuals in the previous generation, but that does not mean that the most recent common ancestor of two individuals in separate species is found exactly at the speciation event.  It can be much more ancient than that.

If you know the speciation time, say from the fossil record, you do not necessarily know the divergence time of the DNA.  Conversely, if you use the molecular clock to date the split between two species, you are not dating the actual speciation time but the DNA divergence time; the speciation time is likely to be more recent.

That the topology can be different from the species tree can be seen if you consider two speciation events close in time.  Consider two "individuals", one from each of the two most closely related species.  These can have a most recent common ancestor in their shared ancestral species in the time between the first and the second speciation event,

or they can have a most recent common ancestor further back in time than the first speciation event, in which case an "individual" from the third species might share a more recent common ancestor with one of them.
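How often this happens depends on how close the two speciation events are.  Under the standard coalescent for three species with one sampled lineage each, the probability that the gene tree has the same topology as the species tree is 1 - (2/3) exp(-T/N), where T is the number of generations between the two speciation events.  The sketch below assumes the haploid Wright-Fisher model described above, with N individuals in the ancestral population (for a diploid population N would be replaced by 2N).  The function names are made up for this example, and the simulation uses the usual exponential approximation of the waiting time rather than simulating every generation.

```python
import math
import random


def p_congruent(t_generations, n_individuals):
    """Probability that the gene tree matches the species tree.

    t_generations: generations between the two speciation events.
    n_individuals: haploid size N of the ancestral population.
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-t_generations / n_individuals)


def simulate_congruent(t_generations, n_individuals, replicates, seed=None):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        # Waiting time until the two lineages from the sister species
        # coalesce: approximately exponential with mean N generations.
        wait = rng.expovariate(1.0 / n_individuals)
        if wait < t_generations:
            # They found each other between the speciation events.
            hits += 1
        elif rng.randrange(3) == 0:
            # All three lineages reach the deeper ancestral population;
            # each of the three pairs is equally likely to join first.
            hits += 1
    return hits / replicates


if __name__ == "__main__":
    for t in (0, 1000, 10000, 50000):
        print(t, p_congruent(t, 10000),
              simulate_congruent(t, 10000, 20000, seed=1))
```

When the two speciation events coincide (T = 0) all three topologies are equally likely, and the species topology only dominates once T is large compared to N.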

Just to avoid confusion, when I say "individual" I don't actually mean individual (which is why I put the word in quotes).  There are no present day humans more related to chimps than others -- although you sometimes get that impression.

The time since the speciation event is such that all humans (or chimps or gorillas) will share common ancestors much more recent than the speciation events.

The process involves recombinations, however, so if we trace a single individual's genealogy back in time, the nucleotides will split apart and join up again in a stochastic process,

and at the time of the speciation event they will be distributed on a number of different chromosomes ("individuals")

and it is these DNA chunks that can end up having different topologies than the species topology.

Different segments of the genome will have different divergence times and possibly different topologies.

When we talk about gene trees (in contrast to species trees), we are talking about the trees for the individual segments of our genome, and when they differ significantly from the species tree (in either branch lengths or topology) inferring the species tree can be problematic.

Inferring species trees and gene trees

The two applications that I used as an excuse for writing this post concern inferring species trees from gene trees, or jointly with gene trees.  Both take statistical approaches; one is Bayesian, the other maximum likelihood.

The first method, BEST (Liu 2008), jointly estimates gene trees and the species tree from alignments.  The idea is that the species tree puts constraints on the coalescence times of the gene trees: they must be compatible with the species tree, so two species in a gene tree cannot join up more recently than the speciation event, and the distribution of the gene trees is given by the underlying coalescence process.  Conversely, the gene trees put constraints on the species tree (the same constraint about coalescence times).  So you can sample one tree while keeping the other fixed, and then use an MCMC framework to sample over trees.

This way you can sample from the posterior distribution of both species trees and gene trees.  The process is somewhat time consuming, so probably not practical for genome-wide analysis, but nice in its (relative) simplicity nonetheless.
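The compatibility constraint itself is easy to state in code.  The following is only a sketch of the constraint, not of how BEST implements it; the function name, the dictionary representation of the trees and the times in the example are all made up for illustration.

```python
def is_compatible(species_split_times, gene_coalescence_times):
    """Check a gene tree against a species tree.

    Both arguments map a pair of species, frozenset({a, b}), to a time
    before present: the speciation time of the pair in the species
    tree, and the coalescence time of the pair in the gene tree.
    The gene tree is compatible if no pair coalesces more recently
    than the two species split.
    """
    return all(
        gene_coalescence_times[pair] >= split
        for pair, split in species_split_times.items()
    )


# Hypothetical times in arbitrary units.
HC = frozenset({"human", "chimp"})
HG = frozenset({"human", "gorilla"})
CG = frozenset({"chimp", "gorilla"})

species = {HC: 5.0, HG: 7.0, CG: 7.0}

same_topology = {HC: 6.0, HG: 9.0, CG: 9.0}
other_topology = {HG: 8.0, HC: 10.0, CG: 10.0}  # human and gorilla join first
too_recent = {HC: 4.0, HG: 9.0, CG: 9.0}        # joins before the split

print(is_compatible(species, same_topology))   # True
print(is_compatible(species, other_topology))  # True
print(is_compatible(species, too_recent))      # False
```

Note that the second gene tree has a different topology from the species tree and is still compatible with it; the constraint is only on the coalescence times.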

The other tool, STEM (Kubatko et al. 2009), takes a set of gene trees as input and estimates the species tree in a maximum likelihood approach.  Again this is done by considering the constraints that the gene trees put on the species tree (together with the underlying coalescence process, of course).

One weakness in both methods is the assumption that the gene trees correspond to true underlying coalescence trees.  This is unlikely to be true for real gene trees for two main reasons:  First, the gene trees are inferred and therefore can be incorrect, and second, in a coalescence process with recombination (the process where incomplete lineage sorting occurs) it is unlikely that recombination events only occur between and not within the regions used to infer the gene trees.

The first problem, that the gene trees can be incorrectly inferred, is less of a problem for BEST, since it jointly infers the trees, so sampling an incorrect tree from time to time can be corrected through the MCMC run.  I could imagine it being more of a problem for STEM.

The second problem, I think, is a major problem for both.  There are two "sub-issues" here.  First, they assume that there is no recombination within a gene, and second, that different genes are independent (essentially that there is enough recombination between them that they are in linkage equilibrium).

If you only consider genes far apart, the second assumption is probably not much of a problem, but it does mean that the methods cannot scale to whole genome analysis, even if it were computationally feasible, since you cannot have genes close to each other without them being at least slightly correlated.

The first issue is more serious, I think.  If you consider a DNA segment long enough that you can reliably infer its genealogy, it is unlikely that there are no recombinations within that segment, and those are as likely to give you different coalescence times and different topologies as the recombinations between the genes.

The problem is that if you infer a single topology for a region that really has several, you are unlikely to recover any meaningful genealogy.

I did some simulations of this a while back, and the inferred genealogy can be really far from any of the true genealogies in the segment.  Those were simulations with lots of recombinations, though, so how serious it is for the cases they consider, I wouldn't know.

I plan to look into it, though, when I get the time... which won't be any time soon, unfortunately, since I am pretty swamped in other projects right now.

Citations

  1. L. Liu (2008). BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics, 24 (21), 2542-2543. DOI: 10.1093/bioinformatics/btn484
  2. L. S. Kubatko, B. C. Carstens, L. L. Knowles (2009). STEM: Species Tree Estimation using Maximum likelihood for gene trees under coalescence. Bioinformatics. DOI: 10.1093/bioinformatics/btp079

--


The economics of text books

Thursday, February 12th, 2009

Andrew Gelman was shocked to learn the price of text books. This led to an interesting discussion on the economics of text books.

The price of a text book

Personally I have noticed that the price seems to be inversely proportional to how broad the topic of the book is.  The more specialised the topic, the higher the price.  1312 pages of general biology in a hardback for $129.30 (about $0.10 per page) versus 290 pages of coalescence theory for $73.38 (about $0.25 per page).  576 pages (and a CD-ROM) of linear algebra for $97.93 ($0.17 per page) versus 272 pages of category theory for $124 (about $0.46 per page).

These are just some random examples, of course, but it is the general impression I have.  I would love to see some hard numbers, though.

Anyway, it makes sense that books that are likely to sell fewer copies are more expensive, to cover the cost of producing the book.  At least if there is an up front cost of producing the book that must be recovered.

I mailed those of my colleagues who I know have written text books.  They all tell me that they got a small percentage of the sales (and all agreed that it was very little compared to the time it takes to write a book).  So it is not that there is an up front salary for the author that drives up the price of books.

That leaves all the editing and typesetting to recover, and that might be quite a lot, although I know from experience that the proof readers are paid very little.  Last time I got a few hundred dollars and a few free copies of the book.  I guess it can also be cheaper per copy to mass produce a very large number of copies compared to a few, but I refuse to believe that the difference is more than a factor of two.

So I don't think the price is particularly driven by the expenses in producing a book.

The economics of text books

One point, mentioned in several of the comments at Gelman's blog, is that the demand for text books is very inelastic.  The professor chooses a text book -- and might have no idea about the price of the book -- and the students are pretty much stuck with that book and cannot shop for cheaper alternatives.

There is a market in the quality of the text book -- I assume that the professor will always go for what he considers the best book -- but not in the price.

I know that I get free "teacher's copies" of the text books I use, and I very rarely think about the price unless it is brought to my attention, which really only happens if it is extremely expensive compared to the average text book.

In such a market, there is little reason to lower the price.

It also explains why more specialised topics will have more expensive books.  There are fewer of the books, so less competition between the publishers.  Since price is not that important for the choice of text book, the publishers can raise it and still sell it.

Things to think about when you write a text book

Everyone I've asked tells me that what the author gets paid for writing a text book is not worth the time it takes to write it.

Not that it is not worth the time to write a text book, don't get me wrong, but the income from it is no motivation at all.

I have thought about writing a text book from time to time, when I am not satisfied with the choices I have for a class I teach.  This is always what has motivated me (but apparently not enough to actually do it).

I could imagine that it is the same for most text book writers.  At least the many who only write a few in a long academic career, not the few who crank out text books all the time.

Now I'm thinking that if I ever end up writing a text book, why not just make it an e-book that I can give away for free?  It is not as if the money I could earn from selling it is worth much.

Sure, there is a certain charm in holding a paper book, but print-on-demand can take care of that, I guess (does anyone know the price of that?)

There is, of course, also the proof reading and editing, but I am confident that if you put a free e-book out there, you will get plenty of feedback if people start using it.  And from my experience with first edition text books and their errata lists, that is really the important feedback.

It is worth thinking about, at least.

--
