Mapping human genetic ancestry

Yesterday I read the paper

Mapping human genetic ancestry. I. Ebersberger et al. Molecular Biology and Evolution 2007, 24(10):2266-2276

that addresses the same problem that we addressed in

Genomic relationships and speciation times of human, chimpanzee and gorilla inferred from a coalescent hidden Markov model. A. Hobolth et al. PLoS Genetics 2007, 3(2): doi:10.1371/journal.pgen.0030007

although with a different approach to the problem and a lot more data.

Tracing the ancestry of the human genome

Species trees and gene trees

Human’s closest living relatives are the chimps, and the closest relatives of humans and chimps are the gorillas, but the species are so closely related that not all of the genome follows the species genealogy. Click on the figure on the right for an illustration of this.

The reason this happens is that, as we trace the history of a piece of our DNA back in time, we will necessarily find the most recent common ancestor of humans and chimps further back in time than the speciation time of humans and chimps. If this time is so far back that it also precedes the split between the human/chimp ancestor and the gorilla lineage, then the most recent common ancestor of chimps and gorillas, or of humans and gorillas, might be younger than the most recent common ancestor of all three species.

Looking at the DNA of the three species, we can infer the average time in the past at which the DNA from the different species diverged, and using coalescent theory we can then infer the speciation times.

In Hobolth et al. we approximated the coalescent process with a hidden Markov model, which enabled us to efficiently analyse large alignments of DNA sequences and from them extract the parameters needed to infer speciation times, obtain information about the diversity in the ancestral species, and annotate the alignments with the most likely genealogy, showing, for example, in which parts of our genome we are more closely related to gorillas than to chimps.
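To get a feeling for how often this happens, there is a simple back-of-the-envelope calculation from coalescent theory: a locus fails to follow the species tree when the human and chimp lineages do not coalesce in the human/chimp ancestral population, and in that case two of the three possible topologies are "wrong". A minimal sketch, with entirely made-up parameter values (none of the numbers below are estimates from either paper):

```python
from math import exp

def discordance_probability(delta_t_years, ne, generation_time):
    """Probability that a locus does not follow the species tree."""
    # Going backwards in time, the human and chimp lineages must fail to
    # coalesce during the delta_t years the human/chimp ancestor existed as
    # a separate species (rate 1/(2*Ne) per generation), and two of the
    # three equally likely deeper topologies are then discordant.
    generations = delta_t_years / generation_time
    return (2.0 / 3.0) * exp(-generations / (2.0 * ne))

# Entirely made-up illustrative values, not estimates from either paper:
print(discordance_probability(delta_t_years=2_000_000, ne=50_000, generation_time=20))
# -> roughly 0.25, i.e. about a quarter of the genome in this toy setting
```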

CoalHMM

We applied this to five large alignments, which, however, cover only a small fraction of the entire genome.

In Ebersberger et al. they construct a large number of (smaller) alignments covering the entire genome and consider the same problem when analysing these data. The statistical model they use is slightly less sophisticated than ours, but that is probably more than compensated for by the much larger data set. What they do is construct a single tree for each alignment, picking the most likely phylogeny among the possible ones and discarding alignments where there is no clear winner. They then use coalescent theory to infer the diversity of the ancestral species, measured as the parameter Ne (the effective population size), essentially doing the same as we did. As far as I understand, though, they equate DNA divergence time with speciation time, which strictly speaking is incorrect (I might be wrong here; I didn’t check in detail how they inferred the time interval between the human/chimp divergence and the divergence from the gorilla).
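The distinction matters because, looking backwards in time, two lineages can only begin to coalesce once we are past the speciation event, and in the ancestral population they then wait on average 2·Ne generations before finding a common ancestor. Here is a minimal sketch of that first-order relationship, purely as an illustration; the CoalHMM estimates speciation times jointly with the other parameters, so this is not how either paper actually computes its numbers:

```python
def expected_divergence(speciation_years, ne, generation_time):
    """E[sequence divergence time] = speciation time + 2*Ne generations (in years)."""
    return speciation_years + 2 * ne * generation_time

def naive_speciation_from_divergence(divergence_years, ne, generation_time):
    """Crude inverse: subtract the expected ancestral coalescence time."""
    return divergence_years - 2 * ne * generation_time

# With, say, Ne = 65,000 and a 25-year generation time, the ancestral
# coalescent adds 2 * 65,000 * 25 = 3.25 Myr on top of the speciation time,
# so equating divergence with speciation overstates the speciation time by
# roughly that amount.
print(expected_divergence(speciation_years=4_000_000, ne=65_000, generation_time=25))
```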

Diversity of the human-chimp ancestor along the human genome

A plot of diversity is shown on the bottom half of the figure on the right. Click to enlarge.

Their estimates of Ne are pretty close to ours (65,000 ± 30,000). This is pretty good news, considering that the results were obtained using different methods (although based on the same underlying theory).

However, the assumptions that go into the two analyses differ. To calibrate the molecular clock we both use the divergence time from the orangutan, but where we used 18 million years (Myr), they use 16 Myr. The generation time also matters, in particular for the Ne and speciation time estimates, and where we used 25 years as the average generation time, they used 20 years. Our generation time is a bit on the high side (Ebersberger et al. call it unrealistically high), but we really had no idea what to use here when we did our analysis.
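To see how these two assumptions propagate, here is a simplified scaling argument; this is not the CoalHMM machinery, and the per-site divergences are placeholders, not numbers from either paper. The orangutan calibration fixes the per-year mutation rate, so divergence time estimates scale linearly with the calibration point, while the Ne estimate, which needs a per-generation rate, additionally scales inversely with the generation time.

```python
def calibrated_estimates(d_hc, d_ancestral, d_ho, t_orang_years, generation_time):
    """Convert per-site divergences into a divergence time (years) and an Ne."""
    mu_per_year = d_ho / (2.0 * t_orang_years)   # calibrate the clock on the orangutan split
    t_divergence = d_hc / (2.0 * mu_per_year)    # human/chimp sequence divergence, in years
    mu_per_gen = mu_per_year * generation_time   # the generation time enters here
    ne = d_ancestral / (4.0 * mu_per_gen)        # extra ancestral divergence = 4*Ne*mu per site
    return t_divergence, ne

# d_hc: human/chimp divergence per site; d_ancestral: the part of it attributable
# to coalescence in the ancestral population; d_ho: human/orangutan divergence.
# All three values are placeholders, not numbers from either paper.
for t_cal, g in [(18e6, 25), (16e6, 20)]:
    t_div, ne = calibrated_estimates(d_hc=0.012, d_ancestral=0.003, d_ho=0.034,
                                     t_orang_years=t_cal, generation_time=g)
    print(f"calibration {t_cal / 1e6:.0f} Myr, generation time {g} y: "
          f"divergence {t_div / 1e6:.1f} Myr, Ne {ne:,.0f}")
```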

How much have these assumptions affected the results?

With help from Julien Dutheil, who has just rewritten the entire CoalHMM software, I got the numbers our analysis would have produced had we used the assumptions from Ebersberger et al. The human/chimp divergence we then estimate is 5.1 Myr (as opposed to their 5.7), and the divergence with the gorilla we estimate at 8.4 Myr (as opposed to their 7.8). That is close enough to essentially be the same. When we then estimate the speciation times, where the generation time assumption is important, we get 3.6 Myr for the human/chimp speciation and 5.7 Myr for the (human/chimp)/gorilla speciation. These look very recent to me, and I don’t fully trust them. I have seen numbers around 4 Myr for the human/chimp split, but the fossil record just doesn’t match that.

For the Ne estimate, the new assumptions give us a whopping 81,000 for the human/chimp ancestor. I’m not really sure why; using their assumptions actually moves us further from their estimate. This is probably worth looking into.


Citations, for Research Blogging:

Ebersberger, I., Galgoczy, P., Taudien, S., Taenzer, S., Platzer, M., von Haeseler, A. (2007). Mapping Human Genetic Ancestry. Molecular Biology and Evolution, 24(10), 2266-2276.

Hobolth, A., Christensen, O.F., Mailund, T., Schierup, M.H. (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model. PLoS Genetics, 3(2), e7. DOI: 10.1371/journal.pgen.0030007

A tale of two citations

There’s a paper in Nature following up on the duplicate citation paper I wrote about earlier.

A tale of two citations
Mounir Errami and Harold Garner
Nature 451, 397-399 (24 January 2008)

This time they analysed Medline for duplicated publications, reaching roughly the same conclusions as before.

Alignment bias in genomics

I have previously written a bit about how optimal alignment algorithms introduce an alignment bias and even done some work on it myself (currently submitted for publication, so I cannot link to it yet). Today I saw a paper in the current issue of Science addressing the same problem.

A summary can be found in

Lining Up to Avoid Bias

Antonis Rokas

Science Vol. 319. no. 5862, pp. 416 – 417

and the full paper (probably requires a subscription) is

Alignment Uncertainty and Genomic Analysis

Karen M. Wong, Marc A. Suchard, and John P. Huelsenbeck

Science Vol. 319. no. 5862, pp. 473 – 476

The problem with alignments

I’ve already described the problem in the previous post, where I used the examples from Gerton Lunter’s paper

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

G. A. Lunter

Bioinformatics 2007; DOI: 10.1093/bioinformatics/btm185

although there the focus was on the problems with indels. Of course, without indels there simply isn’t any problem with alignment, so that is not as unreasonable as it might sound.

Essentially, the problem is that we use algorithms to infer optimal alignments and then treat these alignments as absolute truth, ignoring the uncertainty in the inference.
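A tiny sketch of just how non-unique "the" optimal alignment can be: a standard Needleman-Wunsch dynamic program, extended to also count how many distinct alignments reach the optimal score. The scoring scheme and sequences are arbitrary toy choices, not anything from the papers discussed here.

```python
def count_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score plus the number of alignments achieving it."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]   # best score for a[:i] vs b[:j]
    count = [[1] * (m + 1) for _ in range(n + 1)]   # how many alignments reach it
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            best = max(diag, up, left)
            c = 0
            if diag == best:
                c += count[i - 1][j - 1]
            if up == best:
                c += count[i - 1][j]
            if left == best:
                c += count[i][j - 1]
            score[i][j], count[i][j] = best, c
    return score[n][m], count[n][m]

print(count_optimal_alignments("AA", "A"))           # -> (0, 2): two equally good gap placements
print(count_optimal_alignments("ACGTTA", "ACTTGA"))  # a slightly longer toy pair
```

Any analysis that conditions on a single optimal alignment silently conditions on one arbitrary member of that set of equally good alternatives.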

In Wong et al. they compare seven different alignment algorithms, consider typical evolutionary analyses (inference of phylogenies and detection of selection) based on the inferred alignments, and see large variability in the results depending on which alignment method was used.

The solution proposed in Wong et al. is the same as Gerton proposes: statistical alignment methods. Quoting Wong et al.:

The problem of alignment uncertainty in genomic studies, identified here, is not a problem of sloppy analysis. Many comparative genomics studies are carefully performed and reasonable in design. However, even carefully designed and carried out analyses can suffer from these types of problems because the methods used in the analysis of the genomic data do not properly accommodate alignment uncertainty in the first place.

In a comparative genomics study, we advocate that alignment be treated as a random variable, and inferences of parameters of interest to the genomicist, such as the amount of nonsynonymous divergence or the phylogeny, consider the different possible alignments in proportion to their probability.

Of course, this is what the statistical alignment people in Oxford have been trying to do for years, and it is not quite as easy as it sounds.
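To make the quote concrete, here is a toy version of treating the alignment as a random variable: enumerate every alignment of two very short sequences, give each a weight derived from its score (a crude stand-in for a proper alignment model such as a pair HMM), and report a quantity of interest both from the single best alignment and as a weighted average over all alignments. Everything here, sequences, scoring and weights, is invented for illustration, and brute-force enumeration obviously does not scale to real data.

```python
from math import exp

def alignments(a, b):
    """Enumerate every global alignment as a list of (x, y) columns, '-' meaning a gap."""
    if not a and not b:
        yield []
        return
    if a and b:
        for rest in alignments(a[1:], b[1:]):
            yield [(a[0], b[0])] + rest
    if a:
        for rest in alignments(a[1:], b):
            yield [(a[0], '-')] + rest
    if b:
        for rest in alignments(a, b[1:]):
            yield [('-', b[0])] + rest

def score(columns, match=2, mismatch=-1, gap=-2):
    return sum(gap if '-' in col else (match if col[0] == col[1] else mismatch)
               for col in columns)

def substitutions(columns):
    """The quantity of interest here: number of mismatched (non-gap) columns."""
    return sum(1 for x, y in columns if '-' not in (x, y) and x != y)

a, b = "ACGT", "AGT"                          # tiny toy sequences
alns = list(alignments(a, b))
weights = [exp(score(aln)) for aln in alns]   # crude stand-in for P(alignment | data)
total = sum(weights)

best = max(alns, key=score)                   # the "alignment as truth" view
posterior_mean = sum(w * substitutions(aln)
                     for w, aln in zip(weights, alns)) / total

print(f"{len(alns)} alignments enumerated")
print(f"substitutions in the single best alignment: {substitutions(best)}")
print(f"weighted mean number of substitutions:      {posterior_mean:.2f}")
```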


Citations, for Research Blogging:

Rokas, A. (2008). GENOMICS: Lining Up to Avoid Bias. Science, 319(5862), 416-417. DOI: 10.1126/science.1153156

Wong, K.M., Suchard, M.A., Huelsenbeck, J.P. (2008). Alignment Uncertainty and Genomic Analysis. Science, 319(5862), 473-476. DOI: 10.1126/science.1151532

Petri Nets and Systems Biology

I did my PhD in the Coloured Petri Nets group here in Aarhus, but since I finished my PhD and changed my research field to bioinformatics, I haven’t touched Petri nets. Now that I’m starting to get interested in systems biology, I seem to run into them again and again.

A lot of people seem interested in modelling biological systems in various types of Petri nets. I sort of see why. Petri nets have been used in modelling a wide variety of dynamic systems, so why not apply them to biological systems as well?

The papers I’ve read have left me a bit disappointed, though.

Most of the papers I’ve read seem to just add extensions to Petri nets for the sake of adding extensions (or as an excuse to get a paper published, take your pick). I won’t blame Petri nets or systems biology for this, though; I’ve seen it in every single formalism I’ve read up on. It is a kind of feature creep that we computer scientists just cannot seem to avoid. Whenever we see an ever so tiny potential problem with a computer language, we immediately find a way to fix it, and rarely do we worry whether the fix is worth the trouble, or whether what it fixes was really much of a problem in the first place. For some reason, we just cannot keep things simple.

Anyway, I’m going to ignore this particular problem in this post and instead ask, what do Petri nets add to systems biology?

What do Petri nets add to systems biology?

Most papers I’ve read seem to just use Petri nets as a front-end for some other formalism. Some use Petri nets as a graphical way of specifying differential equations; others use (stochastic) Petri nets simply as a front-end for Gillespie simulations.

If Petri nets are just used as a front-end for something else, is that really the way to go? Sure, it is probably easier to get a feeling for a system by looking at a network than by looking at a set of coupled differential equations, but the lack of compositionality in Petri nets does mean that a lot of systems end up as “spaghetti networks”, so perhaps a process algebra would be a better approach here? The same goes for setting up stochastic simulations.
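For concreteness, here is roughly what the “Petri net as a front-end for Gillespie simulations” view amounts to, as a minimal sketch: places hold token counts (molecule numbers), each transition is a reaction with a mass-action rate, and firing a transition updates the marking according to its pre- and post-arcs. The toy gene-expression net and all rate constants below are made up for illustration.

```python
import random

# Each transition: (name, rate constant, pre-arcs, post-arcs).
transitions = [
    ("transcription",  1.0,  {},             {"mRNA": 1}),
    ("translation",    0.5,  {"mRNA": 1},    {"mRNA": 1, "Protein": 1}),
    ("mRNA decay",     0.2,  {"mRNA": 1},    {}),
    ("protein decay",  0.05, {"Protein": 1}, {}),
]

def propensity(rate, pre, marking):
    """Mass-action propensity for the simple (weight-1) arcs used here."""
    a = rate
    for place, needed in pre.items():
        if marking[place] < needed:
            return 0.0
        a *= marking[place]
    return a

def gillespie(marking, t_end):
    """Stochastic simulation: repeatedly pick the next firing time and transition."""
    t, trace = 0.0, [(0.0, dict(marking))]
    while t < t_end:
        props = [propensity(rate, pre, marking) for _, rate, pre, _ in transitions]
        total = sum(props)
        if total == 0.0:                      # dead marking, nothing can fire
            break
        t += random.expovariate(total)        # waiting time to the next firing
        pick, acc = random.uniform(0.0, total), 0.0
        for (name, rate, pre, post), a in zip(transitions, props):
            acc += a
            if pick <= acc:                   # this transition fires: move the tokens
                for place, k in pre.items():
                    marking[place] -= k
                for place, k in post.items():
                    marking[place] += k
                break
        trace.append((t, dict(marking)))
    return trace

random.seed(1)
print(gillespie({"mRNA": 0, "Protein": 0}, t_end=100.0)[-1])
```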

Don’t get me wrong, I do like Petri nets. I especially like their graphical representation. I am just a bit disappointed that this is all they seem to bring to the table.

So far, the only paper I’ve seen that actually uses “good old” Petri net theory (p- and t-invariants, in this case) is the paper I read today (and incidentally the paper that got me thinking about all of this):

Petri net-based method for the analysis of the dynamics of signal propagation in signaling pathways

Simon Hardy and Pierre N. Robillard

Bioinformatics Advance Access published online on November 22, 2007

and even that paper seems to me to basically be modelling with differential equations. I might be wrong here, though; I haven’t read it that thoroughly yet. They do extract some signalling information from simulations, and I didn’t quite work out to what degree they need the net structure (as opposed to just the set of ODEs) to extract it.
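For readers who, like me, have not thought about invariants for a while: p-invariants are weightings of the places that every transition firing leaves unchanged (conservation laws, such as the total amount of a protein across its modification states), and t-invariants are firing-count vectors that return the net to its original marking (candidate cycles or pathways). A minimal sketch with a toy phosphorylation cycle of my own, not the network from the paper:

```python
from sympy import Matrix

places      = ["P", "P*"]              # a protein and its phosphorylated form
transitions = ["phosphorylation", "dephosphorylation"]

# Incidence matrix C: rows are places, columns are transitions,
# C[i, j] = tokens of place i produced minus consumed when transition j fires.
C = Matrix([
    [-1,  1],   # P  : consumed by phosphorylation, produced by dephosphorylation
    [ 1, -1],   # P* : produced by phosphorylation, consumed by dephosphorylation
])

p_invariants = C.T.nullspace()   # vectors y with y^T C = 0: conserved token sums
t_invariants = C.nullspace()     # vectors x with C x = 0: firings that restore the marking

print("p-invariants:", [list(v) for v in p_invariants])  # [[1, 1]]: P + P* is constant
print("t-invariants:", [list(v) for v in t_invariants])  # [[1, 1]]: fire both once, back to start
```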

Am I reading the wrong papers, or just missing the point here? If you know of any papers I really ought to read to get the point of using Petri nets in systems biology, then please let me know!

Practising what I preach?

Now, after reading through all this, it might surprise you that I will be using stochastic Petri nets in the systems biology class I teach with Carsten Wiuf this term.

It is not so much because of the nets, though. We want to use stochastic processes in the class and compare them with differential equation modelling, contrasting the stochastic models with deterministic (“large numbers of molecules”) models. The textbook we use

Stochastic modelling for systems biology

Darren J. Wilkinson

Chapman & Hall/CRC, 2006.

uses stochastic Petri nets, and that made the choice for us.

But is it the right choice? Would I actually use Petri nets myself if I had to model a biological system?

Honestly, I do not know. I am very familiar with nets from my PhD work, but not in the context of systems biology. I wouldn’t know the right tools to use. I could easily end up programming simulators or numerical analysis methods myself, and then I am not sure I would gain much from starting out with nets.

I guess I really need to read up on Petri nets in systems biology… but where should I start?