The 1000 genomes project

I’m absolutely thrilled that we have reached the technological level where it is possible to sequence 1000 genomes just to learn more about human genetic variation.

We have learned a lot from the HapMap project about common variation, and this knowledge has led to an explosion in discoveries of genetic factors in several diseases. With actual sequencing of genomes we should also learn about less common genetic variation, and who knows where that will take us?

I’ve actually known about this project for a while from some of the people involved, but this is the first time I’ve seen it mentioned online, so I thought I would link to it today :)

Mapping human genetic ancestry

Yesterday I read the paper

Mapping human genetic ancestry
I. Ebersberger et al.
Molecular Biology and Evolution 2007 24(10):2266-2276

that addresses the same problem that we addressed in

Genomic relationships and speciation times of human, chimpanzee and gorilla inferred from a coalescent hidden Markov model
A. Hobolth et al.
PLoS Genetics 2007 3(2): doi:10.1371/journal.pgen.0030007

although they take a different approach to the problem and use a lot more data.

Tracing the ancestry of the human genome

Species trees and gene trees

Humans’ closest living relatives are the chimps, and the closest relatives of humans and chimps are the gorillas, but the species are so closely related that not all of the genome follows the species genealogy. The figure on the right illustrates this.

The reason this happens is that as we trace the history of a piece of our DNA back in time, we will necessarily find the most recent common ancestor of humans and chimps further back in time than the speciation time of humans and chimps. If this time is so far back that it also precedes the speciation time of the human/chimp ancestor and the gorilla ancestor, then the most recent common ancestor of chimps and gorillas, or humans and gorillas, might be younger than the most recent common ancestor of all the species.

Looking at the DNA of the three species, we can infer the average time in the past where the DNA splits into the different species, and using coalescent theory we can then infer the speciation times.

In Hobolth et al. we approximated the coalescent process using a hidden Markov model. This enabled us to efficiently analyse large alignments of DNA sequences and from them extract the parameters needed to infer speciation times, obtain information about the diversity in ancestral species, and annotate the alignments with the most likely genealogy, e.g. showing us in which parts of our genome we are more closely related to gorillas than to chimps.
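How often a piece of DNA disagrees with the species tree can be sketched with the standard three-species coalescent result: the human and chimp lineages fail to coalesce in their own ancestral population with probability exp(-t / 2Ne), with t measured in generations, and in that case two of the three equally likely pairings in the deeper ancestor give a tree that differs from the species tree. The snippet below is only a minimal illustration of that formula, assuming a constant diploid population size; the function and parameter names are my own and are not taken from either paper.

```python
import math


def incongruence_probability(years_between_speciations, ne, generation_time_years):
    """Probability that a gene tree differs from the species tree (three species).

    Assumes a constant diploid effective population size `ne` in the
    human/chimp ancestor; all names here are illustrative.
    """
    generations = years_between_speciations / generation_time_years
    # Two lineages coalesce at rate 1/(2*Ne) per generation, so this is the
    # chance that they are still separate at the older speciation event.
    p_no_coalescence = math.exp(-generations / (2.0 * ne))
    # With three lineages left, any pair is equally likely to coalesce first;
    # two of the three pairings contradict the species tree.
    return (2.0 / 3.0) * p_no_coalescence
```

The shorter the time between the two speciation events, or the larger the ancestral population, the closer this gets to its maximum of 2/3.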


We applied our model to five large alignments, which together cover only a small fraction of the entire genome.

In Ebersberger et al. they construct a large number of (smaller) alignments covering the entire genome and consider the same problem in analysing this data.

The statistical model they use is slightly less sophisticated than what we did, but that is probably more than compensated for by the much larger data set. What they do is construct a single tree for each alignment by picking the most likely of the possible phylogenies, discarding alignments where there is no clear winner.

They then use coalescent theory to infer the diversity of the ancestral species, measured as the parameter Ne (effective population size), which is essentially the same as what we did. As far as I understand, however, they equate DNA divergence time with speciation time, which strictly speaking is incorrect (I might be wrong here; I didn’t check in detail how they inferred the time interval between human/chimp divergence and their divergence from the gorilla).
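The "one tree per alignment, discard the unclear ones" idea can be sketched as follows. This is not the procedure from Ebersberger et al.: the log-likelihood margin used as the "clear winner" criterion, the function names, and the example numbers are all assumptions made up for illustration.

```python
from collections import Counter


def pick_topology(log_likelihoods, min_margin=2.0):
    """Return the best-scoring topology, or None if there is no clear winner.

    `log_likelihoods` maps a topology label to its log-likelihood for one
    alignment. `min_margin` is a hypothetical threshold, not a published value.
    """
    ranked = sorted(log_likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_ll), (_, second_ll) = ranked[0], ranked[1]
    if best_ll - second_ll < min_margin:
        return None  # ambiguous alignment: discard it
    return best


def topology_fractions(per_alignment_scores, min_margin=2.0):
    """Fraction of the retained alignments supporting each topology."""
    picks = [pick_topology(s, min_margin) for s in per_alignment_scores]
    kept = [p for p in picks if p is not None]
    counts = Counter(kept)
    return {topology: n / len(kept) for topology, n in counts.items()}


# Made-up scores for three alignments; the second one is ambiguous.
example_scores = [
    {"((H,C),G)": -1000.0, "((H,G),C)": -1010.0, "((C,G),H)": -1012.0},
    {"((H,C),G)": -500.0, "((H,G),C)": -500.5, "((C,G),H)": -503.0},
    {"((H,G),C)": -700.0, "((H,C),G)": -706.0, "((C,G),H)": -705.0},
]
fractions = topology_fractions(example_scores)
```

The fraction of trees that disagree with the species tree is the quantity that coalescent theory then ties to the ancestral population size.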

Diversity of the human-chimp ancestor along the human genome

A plot of diversity is shown on the bottom half of the figure on the right.

Their estimates of Ne are pretty close to ours (65,000 ± 30,000). This is pretty good news, considering that the results come about using different methods (although based on the same underlying theory).

However, the assumptions we put into the analyses differ. To calibrate the molecular clock we both use the divergence time from the orangutan, but where we used 18 million years (Myr) ago they use 16 Myr ago. The generation time is also very important in estimating the divergence, and where we used 25 years as the average generation time they used 20 years. Our estimate of generation time is a bit on the high side (Ebersberger et al. call it unrealistically high), but we really had no idea what to use here when we did our analysis.
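To see why the calibration point matters, here is a rough sketch of how a strict molecular clock turns sequence divergence into time: the rate is fixed by the divergence to the calibration species, so every estimated time scales in proportion to the assumed calibration time. The function name and the divergence values are hypothetical; they are not numbers from either paper.

```python
def clock_time_myr(divergence, calibration_divergence, calibration_time_myr):
    """Convert a sequence divergence into a time under a strict molecular clock.

    Both divergences must be in the same units (e.g. substitutions per site).
    The rate is set by the calibration pair, here the split from the orangutan.
    """
    rate_per_myr = calibration_divergence / calibration_time_myr
    return divergence / rate_per_myr


# Hypothetical divergences, chosen only to show the scaling.
t_with_18 = clock_time_myr(0.01, 0.03, 18.0)
t_with_16 = clock_time_myr(0.01, 0.03, 16.0)
```

Moving the calibration from 18 Myr to 16 Myr shrinks every clock-based time by the same factor of 16/18.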

How much have these assumptions affected the results?

With help from Julien Dutheil, who has just re-written the entire CoalHMM software, I got the numbers our analysis would have obtained had we used the assumptions from Ebersberger et al. The human-chimp divergence we estimate is 5.1 Myr (as opposed to their 5.7), and the divergence with the gorilla we estimate at 8.4 Myr (as opposed to their 7.8). These are close enough to count as essentially the same. When we then estimate the speciation times, where the generation time assumption is important, we get 3.6 Myr for the human/chimp speciation and 5.7 Myr for the (human/chimp)/gorilla speciation. These look very recent to me, and I don’t fully trust them. I have seen numbers around 4 Myr for the closest distance between human and chimp, but the fossil record just doesn’t match that.

For the Ne estimate, the new assumptions give us a whopping 81,000 for the human/chimp ancestor. I’m not really sure why. Using their assumptions moves us further from their estimates. This is probably worth looking into.

Citations, for Research Blogging:

Ebersberger, I., Galgoczy, P., Taudien, S., Taenzer, S., Platzer, M., von Haeseler, A. (2007). Mapping Human Genetic Ancestry. Molecular Biology and Evolution, 24(10), 2266-2276.

Hobolth, A., Christensen, O.F., Mailund, T., Schierup, M.H. (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model. PLoS Genetics, 3(2), e7. DOI: 10.1371/journal.pgen.0030007

A tale of two citations

There’s a paper in Nature following up on the duplicate citation paper I wrote about earlier.

A tale of two citations
Mounir Errami and Harold Garner
Nature 451, 397-399 (24 January 2008)

This time they analysed Medline for duplicated publications, reaching roughly the same conclusions as before.

Alignment bias in genomics

I have previously written a bit about how optimal alignment algorithms introduce an alignment bias and even done some work on it myself (currently submitted for publication, so I cannot link to it yet). Today I saw a paper in the current issue of Science addressing the same problem.

A summary can be found in

Lining Up to Avoid Bias

Antonis Rokas

Science Vol. 319. no. 5862, pp. 416 – 417

and the full paper (probably requires a subscription) is

Alignment Uncertainty and Genomic Analysis

Karen M. Wong, Marc A. Suchard, and John P. Huelsenbeck

Science Vol. 319. no. 5862, pp. 473 – 476

The problem with alignments

I’ve already described the problem in the previous post, where I used the examples from Gerton Lunter’s paper

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

G. A. Lunter

Bioinformatics 2007; DOI: 10.1093/bioinformatics/btm185

although there the focus was on the problems with indels. Of course, without indels there simply isn’t any problem with alignment, so that is not as unreasonable as it might sound.

Essentially, the problem is that we use algorithms to infer optimal alignments and then treat these alignments as absolute truth, ignoring the uncertainty in the inference.

In Wong et al. they compare seven different alignment algorithms and run typical evolutionary analyses (inference of phylogenies and detection of selection) on the inferred alignments. They see large variability in the results depending on which alignment method was used.

The solution proposed in Wong et al. is the same as Gerton proposes: statistical alignment methods. Quoting Wong et al.:

The problem of alignment uncertainty in genomic studies, identified here, is not a problem of sloppy analysis. Many comparative genomics studies are carefully performed and reasonable in design. However, even carefully designed and carried out analyses can suffer from these types of problems because the methods used in the analysis of the genomic data do not properly accommodate alignment uncertainty in the first place.

In a comparative genomics study, we advocate that alignment be treated as a random variable, and inferences of parameters of interest to the genomicist, such as the amount of nonsynonymous divergence or the phylogeny, consider the different possible alignments in proportion to their probability.

Of course, this is what the statistical alignment people in Oxford have been trying for years and it is not quite as easy as it sounds.
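To make the idea in the quote concrete, here is a toy sketch of treating the alignment as a random variable: instead of computing a statistic on one "optimal" alignment, compute it on several candidate alignments and average in proportion to their probabilities. The hard part in practice is obtaining those alignments and probabilities from a statistical alignment model; here they are simply made up, and the function names and the choice of statistic (the fraction of mismatching ungapped columns) are assumptions for illustration.

```python
def p_distance(alignment):
    """Fraction of mismatches among columns where neither sequence has a gap."""
    seq_a, seq_b = alignment
    columns = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in columns) / len(columns)


def expected_statistic(weighted_alignments, statistic):
    """Average a statistic over alignments, weighted by their probabilities."""
    total = sum(prob for _, prob in weighted_alignments)
    return sum(prob * statistic(aln) for aln, prob in weighted_alignments) / total


# Two made-up alignments of the same pair of sequences, with made-up weights.
candidates = [
    (("ACGT-A", "ACGTTA"), 0.75),  # gap placed before the final A
    (("ACGTA-", "ACGTTA"), 0.25),  # gap placed at the end instead
]
averaged = expected_statistic(candidates, p_distance)
best_only = p_distance(candidates[0][0])
```

Here the single most probable alignment shows no substitutions at all, while the average still carries some weight from the alternative gap placement.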

Citations, for Research Blogging:

Rokas, A. (2008). GENOMICS: Lining Up to Avoid Bias. Science, 319(5862), 416-417. DOI: 10.1126/science.1153156

Wong, K.M., Suchard, M.A., Huelsenbeck, J.P. (2008). Alignment Uncertainty and Genomic Analysis. Science, 319(5862), 473-476. DOI: 10.1126/science.1151532