Neanderthal genome paper is out
Friday, May 7th, 2010
What an exciting thing to wake up to! The neanderthal genome has now been published.
Read the buzz about it here:
while I go read the actual paper.
From Science Daily:
Researchers at The University of Texas at Arlington have found the first solid evidence of horizontal DNA transfer, the movement of genetic material among non-mating species, between parasitic invertebrates and some of their vertebrate hosts.
Genome biologist Cédric Feschotte and postdoctoral researchers Clément Gilbert and Sarah Schaack found evidence of horizontal transfer of transposon from a South American blood-sucking bug and a pond snail to their hosts. A transposon is a segment of DNA that can replicate itself and move around to different positions within the genome. Transposons can cause mutations, change the amount of DNA in the cell and dramatically influence the structure and function of the genomes where they reside.
I heard about this in February at a meeting where Cédric gave a talk.
What they have found are families of transposons in different branches of mammals that don’t seem to have been inherited from further up the phylogeny. Some distantly related mammals have them, but their close relatives do not. They appear to have just popped out of nowhere (so Cédric calls them “space invaders”).
They seem to have entered the genomes at roughly the same time, a time when the ancestors of those species lived in the same area. What points to horizontal gene transfer is that parasites that would have fed on these animals carry the same transposon family.
His talk at the meeting was recorded (all the talks were) but I haven’t yet found the videos online so I guess they are still being processed or something. When I find them, I’ll let you know.
Daniel MacArthur discusses genome-wide association studies (GWAS), which have so far mainly found disease-associated polymorphisms outside of genes.
The claim in question is that the tendency of GWAS to find disease associations outside of protein-coding genes is somehow a problem; but, as p-ter notes, there’s perfectly plausible reasons for disease risk variants to be found in non-coding regions.
Indeed, I think most of us working in genomics have seen the proliferation of non-coding hits in GWAS studies as a positive, in that it seems to be teaching us something new and unexpected about the underlying biology of human variation.
There is a problem with polymorphisms outside of genes: we generally have no idea how they functionally affect us to increase or decrease the disease risk. If we have no idea what a given polymorphism does, we don’t really know where to start figuring out the mechanism.
That is the only problem, though, as far as I can see. If the polymorphism is statistically significantly associated with the disease, and we can replicate this in independent data, then that is what the data is saying. It might be inconvenient, but tough luck! No one promised us that this would be easy.
Quoting from Gene Expression:
Their answer to this rhetorical question is that common SNPs (used on current genotyping platforms) are generally nonfunctional. The alternative, the evidence for which I’ll present here, is that our ability to predict functional SNPs is poor. In the phrase “no known function”, the emphasis should be on the word “known”.
GWA studies have been a great success in locating polymorphisms associated with disease that we can actually replicate.
Sure, we are working with very large data sets here, and false positives are a major problem (see e.g. here and here), but this is a problem we can handle.

And sure, GWA lets us find only the CD/CV type of disease associations, and not all diseases will follow this pattern, but with the success of GWA studies so far, I think it is fair to say that there is enough to be found here to make it worthwhile!
Below is the introduction text to some lecture notes I’m working on. I’m putting them up here to get some feedback, since this is the part in my lecture notes I am the least sure about. The rest of the notes will be on mathematical models, and I am pretty confident that I understand those, but my paleontology knowledge is shaky at best, so any corrections, comments or suggestions for papers I should read will be most welcome!
When estimating the evolutionary relationship between species we have two sources of data we can use to date when species diverged: fossil evidence and genetic evidence, the latter based on the assumption of the molecular clock that lets us estimate divergence time based on the observed differences between genomic sequences. Both are by their very nature biased, but in opposite directions. Dates based on fossil evidence give us a lower bound on the speciation time, while genetic evidence gives us an upper bound on the speciation time [1].
Fossils can be dated reasonably accurately through physical or geological methods, but they rely on morphological differences between species. Morphological characteristics unique to one set of species, when found in a fossil, tell us that the given group of species diverged from other species before the time when the fossil species existed. Deciding which morphological features are unique to a given group of species is, of course, somewhat subjective, but ignoring this, the fossil date is only a lower bound on the species split, since the morphological features will have to have evolved after the species split. How long it took for these features to evolve, plus how close the fossil is to the emergence of the features considering possible gaps in the fossil record, influences how tight the lower bound is.
For genetic data, on the other hand, population genetics in ancestral species influences the dating of species splits. The coalescence process [2] in population genetics means that when we consider two genomes in the same population, they have a most recent common ancestor (MRCA) some distance back in time. When considering two genomes from different species, the MRCA is found at a distance back in time first given by the divergence of the species, and then the divergence the two genomes have within the ancestral species.
The divergence of genomes within a species depends on the effective population size, a technical term referring to the population size of reproducing genomes. The larger the effective population size, the further back in time the MRCA will be found. On average, the number of generations back in time at which the MRCA is found is equal to the effective number of genomes in the population. So for two genomes in the same species, we expect their MRCA to be found 2Ne generations back in time, where Ne is the number of diploid individuals reproducing, and the factor of two is there because in a diploid population of (effective) size Ne, there are 2Ne genomes.
The genetic distance between two species is therefore given by the species split plus 2Ne generations, and the genetic distance is thus an upper bound on the species divergence.
For humans, the effective population size is ~10,000, so two random human genomes are expected to have diverged around 20,000 generations ago, or 400 thousand years ago (kya) assuming a generation time of 20 years along the lineages back to the MRCA. This puts the sequence divergence of humans, whose species divergence is of course zero, back to a point before the evolution of modern Man and before the speciation between modern humans and Neanderthals.
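The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and its parameters are made up for illustration, and it uses nothing beyond the 2Ne expectation and the 20-year generation time stated in the text.

```python
def expected_mrca_years(effective_size, generation_time_years):
    """Expected time back to the MRCA of two genomes from one diploid population.

    Uses the expectation from the text: 2 * Ne generations, converted to
    years with an assumed constant generation time.
    """
    generations = 2 * effective_size
    return generations * generation_time_years


# Human numbers from the text: Ne ~ 10,000 and 20 years per generation.
print(expected_mrca_years(10_000, 20))  # 400000
```

Doubling either the effective population size or the generation time doubles the expected sequence divergence within the species, which is why both assumptions matter for the dating.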
We cannot directly observe the divergence between genomes, so genetic dating of speciation relies on the observed differences between genomic sequences. An underlying assumption when doing this is that mutations to genomic sequences occur at a constant rate through time, so the number of mutations is proportional to the time separating the genomes: two times the divergence time, since mutations occur on both branches of the split.
We cannot directly observe the number of mutations that occurred between species either. We can only observe the differences between the sequences we have. Mutations that occur on lineages that are eventually lost from a population, because they leave no present-day offspring, cannot be observed; only those that survive into the sequenced genomes can. Mutations that spread to the entire species are said to be fixed in the species, and we call such mutations substitutions. When comparing genomic sequences from different species, we mainly observe such fixed mutations, unless the species are very closely related and polymorphism in the ancestral species has not yet been fixed within the descendant species.
For neutrally evolving sequences, sequences not under selection, the rate of substitution is equal to the rate of mutation [2]. That is, the rate at which substitutions are fixed within a species through the population genetics process equals the rate at which mutations occur in a single genome. For species such as primates, we expect most of the genome to be evolving neutrally, since the genomes of these species consist mainly of “junk” DNA that is unlikely to be under selection.
Assuming that the sequences are mainly evolving neutrally, and that mutations occur at a regular rate, we can estimate the number of mutations that occurred between two species using so-called substitution models. These compensate for recurring mutations (mutations at the same genomic site) and translate the number of observed differences between two sequences into the expected number of mutations that occurred.
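To make the idea of a substitution model concrete, here is a sketch using the Jukes-Cantor model. That choice is my assumption: it is the simplest such model, and the notes do not say which model is used. It converts the observed proportion of differing sites p into an expected number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The function name and the toy sequences are made up for illustration.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Expected substitutions per site under the Jukes-Cantor model.

    The observed proportion of differing sites, p, underestimates the number
    of mutations because some sites have mutated more than once. The model
    corrects for this with d = -(3/4) * ln(1 - (4/3) * p).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    p = differences / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences are saturated; the distance is undefined")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy aligned sequences (not real data): 1 difference in 10 sites, p = 0.1.
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The corrected distance is always at least as large as p, and the gap grows as the sequences become more divergent.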
Since mutations enter the sequences through a chemical/physical process, the assumption of a regular rate is not far-fetched, and in general there is a close correlation between divergence of species from fossil evidence and the number of mutations estimated from the substitution models. The rate of substitutions does seem to vary somewhat between divergent species groups, with a slow-down in apes compared to old world monkeys and with slight variations even within different primate groups [3]. Within a group of closely related species, however, such as the great apes, the evidence generally seems to justify the molecular clock assumption [3].
There is one important caveat, however: we might be able to estimate the number of mutations that occurred, but if we do not know the rate at which new mutations occur, we cannot translate the number of mutations into years of divergence.
To translate the number of mutations that occurred in a time interval into the number of years of the time interval, we need to know either the rate with which mutations occur, or how long the time interval was. We need to calibrate the molecular clock.
The approach typically taken is to have a calibration point, a point in time where we are reasonably sure we know the divergence time of two sequences in years, and use the number of mutations between the two sequences to give us the number of mutations per year.
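The calibration step can be sketched in a few lines of Python. The function names are hypothetical and the distances in the example are illustrative placeholders, not measurements. The factor of two follows the earlier remark that mutations accumulate on both branches of a split.

```python
def calibrate_rate(calibration_distance, calibration_years):
    """Mutations per site per year, from a pair with a known divergence time.

    Mutations accumulate on both branches, so the pair is separated by
    twice the divergence time.
    """
    return calibration_distance / (2 * calibration_years)


def divergence_years(distance, rate):
    """Translate a distance in mutations per site into years of divergence."""
    return distance / (2 * rate)


# Illustrative placeholder numbers only: a calibration pair that differs by
# 0.06 mutations per site and is assumed to have split 25 million years ago.
rate = calibrate_rate(0.06, 25e6)
print(rate)
print(divergence_years(0.012, rate))
```

With these made-up numbers, a pair separated by one fifth of the calibration distance is dated to one fifth of the calibration age. Every date obtained this way inherits the uncertainty of the calibration point.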
If we pick a point far enough back in time, the relative difference between the sequence divergence and the species divergence will be small. The difference between the two will be 2Ne generations which might be a difference of hundreds of thousands of years; relatively little if the species divergence is in millions of years.
Of course, we cannot go so far back in time that the mutation rate has changed, so there is a trade-off between the relative difference in sequence divergence and species divergence and how conserved the mutation rate is.
For dating the evolution of great apes, one calibration point is the divergence between old world monkeys and apes (catarrhines; lesser apes and greater apes). Based on fossil evidence we expect the split to be between ~20 million years ago (mya) and ~30 mya [1]. That is, we have fossils indicating that the split had occurred ~20 mya and fossils that are believed to be older than the split at ~30 mya.
Only the lower bound of this informs us of the split time, however. The lack of fossil evidence is not evidence that the split occurred later than ~30 mya. Absence of evidence, after all, is not evidence of absence.
Still, it gives us a tentative calibration point, with a relative uncertainty of ~30% of the divergence of the two groups of species.
The genetic (sequence) divergence between two genomes is an upper bound of the species divergence, but as a consequence of the calibration problem, genetic estimates of divergence can turn out to underestimate the speciation times.
If the calibration point underestimates the number of years since the species split, the number of mutations per year will be overestimated. Consequently, the genetic estimates, while over-estimating the species split in number of mutations, will underestimate the years separating genomes [1].
This has consequences for our inference about the evolution of great apes and the relationship between humans and our ancestors. Calibrating the molecular clock based on an old world monkey / ape divergence of ~25 mya, a time point in the middle of the expected divergence time, will put fossils such as Ardipithecus, Orrorin and Sahelanthropus further back in time than the split between human and chimpanzee, while a calibration point based on a ~30 mya divergence of old world monkeys and apes would put the same fossils after the split between human and chimpanzee, potentially on the lineage leading to humans [1,4].
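A small sketch shows why the choice of calibration age moves every other date with it. Under a strict molecular clock a date estimate is the ratio of two distances multiplied by the assumed calibration age, so it scales linearly with that age. The function name and the 5-million-year example are hypothetical, and real analyses with rate variation will not scale this simply.

```python
def rescale_estimate(estimated_years, assumed_calibration_years,
                     alternative_calibration_years):
    """Re-express a clock-based date under a different calibration age.

    With a strict clock, estimate = (distance / calibration distance)
    * calibration age, so changing the calibration age rescales the
    estimate by the same factor.
    """
    scale = alternative_calibration_years / assumed_calibration_years
    return estimated_years * scale


# Hypothetical example: a split dated to 5 million years with a 25 mya
# calibration, re-expressed under a 30 mya calibration.
print(rescale_estimate(5e6, 25e6, 30e6))
```

Moving the calibration from ~25 mya to ~30 mya makes every estimate 20% older, which is enough to move a fossil from before an estimated split to after it.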
Conversely, assuming that Sahelanthropus is on the human-specific lineage puts the human-chimpanzee split in the range of 6-7 mya. Using this as a calibration point, the ape / old world monkey divergence is estimated at ~27 mya for the lower end of the calibration interval and ~36 mya for the upper end of the calibration interval [3].
Incorrect calibration of the molecular clock can thus turn what should be an upper bound into an underestimate of the sequence divergence, when measured in years rather than number of mutations. An underestimate of an upper bound tells us next to nothing about the true value, unless we have some grasp of how tight the bounds are, but unfortunately this is the best knowledge we currently have about the divergence time of species.
Our best approach to alleviating this problem is working out the uncertainties in the upper and lower bounds, and that way discarding extreme consequences of the calibration.
From population genetics theory we can make inferences about the relative over-estimation caused by the sequence divergence within the coalescence process and disentangle the species divergence from the sequence divergence [5,6]. From this we can tighten the intervals consistent with the fossil record.
Here are two very interesting posts on ancient admixture / introgression between Homo sapiens and ancestral Homos by Razib:
I’m still not completely convinced by the various studies showing evidence for this; many of them are heavily model based and there are just so many possible artifacts here. Speaking from experience. Still, it isn’t that far-fetched that the simple out of Africa model is too simple to be correct.
I had actually missed the PLoS ONE paper discussed in the second of the posts, but skimmed through it quickly today. It looks very interesting, but I want to get my head around the model used. Hopefully I can find time for that early next week.
I really look forward to reading the Neandertal paper and seeing what it has to say about gene flow between us and Neandertals. A few months ago, while I visited his group in Leipzig, Svante Pääbo actually promised to show me the draft, but it never happened. In Ohio in February I talked to one of the authors on the paper and he wouldn’t reveal anything… I guess I just have to wait and can only hope that it won’t be too long.
Update: See also John Hawks Population models and testing human origins