Archive for the ‘Paper reviews’ Category

Detecting ancient admixture and estimating demographic parameters in multiple human populations

Saturday, September 26th, 2009

I read this paper on our way back from Leipzig and then again today to see if I missed anything in the first read through (I was pretty tired at the time).

Detecting ancient admixture and estimating demographic parameters in multiple human populations

Wall, Lohmueller and Plagnol, Mol Biol Evol 26(8):1823-1827

We analyze patterns of genetic variation in extant human polymorphism data from the National Institute of Environmental Health Sciences single nucleotide polymorphism project to estimate human demographic parameters. We update our previous work by considering a larger data set (more genes and more populations) and by explicitly estimating the amount of putative admixture between modern humans and archaic human groups (e.g., Neandertals, Homo erectus, and Homo floresiensis). We find evidence for this ancient admixture in European, East Asian, and West African samples, suggesting that admixture between diverged hominin groups may be a general feature of recent human evolution.

What they do in this paper is to fit a two population coalescent model, with expansion, migration, bottlenecks and the works, to both an African+European and an African+Asian data set, then use this fitted model as a null model of the genetics of the populations.  They then 1) do a test on an LD statistic against this null model, taking rejections of this null model as evidence for admixture from archaic humans, and 2) fit an admixture extension of the model to estimate the level of admixture.  They find evidence for admixture with archaic humans for both data sets, with a somewhat higher degree in the Europeans.

I’m a bit underwhelmed by the paper, I must admit.  I’m not saying that there is no admixture with archaic humans, but this approach does not convince me.

Even when taking various demographic effects into account in the modeling, the null model is unlikely to exactly fit real data.  Taking deviations from the null model as any kind of evidence for admixture thus seems a bit hasty.

Not that I have any better ideas as to how to approach this, just, in my eyes the jury is still out on the question of admixture with archaic humans…


Wall, J., Lohmueller, K., & Plagnol, V. (2009). Detecting Ancient Admixture and Estimating Demographic Parameters in Multiple Human Populations Molecular Biology and Evolution, 26 (8), 1823-1827 DOI: 10.1093/molbev/msp096

HMMoC and HMMConverter

Friday, September 18th, 2009

I just want to say a few words about a short paper I read last week, and a paper that is a few years old now but related to it.

The first is out in advance access in Nucleic Acids Research:

HMMConverter 1.0: a toolbox for hidden Markov models

Lam and Meyer

Hidden Markov models (HMMs) and their variants are widely used in Bioinformatics applications that analyze and compare biological sequences. Designing a novel application requires the insight of a human expert to define the model’s architecture. The implementation of prediction algorithms and algorithms to train the model’s parameters, however, can be a time-consuming and error-prone task. We here present HMMCONVERTER, a software package for setting up probabilistic HMMs, pair-HMMs as well as generalized HMMs and pair-HMMs. The user defines the model itself and the algorithms to be used via an XML file which is then directly translated into efficient C++ code. The software package provides linear-memory prediction algorithms, such as the Hirschberg algorithm, banding and the integration of prior probabilities and is the first to present computationally efficient linear-memory algorithms for automatic parameter training. Users of HMMCONVERTER can thus set up complex applications with a minimum of effort and also perform parameter training and data analyses for large data sets.

The other was published in Bioinformatics in 2007:

HMMoC – a compiler for hidden Markov models

Lunter

Hidden Markov models are widely applied within computational biology. The large data sets and complex models involved demand optimized implementations, while efficient exploration of model space requires rapid prototyping. These requirements are not met by existing solutions, and hand-coding is time-consuming and error-prone. Here, I present a compiler that takes over the mechanical process of implementing HMM algorithms, by translating high-level XML descriptions into efficient C++ implementations. The compiler is highly customizable, produces efficient and bug-free code, and includes several optimizations.

Both papers describe compilers that generate C++ implementations of hidden Markov model algorithms from XML specifications, and really they are very similar.

The basic HMM algorithms are quite straightforward to implement, but more complex models such as pair-HMMs or generalized HMMs bring a few more complications.  If you also need to optimize running time or memory usage, there are more involved algorithms to choose from.  “Banding” – implemented in both HMMoC and HMMConverter – risks giving sub-optimal results but greatly reduces both running time and memory consumption.  The Hirschberg algorithm – only implemented in HMMConverter as far as I can see – trades a doubling in running time for a much reduced memory consumption.

Implementing such extra algorithms is not conceptually hard, but can be quite tedious and error prone, so it makes good sense to have code generators building the algorithms for you.  That is exactly what these tools do.

At a bird’s eye view, the tools are very similar.  You specify the HMM in an XML file (a specification language that I personally don’t like that much, but that is of course very subjective) and the tools then generate the algorithms you ask them to, output as C++ code.

HMMoC provides a number of handles for you to add your own C++ code to the generated code; I am not sure if HMMConverter does the same, but on the other hand HMMConverter provides handles for various constraints on the parameters so it might be easier to re-parameterize models made with that.

Another cool feature unique to HMMConverter is priors on sequence annotation.  You can provide an annotation to the input sequence(s) that is then incorporated in the emission probabilities.  The prior is really on hidden states, but incorporating them into the emission probabilities has exactly the effect you want from them: they weight the posterior probabilities of the hidden states along the input.

To deal with numerical issues, HMMConverter works in log-space while HMMoC uses something called “extended-exponent real numbers”.  Working in log-space can be really slow for the Forward and Backward algorithms, since you have to switch in and out of log-space to deal with sums of probabilities (the Viterbi algorithm doesn’t have this problem, so there the log-space solution is pretty fast).
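To make the difference concrete, here is a minimal Python sketch of the Forward algorithm done both ways: once in log-space and once with per-column re-scaling.  It is written for illustration only and is not taken from either tool; the function names and the dense list-of-lists layout for the model (`init`, `trans`, `emit`) are my own assumptions.  The log-space version needs an `exp` and a `log` for every cell of the table, while the re-scaled version needs a single `log` per column.  The re-scaled version also assumes every observation has non-zero probability, otherwise the scale factor is zero.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p), with log(0) defined as -infinity."""
    return math.log(p) if p > 0 else NEG_INF

def logsumexp(log_values):
    """Numerically stable log(sum(exp(v) for v in log_values))."""
    m = max(log_values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in log_values))

def forward_log(init, trans, emit, obs):
    """Forward algorithm in log-space; returns log P(obs)."""
    states = range(len(init))
    f = [safe_log(init[k]) + safe_log(emit[k][obs[0]]) for k in states]
    for x in obs[1:]:
        # Every sum over predecessor states needs exp() and log() calls.
        f = [safe_log(emit[k][x])
             + logsumexp([f[j] + safe_log(trans[j][k]) for j in states])
             for k in states]
    return logsumexp(f)

def forward_scaled(init, trans, emit, obs):
    """Forward algorithm with per-column re-scaling; returns log P(obs)."""
    states = range(len(init))
    f = [init[k] * emit[k][obs[0]] for k in states]
    log_likelihood = 0.0
    for i in range(len(obs)):
        if i > 0:
            f = [emit[k][obs[i]] * sum(f[j] * trans[j][k] for j in states)
                 for k in states]
        scale = sum(f)              # one log() per column instead of per cell
        f = [v / scale for v in f]  # column now sums to 1, so no underflow
        log_likelihood += math.log(scale)
    return log_likelihood
```

Both functions return the same log-likelihood up to rounding, so the choice between them is purely a question of speed and convenience.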

Unfortunately, there isn’t any comparison between the execution times of algorithms generated with the two tools in the new paper, so I don’t know how much this matters.  In the HMM library I am developing with Andreas we found that the log-solution was very slow, though, and therefore we use a re-scaling approach instead.

I would love to see a comparison of the runtime efficiency between the approaches, but just not quite enough to go and do it myself right now…

  • Lam, T., & Meyer, I. (2009). HMMCONVERTER 1.0: a toolbox for hidden Markov models Nucleic Acids Research DOI: 10.1093/nar/gkp662
  • Lunter, G. (2007). HMMoC a compiler for hidden Markov models Bioinformatics, 23 (18), 2485-2487 DOI: 10.1093/bioinformatics/btm350


Independent mammalian genome contractions following the KT boundary

Wednesday, September 2nd, 2009

Tomorrow it is my turn to present a paper at our genome evolution journal club at BiRC, and I have picked this one:

Independent mammalian genome contractions following the KT boundary

Mina Rho et al. Genome Biology and Evolution, 2009

Abstract

Although it is generally accepted that major changes in the earth’s history are significant drivers of phylogenetic diversification and extinction, such episodes may also have long-lasting effects on genomic architecture. Here we show that widespread reductions in genome size have occurred in multiple lineages of mammals subsequent to the Cretaceous–Tertiary (KT) boundary, whereas there is no evidence for such changes in other vertebrate, invertebrate, or land plant lineages. Although the mechanisms remain unclear, such shifts in mammalian genome evolution may be a consequence of an increase in the efficiency of selection against excess DNA resulting from post-KT population size expansions. Independent historical changes in genome architecture in diverse lineages raise a significant challenge to the idea that genome size is finely tuned to achieve adaptive phenotypic modifications and suggest that attempts to use phylogenetic analysis to infer ancestral genome sizes may be problematical.

We have previously read Michael Lynch’s book on genome architecture and evolution and this paper reads a lot like that book in general theme.

Anyway, the paper looks at the age distribution of LTR repetitive elements.  These are transposable elements that, at the time of insertion, are flanked by two identical long terminal repeat (LTR) sequences.  The two copies then diverge via mutations over time, and from the divergence between them you can date the insertion.
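As a rough illustration of the dating idea, here is a small Python sketch.  It assumes the two LTR copies are already aligned without gaps, that both accumulate substitutions independently at the same rate, and it uses a Jukes-Cantor correction for multiple hits; the age is then K / (2r), where K is the corrected divergence and r the substitution rate per site per year.  The function names are mine and the rate used in the usage note is just a placeholder, not a value from the paper.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

def ltr_insertion_age(ltr5, ltr3, rate_per_site_per_year):
    """Estimate insertion age (in years) from two aligned, gap-free LTR copies."""
    if len(ltr5) != len(ltr3) or not ltr5:
        raise ValueError("LTRs must be aligned, non-empty and of equal length")
    differences = sum(a != b for a, b in zip(ltr5, ltr3))
    k = jukes_cantor_distance(differences / len(ltr5))
    # Both copies mutate, so the divergence accumulates at twice the rate.
    return k / (2 * rate_per_site_per_year)
```

With a hypothetical rate of 2e-9 substitutions per site per year, calling `ltr_insertion_age(five_prime, three_prime, 2e-9)` on two identical LTRs gives an age of zero, and the estimate grows with every difference between the copies.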

If the elements are inserted with a fixed rate B and disappear again with another fixed rate D, we can model this age distribution as a simple birth/death process and the number of elements at time t is given by N_t = B \exp(-Dt).  For several species this fits quite nicely:

but for mammals there is a strange “bulge” after the KT boundary indicating that either the birth rate has dropped recently or that the death rate has increased:

Since this bulge is after the divergence of these lineages, this change in the process must have occurred independently in all these mammals.
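The birth/death model N_t = B \exp(-Dt) is a straight line on a log scale, so a quick way to see whether an age distribution fits is to regress log counts on age.  The sketch below does that with an ordinary least-squares fit; it is my own illustration under that assumption, not the fitting procedure used in the paper, and the function names are made up.  A bulge like the one in mammals would show up as a systematic departure from the fitted line.

```python
import math

def expected_counts(birth, death, ages):
    """Expected number of surviving elements of each age: N_t = B * exp(-D * t)."""
    return [birth * math.exp(-death * t) for t in ages]

def fit_birth_death(ages, counts):
    """Least-squares line through (t, log N_t); returns the estimates (B, D).

    The slope of the line is -D and the intercept is log B.
    All counts must be positive for the logarithm to be defined.
    """
    logs = [math.log(c) for c in counts]
    n = len(ages)
    mean_t = sum(ages) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in ages)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ages, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -slope
```

Feeding the output of `expected_counts` back into `fit_birth_death` recovers the rates that generated it, which is a handy sanity check before trying it on real age bins.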

The hypothesis for what has happened given in the paper is this:  After the extinction of the dinosaurs the mammals have generally increased in numbers in all lineages with a resulting increase in effective population size.  What happens when the effective population size goes up is that selection becomes more efficient compared to genetic drift, so assuming that these elements are slightly deleterious, we would expect that fewer of them get fixed and more of them get removed as the effective population size goes up.

That explanation is of course not proven by the data, but it does fit the pattern observed.

In any case, it is clear that we have experienced a decrease in the recent insertions compared to older elements, which means that unless something else is now taking up the space our genomes are shrinking.

Don’t worry too much about that, though, it is the junk that is disappearing.

Rho, M., Zhou, M., Gao, X., Kim, S., Tang, H., & Lynch, M. (2009). Independent Mammalian Genome Contractions Following the KT Boundary Genome Biology and Evolution, 2009, 2-12 DOI: 10.1093/gbe/evp007

Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation

Wednesday, August 26th, 2009

A few days ago I wrote about the hypothesis of complex speciation between humans and chimps, and today I’ll briefly discuss another paper on the human / chimp speciation:

Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation

Matthew T. Webster, Gene 443 70-75, 2009

Abstract

There is a large variation in divergence times across genomic regions between human and chimpanzee. It has been suggested that this could partly result from selection against ancestral gene flow between incipient species in regions of the genome containing genetic incompatibilities. It is possible that such barriers to gene flow could arise in specific genes or in chromosomal inversions. I analysed patterns of lineage sorting that occur between human, chimpanzee and gorilla genomic sequences by examining divergent site patterns in > 18 Mb genomic alignments. I develop a method to normalise site patterns by the mutational spectrum to minimise errors caused by misinference caused by recurrent mutation. Here I show that divergence times appear to be uniform between coding and noncoding sequences and between inverted and non-rearranged portions of chromosomes. I therefore find no evidence to support the large-scale accumulation of genetic incompatibilities at speciation genes or chromosomal inversions in the ancestral population of humans and chimpanzees. In addition, site patterns that are discordant with the species tree occur more frequently in regions with high human recombination rates. This could indicate the action of selective sweeps in the ancestral population, but could also be indicative of increased rates of homoplasy in these regions. I argue that these observations are compatible with a neutral allopatric model of speciation.

Models of speciation

Speciation happens when gene flow stops between one group of a species and another (and doesn’t start again later or we get something like the hybridization scenario I wrote about in my earlier post).

There are different ways this can happen.  For instance, one group might somehow find itself geographically isolated from the other – e.g. find themselves on the other side of a large river – effectively isolating the group from the rest of the species.  This is known as allopatric speciation (or depending on exactly how this plays out, peripatric speciation).

In this scenario, the speciation happens at the time where the groups are isolated.  From that point and onwards the groups are essentially different species, since gene flow has stopped.  It will take some time before the groups are incapable of inter-breeding, but unless they actually merge again at some time before then, the time of the speciation event is the time the groups get separated.

That doesn’t mean that the genomic divergence time between the two species matches the time back to the speciation event.  Some individuals in one of the groups might be more closely related to individuals in the second group than to the other individuals in the first group for a few generations.  So the genetic distance between the two species is a bit larger than the “species distance”.  Add in recombination and the picture gets a bit more complex.

Still, we can talk about a specific point in time where the speciation occurred and we have a mathematical model – the coalescent model – of the genome distance between the two species that depends on this time and the population genetics in the ancestral species before then.
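To give a feel for how much the genomic divergence time can vary under this simple model, here is a small Python simulation.  It assumes a clean split with no gene flow afterwards, one sequence sampled from each species, and a constant ancestral effective population size, so that the two lineages coalesce an exponentially distributed time (mean 2N generations) before the split.  The function name and all parameter values are hypothetical choices for illustration.

```python
import random

def sample_divergence_times(split_time, ancestral_ne, n_loci, rng):
    """Sample per-locus divergence times (in generations) for two species.

    Each locus diverges at the split time plus an exponential waiting time
    with mean 2 * ancestral_ne generations in the ancestral population.
    """
    mean_wait = 2 * ancestral_ne
    # random.expovariate takes the rate, i.e. 1 / mean.
    return [split_time + rng.expovariate(1 / mean_wait) for _ in range(n_loci)]

rng = random.Random(1)
times = sample_divergence_times(split_time=200_000, ancestral_ne=50_000,
                                n_loci=20_000, rng=rng)
mean_time = sum(times) / len(times)
```

No locus can be younger than the split itself, but the spread above it is as large as the mean waiting time, so a large ancestral population gives a wide range of divergence times along the genome without any selection or hybridization.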

The speciation can also be caused by “genetic isolation”.

If a new mutation enters the group, where homozygotes for either the wildtype or the mutants are fitter than the heterozygotes, then the group will tend to split into two: the mutants and the wildtypes.

Without recombination, there wouldn’t be much difference in the genomic distance between the two resulting species.  The heterozygotes would be selected against and the two homozygotes would diverge.

With recombination, again the situation gets a bit more complicated.  The heterozygotes would still be selected against, but assuming heterozygotes still manage to mate from time to time, you would get homozygous offspring of heterozygotes that are just as fit as other homozygotes.

Because there is selection against heterozygotes you will tend to split the species into two – the two homozygotes – but the divergence will be deeper at the locus of the mutation than in the rest of the genome.

We call such a locus a “speciation gene” and candidates for such genes are functional genes (where we expect some selection) or structural variations such as inversions.

Back to the paper…

What Webster looks at in this paper is the patterns of divergence – especially deep coalescence events with incomplete lineage sorting where we observe sites grouping human and gorilla or chimp and gorilla – in the genome.

He then looks at these patterns in genes, introns, inversions … the candidates for speciation genes, to see if these look more divergent than the rest of the genome.  If so, then the speciation between humans and chimps could be caused by speciation genes.  If not, then the speciation could be allopatric (the same “species divergence” throughout the genome, but of course not the exact same sequence divergence since the coalescence times will still vary along the genome).

Long story short, he doesn’t find any evidence for deeper divergence in these places, so we cannot rule out allopatric speciation here.

He does find a correlation between recombination rate and deep divergence, which can be explained by either increased mutability in regions of high recombination or selective sweeps in the ancestral species.  The latter is much more interesting, really, but we cannot rule out the first explanation so I won’t comment much on this here…

Criticism

I do have a slight problem with the analysis in the paper, though.

It seems to me that just looking at differences in divergence time between genes and the rest of the genome – or between inversions and the rest of the genome or whatnot – is not a particularly powerful way of detecting speciation genes.

When comparing general groups like this, it seems to me that a few speciation genes would simply be drowned out by the larger number of “plain old genes”.  So all the analysis is really saying is that there isn’t a large number of speciation genes between humans and chimps, not that there are none.

The paper doesn’t claim any more than this either, but it would be interesting to work out just how large a fraction of the genes would have to be speciation genes – and how large a difference between the divergence of speciation genes and the rest of the genome there has to be – to be able to distinguish between the two scenarios with this analysis.

I haven’t done the math yet, but I plan to when I get the time…
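A very crude first pass, just to show the kind of calculation I have in mind, could look like the Python sketch below.  It assumes a fraction f of the genes are speciation genes whose mean divergence is shifted by delta, so the mean over all genes is only shifted by f × delta, and it compares that shift to the standard error of a mean over n genes with per-gene standard deviation sigma.  This is a simple z-test style approximation of my own, it ignores the sampling error in the rest of the genome, and all names and numbers are hypothetical.

```python
import math

def diluted_mean_shift(fraction, delta):
    """Shift in the group mean when only a fraction of genes is shifted by delta."""
    return fraction * delta

def min_detectable_fraction(delta, sigma, n_genes, z=1.96):
    """Smallest fraction of shifted genes whose diluted mean shift exceeds
    z standard errors of the group mean (standard error = sigma / sqrt(n_genes))."""
    standard_error = sigma / math.sqrt(n_genes)
    return min(1.0, z * standard_error / delta)
```

For example, `min_detectable_fraction(delta=0.1, sigma=0.5, n_genes=10_000, z=2)` says that roughly one gene in ten would have to be a speciation gene before the group mean stands out, which illustrates how easily a handful of such genes could be drowned out.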


Webster, M. (2009). Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation Gene, 443 (1-2), 70-75 DOI: 10.1016/j.gene.2009.05.006

Doubts about complex speciation between humans and chimpanzees

Wednesday, August 19th, 2009

I read this paper in bed yesterday before going to sleep:

Doubts about complex speciation between humans and chimpanzees

Presgraves and Yi, Trends in Ecology & Evolution 2009

Abstract

Two patterns from large-scale DNA sequence data have been put forward as evidence that speciation between humans and chimpanzees was complex, involving hybridization and strong selection. First, divergence between humans and chimpanzees varies considerably across the autosomes. Second, divergence between humans and chimpanzees (but not gorillas) is markedly lower on the X chromosome. Here, we describe how simple speciation and neutral molecular evolution explain both patterns. In particular, the wide range in autosomal divergence is consistent with stochastic variation in coalescence times in the ancestral population; and the lower human–chimpanzee divergence on the X chromosome is consistent with species differences in the strength of male-biased mutation caused by differences in mating system. We also highlight two further patterns of divergence that are problematic for the complex speciation model. Our conclusions raise doubts about complex speciation between humans and chimpanzees.

Complex speciation between humans and chimpanzees

You might remember the Patterson et al. paper in Nature back in 2006, that argued for a complex speciation of humans and chimps: An early separation between the two, followed by a hybridization and then the extinction of one of the species ancestral to the hybrids.

The arguments for this theory were 1) large variation in divergence time along the autosomal chromosomes and 2) a much more recent divergence of the X chromosome compared to the autosomes.

Wakeley then argued that 1) at least didn’t need any complex speciation history.  The variation in divergence is actually as would be expected just from variation in coalescence times along the chromosomes, assuming a reasonably large effective population size of the human/chimp ancestor species.

As for 2), the coalescence process alone cannot explain the recent divergence of X chromosomes.  We do expect a more recent divergence of X chromosomes than autosomes, since the effective population size of X chromosomes is 3/4 of that of the autosomal chromosomes, but the divergence of the X chromosomes is less than what can be explained by this.

This could either be explained by selection on the X chromosome (which essentially reduces the effective population size and thus leads to a reduced divergence) or by the difference in mutation rate between males and females that would affect the X chromosome differently than the autosomes (reducing the difference between the two).

It is well known that there is a bias in mutation rate between males and females, having to do with the average number of genome replications per generation in males and females, respectively.  The details I won’t go into here (although they are pretty important for the post, the post would just get too long and I don’t want to lose the readers who already know this … I might write about it in a separate post another day…)
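For readers who just want the core relation without the full story: an X chromosome spends two thirds of its time in females and one third in males, while an autosome spends half its time in each.  If alpha is the male-to-female mutation rate ratio, the X-to-autosome mutation rate ratio is then 2(2 + alpha) / (3(1 + alpha)).  The Python sketch below encodes this relation and its inverse; the function names are mine, and it deliberately ignores everything else that affects X versus autosome divergence, such as the difference in effective population size.

```python
def x_to_autosome_mutation_ratio(alpha):
    """Ratio of X to autosomal mutation rate, given alpha = male rate / female rate."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # X: 2/3 of its time in females, 1/3 in males; autosomes: 1/2 in each.
    return 2 * (2 + alpha) / (3 * (1 + alpha))

def alpha_from_ratio(ratio):
    """Invert the relation above to estimate alpha from an observed X/A ratio."""
    if not 2 / 3 < ratio < 4 / 3:
        raise ValueError("ratio must lie strictly between 2/3 and 4/3")
    return (4 - 3 * ratio) / (3 * ratio - 2)
```

With no bias (alpha = 1) the ratio is 1, and as alpha grows the ratio approaches 2/3, so a stronger male bias means a relatively slower X.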

Anyway…

Selection is probably not the explanation.  It would require a pretty uniform selection across the X chromosome.  The male-biased mutation explanation sounds more reasonable.

A problem with both explanations, though – Patterson et al. argued in their reply – is that this weird pattern in X is only observed between human and chimp and not between human and gorilla (or chimp and gorilla).

If mutation-rate differences alone could explain the observed data, we would expect a consistent value for alpha from the human–chimpanzee and human–gorilla divergence data, but estimates of alpha are significantly different (P = 0.001). A high value of alpha also cannot explain other important features in Table 1: the near-absence of sites on chromosome X that cluster humans and gorillas or chimpanzees and gorillas; or why human–gorilla divergence should not be reduced on chromosome X (such a reduction would be expected if high male mutation rate were responsible for low human–chimpanzee genetic divergence on chromosome X).

Lineage specific male biased mutation rate

The Presgraves and Yi paper argues that a male-biased mutation rate can explain the pattern after all.

True, the low divergence on X is only observed between humans and chimps and not between humans and gorillas, but if the strength of this bias is larger on the human and chimp lineages than on the gorilla lineage it could still be an explanation.

Chimps are very promiscuous, humans somewhat less so, while gorillas are polygynous.  This affects sperm production, so chimps produce the most sperm per ejaculation, gorillas the least, and humans are again in between.

With more sperm produced in humans and chimps than in gorillas, it is therefore conceivable that the mutation bias is stronger in chimps and humans than in gorillas.

So they estimate this bias per lineage and get exactly that result: the bias is strongest in chimps, intermediate in humans and weakest in gorillas:

With different male-biased mutation rates in the lineages, and much less bias in gorillas, there is nothing strange in a lower divergence on the X chromosome between humans and chimps than between humans and gorillas.

Voilà!  No more need for a complex speciation history!

At least until the next paper…

  1. Presgraves, D., & Yi, S. (2009). Doubts about complex speciation between humans and chimpanzees Trends in Ecology & Evolution DOI: 10.1016/j.tree.2009.04.007
  2. Patterson N, Richter DJ, Gnerre S, Lander ES, & Reich D (2006). Genetic evidence for complex speciation of humans and chimpanzees. Nature, 441 (7097), 1103-8 PMID: 16710306
  3. Wakeley J (2008). Complex speciation of humans and chimpanzees. Nature, 452 (7184) PMID: 18337768
  4. Patterson, N., Richter, D., Gnerre, S., Lander, E., & Reich, D. (2008). Patterson et al. reply Nature, 452 (7184) DOI: 10.1038/nature06806
