Archive for the ‘Paper reviews’ Category

Phylogenomics of primates and their ancestral populations

Tuesday, November 17th, 2009

If you are interested in phylogenomics and primate evolution — including human evolution — this new review in Genome Research is a must read.

Phylogenomics of primates and their ancestral populations

Adam Siepel

Genome assemblies are now available for nine primate species, and large-scale sequencing projects are underway or approved for six others. An explicitly evolutionary and phylogenetic approach to comparative genomics, called phylogenomics, will be essential in unlocking the valuable information about evolutionary history and genomic function that is contained within these genomes. However, most phylogenomic analyses so far have ignored the effects of variation in ancestral populations on patterns of sequence divergence. These effects can be pronounced in the primates, owing to large ancestral effective population sizes relative to the intervals between speciation events. In particular, local genealogies can vary considerably across loci, which can produce biases and diminished power in many phylogenomic analyses of interest, including phylogeny reconstruction, the identification of functional elements, and the detection of natural selection. At the same time, this variation in genealogies can be exploited to gain insight into the nature of ancestral populations. In this Perspective, I explore this area of intersection between phylogenetics and population genetics, and its implications for primate phylogenomics. I begin by “lifting the hood” on the conventional tree-like representation of the phylogenetic relationships between species, to expose the population-genetic processes that operate along its branches. Next, I briefly review an emerging literature that makes use of the complex relationships among coalescence, recombination, and speciation to produce inferences about evolutionary histories, ancestral populations, and natural selection. Finally, I discuss remaining challenges and future prospects at this nexus of phylogenetics, population genetics, and genomics.

…and if you are wondering why my blog is so quiet these days, it is because I am swamped with four of the genome projects mentioned in the paper: orangutan, bonobo, gorilla and macaque…

Any summary of this paper that I write will not really do justice to it — you really should read it yourself and you will be happy you did — so I’ll just briefly summarize the topics that Adam covers.

First he covers basic phylogenetics, that is figuring out species relationships.  This is, by now, a well known field and essentially boils down to modeling sequence evolution as Markov chains so you can estimate divergence times and tree relationships from the substitutions between sequences.

For closely related species, though, that is only a small part of the picture, and the more interesting part of the paper involves introducing population genetics to phylogenetics.  You have to remember that speciation somehow involves populations; two species do not just split up, rather groups of individuals diverge and their genomes start diverging as groups rather than individuals.  That leads to varying sequence divergence as you scan along the genomes, and under certain conditions to incomplete lineage sorting, where gene trees are different from species trees.

This doesn’t just cause complications in genomic inference, though.  It provides valuable information about ancestral species and about speciation processes, which is the next topic Adam covers.  For primates, this is especially important.  The time intervals between speciations are short, and the ancestral effective population sizes are large *, so 1) if you ignore this your results will be way off, but 2) if you embrace it you have a lot of information to learn about the ancestry of the primates.

This then leads us to speciation models.  There are plenty of those, where the simplest (allopatric speciation) just assumes that some barrier appears between two populations after which they evolve independently to the point where they can no longer reproduce as hybrids.  That is probably a good model for the chimp/bonobo split, where the Congo River got in the way (chimps can’t swim), but it is a bit simple so more complex scenarios are worth considering for most speciation events.  The point here just is that different scenarios will leave different signals in the genomes, and we should be able to work this out by looking at the extant genomes.

There’s a nice review of the work done so far in the paper, but honestly we are still only at the starting phase of modeling this, and a lot of work remains before we can say anything conclusively about any of the primate speciations.

Next we get to selection.  With the whole neutral theory we have come to believe that we can explain most of genome evolution with neutral mutations (well, I have anyway, but that might just be me).  Recent results, though, hint at selection being a major force in genome evolution after all. My older colleagues tell me that selection played a much larger role in the theory years back, but my background gave me the intuition that it could pretty much be ignored when comparing genomes; maybe I was wrong on that.

Perhaps the null model when we look at entire genomes shouldn’t be neutrality after all, I don’t know… We are seeing signals to that effect in our own work, anyway, but I’ll tell you all about that later when those papers are out, for now let’s just read Adam’s paper that is much more interesting anyway!

The last part of the paper is on Future Prospects.  Well, most papers are, so no surprise there, but if you are getting into the field there are some interesting areas to start thinking about in this review.

How do we incorporate the ancestral recombination graph (ARG) into phylogenetic analysis?  How do we model it without the combinatorial state space explosion?  How do we infer anything usable from the weak signals that are in the data for this? How do we combine model sophistication with computational efficiency to alleviate the state space explosion? Which model assumptions are essential and which can we get away with approximating?

Let me add a few of my own: How do we model this complex system without too much complex math, so that when we have results we can actually interpret them?  How do we check whether deviations from our model actually show evidence for one model over another, and are not just showing that we have the wrong model?

Go read the paper!  Seriously, it is a great read!

* Yeah, about ancestral population sizes… there are consistent estimates of very large ancestral effective population sizes, obtained with very different methods, so it generally seems like the ancestral species were more diverse than the extant species are.  The consistency of the results across different methods indicates that this might be true.  It is still somewhat suspicious, but I guess we will learn more over the coming years as we get more data and more sophisticated methods.


Siepel, A. (2009). Phylogenomics of primates and their ancestral populations Genome Research, 19 (11), 1929-1941 DOI: 10.1101/gr.084228.108


Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Wednesday, September 30th, 2009

Two of my main interests are hidden Markov models and selection.  A paper from this spring, in Genetics, combines the two:

Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models

Boitard, Schlötterer and Futschik

Detecting and localizing selective sweeps on the basis of SNP data has recently received considerable attention. Here we introduce the use of hidden Markov models (HMMs) for the detection of selective sweeps in DNA sequences. Like previously published methods, our HMMs use the site frequency spectrum, and the spatial pattern of diversity along the sequence, to identify selection. In contrast to earlier approaches, our HMMs explicitly model the correlation structure between linked sites. The detection power of our methods, and their accuracy for estimating the selected site location, is similar to that of competing methods for constant size populations. In the case of population bottlenecks, however, our methods frequently showed fewer false positives.

Selective sweeps

Under a simple Wright-Fisher model, a neutral mutation that has just been introduced into a population will drift up and down in frequency until it is eventually either fixed in the population, which happens with probability \frac{1}{2N_e}, or lost from the population again, which of course happens with probability 1-\frac{1}{2N_e}.
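To see the fixation probability come out of the model, here is a minimal simulation sketch in Python. It assumes a population of 2N chromosomes that is resampled binomially each generation, starting from a single mutant copy; the function name, the seed and the toy population size are my own choices for illustration.

```python
import random

def fixation_probability(n_chromosomes, replicates, seed=1):
    """Estimate the chance that a single new neutral mutation reaches fixation."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        count = 1  # one mutant copy among n_chromosomes (= 2N)
        while 0 < count < n_chromosomes:
            freq = count / n_chromosomes
            # Wright-Fisher step: each chromosome in the next generation
            # picks a mutant parent with probability equal to the current frequency.
            count = sum(1 for _ in range(n_chromosomes) if rng.random() < freq)
        if count == n_chromosomes:
            fixed += 1
    return fixed / replicates

two_n = 20
estimate = fixation_probability(two_n, replicates=20000)
print(f"simulated: {estimate:.4f}, theory 1/(2N): {1 / two_n:.4f}")
```

With a population this small the simulation runs in a second or two; the estimate should land close to the theoretical value, with the usual sampling noise.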

The expected time from when such a mutation is introduced into the population until it is fixed, if it is lucky enough to be fixed, is about 4N_e generations.  During this time, the descendant chromosomes of the original mutant chromosome will be subjected to new mutations and to recombinations.

Once this mutation is fixed, everyone in the population will of course share that particular mutation (ignoring back-mutations and such here), but because of recombination nearby sites will not necessarily all be derived from the original mutation chromosome.  Close to the mutation site — where few recombinations will have broken up the sequence — most chromosomes will be derived from the mutation chromosome and as we move away from the mutation site fewer chromosomes will be derived from that original chromosome.

Now, if the mutation introduced has a selective advantage, essentially the same process will play out.  In each generation there is a slightly higher chance that this mutation will leave offspring, but that is essentially the only difference.

What this means is that initially there is still a very good chance that the mutation will be lost — even with slightly better odds accidents do happen — but once the mutation has reached a reasonable frequency it is almost guaranteed to reach fixation — unless a lot of accidents happen.

Once the frequency of the site under selection is high enough it will very quickly reach fixation.  The expected time it takes depends on the selection strength, but unless the selective advantage is very small it will reach fixation a lot faster than if it were neutral.  Think logarithmic time in the size of the population compared to linear time.

Since it reaches fixation much faster than a neutral mutation, fewer mutations and fewer recombinations will have time to occur, so a much wider region around the mutation site will be shared by all descendant chromosomes.  Combined, this means that for a selected site you expect a wide region with a more recent shared ancestor than you would expect at a neutral site, a phenomenon called a selective sweep.

Site frequency spectra

Now, from the population genetics model you can work out (by putting your thinking hat on or by just simulating) the expected distribution of derived and ancestral alleles: the site frequency spectrum.  This will be different for neutral and selected sites because of the shorter time back to the common ancestor for the selected sites.  The shorter time means that there is a general reduction in polymorphism near a selected site, and derived alleles that appeared on chromosomes with the beneficial mutation will be at a higher frequency than they would be if they weren’t “hitchhiking” on the selection of the beneficial mutation.
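For concreteness, here is a small Python sketch of what a site frequency spectrum is computationally. It assumes the data is a 0/1 matrix with one row per sampled chromosome and 1 marking the derived allele; the function names and the toy sample are made up for illustration. The neutral expectation used for comparison is the standard result that the expected number of sites with i derived copies is theta/i.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry i-1 counts sites where i chromosomes carry the derived allele."""
    n = len(haplotypes)
    spectrum = [0] * (n - 1)
    for column in zip(*haplotypes):  # iterate over sites
        derived = sum(column)
        if 0 < derived < n:  # skip non-segregating sites
            spectrum[derived - 1] += 1
    return spectrum

def neutral_expectation(n, theta):
    """Expected SFS under the standard neutral model: theta / i for i = 1..n-1."""
    return [theta / i for i in range(1, n)]

# Toy data: 4 chromosomes (rows), 6 sites (columns); 1 = derived allele.
sample = [
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1],
]
print(site_frequency_spectrum(sample))   # [2, 1, 1]
print(neutral_expectation(4, theta=2.0))
```

A sweep scan then boils down to asking where along the sequence the observed spectrum looks more like the hitchhiking expectation than the neutral one.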

The pattern is a bit complicated by recombination, since you need to take into account that the further away from the selected site you look, the weaker the hitchhiking effect will be; a new mutation can only hitchhike as long as it is linked to the selected site, and recombinations break that link.

Anyway, the different spectra of derived and ancestral alleles can be used to detect selective sweeps.  Two methods that exploit this, and that are relevant for this post, are Kim and Stephan (2002) and Nielsen et al. (2005).

Of course, selection is not the only thing that can mess up the site frequency spectrum and make it different from the expected neutral distribution.  Demographic effects like expanding populations and bottlenecks can look very similar to selection effects, so we cannot absolutely rule out neutrality if we see a deviation from the expected spectrum.  Still, the site frequency spectra of neutrality versus selection can be used for scanning for selection.

Detecting sweeps in a hidden Markov model

The new result in the Genetics paper is a hidden Markov model that uses site frequency spectra to scan for selective sweeps.

Using an HMM means that the model can capture spatial patterns along a genome, including transitions from “neutral” regions (where no sweep has occurred or is occurring) to “selected” regions (where a sweep has occurred or is occurring).  So you don’t have to assume that a locus you are looking at is either a neutral region or a selected region, and you don’t have to fiddle around with sliding windows to scan a genome; you explicitly capture the changing patterns.

This is one of the nice properties of HMMs for genomic scans, and the reason I love them so much.

The model Boitard et al. develop is quite simple.  They have three states: a neutral state, a selected state, and an intermediate used to capture sites that are slightly caught up in the hitchhiking but not close enough to a selected site to get the full effect.

The transition matrix has a single parameter, p, that is the probability that a neutral or selected site switches to the intermediate state (and the intermediate state switches to those two with equal probability set to p/2).

T=\begin{pmatrix}1-p&p&0\\ p/2&1-p&p/2\\ 0&p&1-p\end{pmatrix}

This of course has the unfortunate effect that the prior distribution (stationary distribution) of the chain will give you a 25% chance of a site being neutral, a 25% chance of it being selected and a 50% chance of it being intermediate, which doesn’t really match my expectation of the amount of selection in, say, a human genome. Also, the (prior) expected length of a swept region is the same as that of a neutral region, which does not match my intuition either.  With enough data, though, the likelihood should overrule the prior so perhaps it is not too much of a worry…
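The 25/50/25 split is easy to check numerically. The sketch below builds the matrix T with the states ordered neutral, intermediate, selected and finds the stationary distribution by simple power iteration; the value of p and the function names are arbitrary choices of mine, not anything taken from the paper.

```python
def transition_matrix(p):
    """States in order: neutral, intermediate, selected."""
    return [
        [1 - p, p, 0.0],
        [p / 2, 1 - p, p / 2],
        [0.0, p, 1 - p],
    ]

def stationary_distribution(matrix, iterations=10000):
    """Power iteration: repeatedly apply pi <- pi * T from a uniform start."""
    k = len(matrix)
    pi = [1.0 / k] * k
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(k)) for j in range(k)]
    return pi

p = 0.01
pi = stationary_distribution(transition_matrix(p))
print([round(x, 3) for x in pi])  # [0.25, 0.5, 0.25]
# Every state is left with probability p per site, so run lengths are
# geometric with the same mean 1/p for all three states.
print("expected run length in any state:", 1 / p)
```

The result does not depend on p: balancing the flow between neighbouring states forces the intermediate state to carry twice the weight of either end state.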

The emissions of the model are frequencies of derived alleles, so for each site it will emit a frequency that depends on the state.  This is where they capture the different expected frequencies depending on whether a site is neutral or selected.

They use the Kim and Stephan and the Nielsen et al. methods for this, to develop three variations of HMMs: HMMA, using Kim and Stephan; HMMB, using Nielsen et al.; and HMMB-SEQ, which also uses Nielsen et al. but only considers segregating sites.  The latter is only for comparison purposes and of course ignores a lot of the information in the data, since the number of non-segregating sites reflects the general level of polymorphism in a region, which in turn depends on the depth of the local genealogy and will be affected by selection.

They use simulations under neutrality to fix the parameter p so they get a 5% false positive rate, and then use the models to scan for sweeps.

They get okay power for detecting sweeps, but compared to the previous methods they don’t gain much, since those did pretty well too:

Table 1

Where they refer to this table in the paper they say they have a higher power, but compared to the CLsw column, Kim and Stephan’s method, they do not.  After all, it is difficult to beat a power of 1.

They do, however, appear to be more robust to bottlenecks where the two other methods have very high false positive rates:

Table 5


Boitard, S., Schlötterer, C., & Futschik, A. (2009). Detecting Selective Sweeps: A New Approach Based on Hidden Markov Models Genetics, 181 (4), 1567-1578 DOI: 10.1534/genetics.108.100032

Not exactly an impressive success rate…

Saturday, September 26th, 2009

From my own experience I know that it can be hard to get access to data that you would really love to analyse, but I didn’t expect it to be quite this bad, even for data that is required to be available by the journals where the papers describing the data are published:

Empirical study of data sharing by authors publishing in PLoS journals

Savage and Vickers, PLoS ONE 2009

Background

Many journals now require authors share their data with other investigators, either by depositing the data in a public repository or making it freely available upon request. These policies are explicit, but remain largely untested. We sought to determine how well authors comply with such policies by requesting data from authors who had published in one of two journals with clear data sharing policies.

Methods and Findings

We requested data from ten investigators who had published in either PLoS Medicine or PLoS Clinical Trials. All responses were carefully documented. In the event that we were refused data, we reminded authors of the journal’s data sharing guidelines. If we did not receive a response to our initial request, a second request was made. Following the ten requests for raw data, three investigators did not respond, four authors responded and refused to share their data, two email addresses were no longer valid, and one author requested further details. A reminder of PLoS’s explicit requirement that authors share data did not change the reply from the four authors who initially refused. Only one author sent an original data set.

Conclusions

We received only one of ten raw data sets requested. This suggests that journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.

Getting a 10% success rate when it should be 100% is pretty bad…


Detecting ancient admixture and estimating demographic parameters in multiple human populations

Saturday, September 26th, 2009

I read this paper on our way back from Leipzig and then again today to see if I missed anything in the first read through (I was pretty tired at the time).

Detecting ancient admixture and estimating demographic parameters in multiple human populations

Wall, Lohmueller and Plagnol, Mol Biol Evol 26(8):1823-1827

We analyze patterns of genetic variation in extant human polymorphism data from the National Institute of Environmental Health Sciences single nucleotide polymorphism project to estimate human demographic parameters. We update our previous work by considering a larger data set (more genes and more populations) and by explicitly estimating the amount of putative admixture between modern humans and archaic human groups (e.g., Neandertals, Homo erectus, and Homo floresiensis). We find evidence for this ancient admixture in European, East Asian, and West African samples, suggesting that admixture between diverged hominin groups may be a general feature of recent human evolution.

What they do in this paper is to fit a two population coalescent model, with expansion, migration, bottlenecks and the works, to both an African+European and an African+Asian data set, then use this fitted model as a null model of the genetics of the populations.  They then 1) do a test on an LD statistic against this null model, taking rejections of this null model as evidence for admixture from archaic humans, and 2) fit an admixture extension of the model to estimate the level of admixture.  They find evidence for admixture with archaic humans for both data sets, with a somewhat higher degree in the Europeans.

I’m a bit underwhelmed by the paper, I must admit.  I’m not saying that there is no admixture with archaic humans, but this approach does not convince me.

Even when taking various demographic effects into account in the modeling, the null model is unlikely to exactly fit real data.  Taking deviations from the null model as any kind of evidence for admixture thus seems a bit hasty.

Not that I have any better ideas as to how to approach this, just, in my eyes the jury is still out on the question of admixture with archaic humans…


Wall, J., Lohmueller, K., & Plagnol, V. (2009). Detecting Ancient Admixture and Estimating Demographic Parameters in Multiple Human Populations Molecular Biology and Evolution, 26 (8), 1823-1827 DOI: 10.1093/molbev/msp096

HMMoC and HMMConverter

Friday, September 18th, 2009

I just want to say a few words about a short paper I read last week, and a paper that is a few years old now but related to it.

The first is out in advance access in Nucleic Acids Research:

HMMConverter 1.0: a toolbox for hidden Markov models

Lam and Meyer

Hidden Markov models (HMMs) and their variants are widely used in Bioinformatics applications that analyze and compare biological sequences. Designing a novel application requires the insight of a human expert to define the model’s architecture. The implementation of prediction algorithms and algorithms to train the model’s parameters, however, can be a time-consuming and error-prone task. We here present HMMCONVERTER, a software package for setting up probabilistic HMMs, pair-HMMs as well as generalized HMMs and pair-HMMs. The user defines the model itself and the algorithms to be used via an XML file which is then directly translated into efficient C++ code. The software package provides linear-memory prediction algorithms, such as the Hirschberg algorithm, banding and the integration of prior probabilities and is the first to present computationally efficient linear-memory algorithms for automatic parameter training. Users of HMMCONVERTER can thus set up complex applications with a minimum of effort and also perform parameter training and data analyses for large data sets.

The other was published in Bioinformatics in 2007:

HMMoC – a compiler for hidden Markov models

Lunter

Hidden Markov models are widely applied within computational biology. The large data sets and complex models involved demand optimized implementations, while efficient exploration of model space requires rapid prototyping. These requirements are not met by existing solutions, and hand-coding is time-consuming and error-prone. Here, I present a compiler that takes over the mechanical process of implementing HMM algorithms, by translating high-level XML descriptions into efficient C++ implementations. The compiler is highly customizable, produces efficient and bug-free code, and includes several optimizations.

Both papers describe compilers that generate C++ implementations of hidden Markov model algorithms from XML specifications, and really they are very similar.

The basic HMM algorithms are quite straightforward to implement, but more complex models such as pair-HMMs or generalized HMMs bring a few more complications to deal with.  If you need to optimize the algorithms in either runtime or memory usage, there are also some more complex algorithms you can use.  One is “banding” (implemented in both HMMoC and HMMConverter), which risks giving sub-optimal results but at a much reduced running time and memory consumption.  Another is the Hirschberg algorithm (only implemented in HMMConverter as far as I can see), which exchanges a doubling in running time for a much reduced memory consumption.

Implementing such extra algorithms is not conceptually hard, but can be quite tedious and error prone, so it makes good sense to have code generators building the algorithms for you.  That is exactly what these tools do.

At a bird’s eye view, the tools are very similar.  You specify the HMM in an XML file (a specification language that I personally don’t like that much, but that is of course very subjective) and the tools then generate the algorithms you ask them to, output as C++ code.

HMMoC provides a number of handles for you to add your own C++ code to the generated code; I am not sure if HMMConverter does the same, but on the other hand HMMConverter provides handles for various constraints on the parameters so it might be easier to re-parameterize models made with that.

Another cool feature unique to HMMConverter is priors on sequence annotation.  You can provide an annotation to the input sequence(s) that is then incorporated in the emission probabilities.  The prior is really on hidden states, but incorporating them into the emission probabilities has exactly the effect you want from them: they weight the posterior probabilities of the hidden states along the input.

To deal with numerical issues, HMMConverter works in log-space while HMMoC uses something called “extended-exponent real numbers”.  Working in log-space can be really slow for the Forward and Backward algorithms, since you have to switch in and out of log-space to deal with sums of probabilities (the Viterbi algorithm doesn’t have this problem, so there the log-space solution is pretty fast).

Unfortunately, there isn’t any comparison between the execution times of algorithms generated with the two tools in the new paper, so I don’t know how much this matters.  In the HMM library I am developing with Andreas we found that the log-solution was very slow, though, and therefore we use a re-scaling approach instead.
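To make the difference between the two numerical strategies concrete, here is a small Python sketch of the Forward algorithm written both ways: once in log-space with a log-sum-exp at every step, and once with per-site rescaling. This is a toy illustration only, not code from HMMoC, HMMConverter or our own library, and the three-state model and its parameters are made up for the example.

```python
import math

def forward_log(obs, T, E, init):
    """Forward algorithm in log-space; every sum needs exp and log calls."""
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    def logsumexp(values):
        m = max(values)
        if m == float("-inf"):
            return m
        return m + math.log(sum(math.exp(v - m) for v in values))

    k = len(init)
    alpha = [log(init[s]) + log(E[s][obs[0]]) for s in range(k)]
    for o in obs[1:]:
        alpha = [logsumexp([alpha[s] + log(T[s][t]) for s in range(k)]) + log(E[t][o])
                 for t in range(k)]
    return logsumexp(alpha)

def forward_scaled(obs, T, E, init):
    """Forward algorithm with rescaling: plain arithmetic plus one log per site."""
    k = len(init)
    alpha = [init[s] * E[s][obs[0]] for s in range(k)]
    loglik = 0.0
    for i in range(len(obs)):
        if i > 0:
            alpha = [sum(alpha[s] * T[s][t] for s in range(k)) * E[t][obs[i]]
                     for t in range(k)]
        scale = sum(alpha)          # P(obs[i] | obs[0..i-1])
        loglik += math.log(scale)   # accumulate the log-likelihood
        alpha = [a / scale for a in alpha]
    return loglik

# Hypothetical three-state model and a sequence long enough that the
# unscaled probabilities would underflow double precision.
p = 0.1
T = [[1 - p, p, 0.0], [p / 2, 1 - p, p / 2], [0.0, p, 1 - p]]
E = [[0.6, 0.3, 0.1], [0.4, 0.3, 0.3], [0.3, 0.1, 0.6]]
init = [0.25, 0.5, 0.25]
obs = [0, 2, 2, 1, 0] * 400

log_version = forward_log(obs, T, E, init)
scaled_version = forward_scaled(obs, T, E, init)
print(log_version, scaled_version)
```

Both functions return the same log-likelihood up to rounding, but the log-space version calls exp and log for every pair of states at every site, while the rescaled version only needs one log per site. How much that matters in generated C++ code is exactly the comparison I would like to see.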

I would love to see a comparison of the runtime efficiency between the approaches, but just not quite enough to go and do it myself right now…

  • Lam, T., & Meyer, I. (2009). HMMCONVERTER 1.0: a toolbox for hidden Markov models Nucleic Acids Research DOI: 10.1093/nar/gkp662
  • Lunter, G. (2007). HMMoC – a compiler for hidden Markov models Bioinformatics, 23 (18), 2485-2487 DOI: 10.1093/bioinformatics/btm350
