Archive for the ‘Paper reviews’ Category

A Method for the Simultaneous Estimation of Selection Intensities in Overlapping Genes

Sunday, August 16th, 2009

I actually read this paper months ago, but I found a reference to it in my TODO list and just read it again…

A method for the simultaneous estimation of selection intensities in overlapping genes

Sabath, Landan and Graur. PLoS ONE

Abstract

Inferring the intensity of positive selection in protein-coding genes is important since it is used to shed light on the process of adaptation. Recently, it has been reported that overlapping genes, which are ubiquitous in all domains of life, seem to exhibit inordinate degrees of positive selection. Here, we present a new method for the simultaneous estimation of selection intensities in overlapping genes. We show that the appearance of positive selection is caused by assuming that selection operates independently on each gene in an overlapping pair, thereby ignoring the unique evolutionary constraints on overlapping coding regions. Our method uses an exact evolutionary model, thereby voiding the need for approximation or intensive computation. We test the method by simulating the evolution of overlapping genes of different types as well as under diverse evolutionary scenarios. Our results indicate that the independent estimation approach leads to the false appearance of positive selection even though the gene is in reality subject to negative selection. Finally, we use our method to estimate selection in two influenza A genes for which positive selection was previously inferred. We find no evidence for positive selection in both cases.

The topic is an interesting one, and a problem I worked on myself while I was in Oxford: analysing overlapping genes to identify selection.

Identifying selection

Identifying selection can be somewhat tricky.  Usually, we do the following:

  1. We assume that the mutation rate is the same for sites under selection as for sites that evolve neutrally.  This is probably a reasonable assumption, and in any case a necessary one since we rarely have any idea about the mutation rate but only the substitution rate.
  2. With that assumption in mind, we try to estimate the neutral substitution rate (which should be the same as the mutation rate) and then look at the rate of substitution on sites we suspect are under selection.  If the substitution rate is different from the neutral rate, then it must be caused by selection since we assume that the mutation rate is the same.
  3. The tricky part is figuring out the neutral substitution rate, since we don’t a priori know which sites are neutral.  So for protein coding genes we just assume that synonymous substitutions are neutral while non-synonymous could be under selection.  This is more of a dodgy assumption since we actually know it to be false.  Stuff like codon bias, for example, means that synonymous substitutions are also under selection, but we just hope that it doesn’t screw up the estimate of the neutral substitution rate too much.

The problem with overlapping genes

For overlapping genes — where there are different genes in different reading frames or on either strand, so the same nucleotides are part of more than one gene — this approach is somewhat problematic.

The problem is that synonymous substitutions in one gene can be non-synonymous in another gene.  So if selection is working on the other gene, you won’t get an accurate estimate of the synonymous (neutral) substitution rate in the first gene.  If you get the estimate of the neutral substitution rate wrong, and you compare this rate to the substitution rate of the sites you are interested in, you will tend to get false positives.  If you underestimate the neutral substitution rate, you can end up classifying neutrally evolving sites as under adaptive selection, while if you overestimate the neutral substitution rate, you will end up classifying neutrally evolving sites as under purifying selection.
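To make the frame problem concrete, here is a minimal sketch that classifies a single nucleotide change in two reading frames of the same sequence. The sequence and the function name `effect` are made up for illustration, and it only handles overlapping frames on the same strand; it is not the method from the paper.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def effect(seq, pos, new_base, frame):
    """Classify a point change at 0-based `pos` for the gene read in `frame` (0, 1 or 2).

    Returns "synonymous", "non-synonymous", or None when `pos` is not
    covered by a complete codon in that frame.
    """
    start = pos - ((pos - frame) % 3)  # first base of the codon covering pos
    if start < 0 or start + 3 > len(seq):
        return None
    old = seq[start:start + 3]
    offset = pos - start
    new = old[:offset] + new_base + old[offset + 1:]
    return "synonymous" if CODE[old] == CODE[new] else "non-synonymous"


seq = "ATGCTTGCA"                      # hypothetical toy sequence
print(effect(seq, 5, "C", frame=0))    # CTT -> CTC (Leu -> Leu)
print(effect(seq, 5, "C", frame=1))    # TTG -> TCG (Leu -> Ser)
```

The same T to C change is silent in frame 0 but changes an amino acid in frame 1, so counting it as "neutral" for the first gene ignores whatever selection acts on the second.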

To deal with this, you need to model all the overlapping genes in your substitution model, which typically means you have to deal with neighbour-dependencies in your substitution model.  This greatly complicates the model compared to models where you assume that each site (nucleotide or codon) evolves independently.  You typically have to “hack” it in various ways (which is what we have done in our work in Oxford) or you need to use sampling methods that can be very time consuming (but see here for a recent efficient approach to that).

The method in this paper falls into the “hack” category; it doesn’t model the full neighbour-dependency of sites but models the evolution of a “reference” codon taking into account flanking nucleotides in overlapping codons.

They define a codon substitution model this way, which can then be used to infer the substitution rate of each of the overlapping genes individually.

They then apply the method to influenza genes that were previously reported to be under positive selection in analyses that ignored the dependency between overlapping genes, and show that their method, which does take that dependency into account, finds no evidence for this selection.

An important result — assuming that the new model is correct — since it shows the danger of assuming independence between genes that clearly are not independent.


Sabath, N., Landan, G., & Graur, D. (2008). A Method for the Simultaneous Estimation of Selection Intensities in Overlapping Genes PLoS ONE, 3 (12) DOI: 10.1371/journal.pone.0003996

PopABC: a program to infer historical demographic parameters

Saturday, August 15th, 2009

Just saw this paper in Bioinformatics today:

PopABC: a program to infer historical demographic parameters

Lopes, Balding and Beaumont

Abstract

Summary: PopABC is a computer package for inferring the pattern of demographic divergence of closely related populations and species. The software performs coalescent simulation in the framework of approximate Bayesian computation (ABC). PopABC can also be used to perform Bayesian model choice to discriminate between different demographic scenarios. The program can be used either for research or for education and teaching purposes.

Availability and Implementation: Source code and binaries are freely available at http://www.reading.ac.uk/~sar05sal/software.htm. The program was implemented in C and can run on UNIX, MacOSX and Windows operating systems.

The paper is just an application note, so it doesn’t tell you much about the method, but the documentation on the software page is pretty good, so you can learn more from it there.

The problem addressed is the same as we address with our coalHMM approach: figuring out population genetic parameters in ancestral species and dating the speciation events.

While our approach is based on hidden Markov models and analysing entire genome alignments, theirs is based on ABC and analyses a set of loci assumed to be independent (free recombination between them so no LD).  In that way it is similar to the MCMC methods in IM and MIMAR.
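For readers who haven’t met ABC before, the basic rejection idea can be sketched in a few lines. This is a toy of my own and not PopABC’s implementation: the "data" are pairwise coalescence times at independent loci, the only parameter is an effective population size, and the prior range, tolerance and function names are all assumptions chosen for illustration.

```python
import random


def simulate_summary(n_e, n_loci, rng):
    """Mean pairwise coalescence time over independent loci (toy model).

    Each locus gets an exponential coalescence time with mean n_e.
    """
    return sum(rng.expovariate(1.0 / n_e) for _ in range(n_loci)) / n_loci


def abc_rejection(observed, n_loci, prior_range, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lands close to the observed one."""
    low, high = prior_range
    accepted = []
    for _ in range(n_draws):
        n_e = rng.uniform(low, high)                 # draw from the prior
        simulated = simulate_summary(n_e, n_loci, rng)
        if abs(simulated - observed) <= tolerance * observed:
            accepted.append(n_e)                     # approximate posterior sample
    return accepted


rng = random.Random(1)
observed = simulate_summary(1000.0, 50, rng)         # pretend this is the real data
posterior = abc_rejection(observed, 50, (100.0, 5000.0), 5000, 0.05, rng)
print(len(posterior), sum(posterior) / len(posterior))
```

The accepted draws approximate the posterior without ever evaluating a likelihood, which is the point of ABC; real applications use richer simulations and several summary statistics.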

Using sampling methods is probably necessary to analyse many sequences (many, in this case, means something like 5 and above), but I am not sure exactly how well the method scales to long sequences and many loci.  I did play around with the tool a bit, but I haven’t figured out the input file format well enough to simulate my own data for it, so I have only used the toy examples distributed with the software.

One other thing I haven’t figured out yet is if this model allows recombination within loci.  MIMAR does, IM does not, but I couldn’t find anything in the documentation for popABC about it.

If it doesn’t, that is possibly a bit of a weakness.  Assuming free recombination between loci and no recombination within loci just doesn’t feel right.  Also, with the simulations and calculations I did for our own model, I was really surprised by how short the segments that share an MRCA really are in ancestral species if Ne is reasonably large and the coalescent/recombination process is close to equilibrium at the time of the speciation event, so I guess (but don’t really know for sure) that this could put some constraints on the species you could model while ignoring recombination within loci.

For something like Drosophila, with pretty large Ne (compared to, say, apes, but small compared to bacteria) I estimated expected fragment lengths of shared MRCA between erecta and yakuba in the 20-50 range.  With something like that, you cannot pick loci of any reasonable length and assume no recombination within them.

Anyway, I will definitely look more at this approach.


Lopes, J., Balding, D., & Beaumont, M. (2009). PopABC: a program to infer historical demographic parameters Bioinformatics DOI: 10.1093/bioinformatics/btp487

Multiple Testing in Genome-Wide Association Studies via Hidden Markov Models

Monday, August 10th, 2009

I just read a new paper out in “advanced access” in Bioinformatics:

Multiple Testing in Genome-Wide Association Studies via Hidden Markov Models

Wei et al.

Abstract

Motivation: Genome wide association studies (GWAS) interrogate common genetic variation across the entire human genome in an unbiased manner and hold promise in identifying genetic variants with moderate or weak effect sizes. However, conventional testing procedures, which are mostly p-value based, ignore the dependency and therefore suffer from loss of efficiency. The goal of this article is to exploit the dependency information among adjacent SNPs to improve the screening efficiency in GWAS.

Results: We propose to model the linear block dependency in the SNP data using hidden Markov Models. A compound decision-theoretic framework for testing HMM-dependent hypotheses is developed. We propose a powerful data-driven procedure (PLIS) that controls the false discovery rate (FDR) at the nominal level. PLIS is shown to be optimal in the sense that it has the smallest false negative rate (FNR) among all valid FDR procedures. By re-ranking significance for all SNPs with dependency considered, PLIS gains higher power than conventional p-value based methods. Simulation results demonstrate that PLIS dominates conventional FDR procedures in detecting disease associated SNPs. Our method is applied to analysis of the SNP data from a GWAS of type 1 diabetes. Compared to the BH procedure, PLIS yields more accurate results and has better reproducibility of findings.

Conclusion: The genomic rankings based on the our procedure are substantially different from the rankings based on the p-values. By integrating information from adjacent locations, the PLIS rankings benefit from the increased signal to noise ratio, hence our procedure often has higher statistical power and better reproducibility. This provide a promising direction in large-scale GWAS.

Summary

The topic is multiple testing correction in genome wide association studies (GWAS), which is probably one of the most important issues in such studies.  With the very large number of tests – typically hundreds of thousands to a million – you need to correct your significance value to avoid drowning in false positives.

The false discovery rate (FDR) method is a way of doing this.  It essentially ranks the p-values and then picks the k smallest, where k is the largest rank for which the k-th smallest p-value is at most k/m times the desired level (with m the total number of tests).  Doing this in a GWAS ignores the dependency between tests caused by linkage disequilibrium, however, and this paper improves on this by taking the dependency into account.
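For reference, the standard Benjamini-Hochberg (BH) step-up procedure, which is the p-value based FDR method the paper compares against, can be sketched like this. The function name and the example p-values are mine.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of the hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value is below its own threshold.
        if p_values[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.035], alpha=0.05))
```

Note the "step-up" part: in the example the second smallest p-value (0.03) misses its own threshold (0.025), but it is still rejected because a larger rank passes.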

They do this by fitting the data to a hidden Markov model where the hidden states are associated/not-associated and the emissions are z-values (the null for not associated and a mixture of normals for associated).  From this they can get a posterior probability of association/non-association for each marker, conditional on the test statistics for all markers in the genome: P(\theta_i \,|\, \mathbf{z}) where \theta_i is the state at marker i (0 if the marker is not associated with the disease and 1 if it is) and \mathbf{z} are the test statistics for all markers.
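A minimal sketch of how such posteriors come out of a two-state HMM with the forward-backward algorithm is below. It is a simplification and not the paper’s implementation: I assume the parameters are already known rather than fitted, the null emission is a standard normal, and the associated state emits from a single normal with mean `alt_mean` instead of a mixture. All names and numbers are made up.

```python
import math


def normal_pdf(x, mean, sd=1.0):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_null(z, trans, init, alt_mean):
    """P(theta_i = 0 | z) for every marker, by scaled forward-backward.

    trans[r][s] is the probability of moving from state r to state s,
    init[s] the probability of starting in state s (0 = null, 1 = associated).
    """
    n = len(z)
    emit = [(normal_pdf(x, 0.0), normal_pdf(x, alt_mean)) for x in z]

    # Forward pass, normalised at every position to avoid underflow.
    fwd = []
    prev = None
    for i in range(n):
        if i == 0:
            a = [init[s] * emit[0][s] for s in (0, 1)]
        else:
            a = [emit[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                 for s in (0, 1)]
        total = sum(a)
        prev = [v / total for v in a]
        fwd.append(prev)

    # Backward pass, also normalised at every position.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        b = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
             for s in (0, 1)]
        total = sum(b)
        bwd[i] = [v / total for v in b]

    # Combine; the per-position scaling cancels when normalising over states.
    post = []
    for f, b in zip(fwd, bwd):
        w0, w1 = f[0] * b[0], f[1] * b[1]
        post.append(w0 / (w0 + w1))
    return post


z = [0.1, -0.3, 3.5, 4.0, 0.2]                    # hypothetical z-values
trans = [[0.95, 0.05], [0.20, 0.80]]              # hypothetical transitions
print(posterior_null(z, trans, init=[0.9, 0.1], alt_mean=3.0))
```

The two large z-values sit next to each other, so each one makes the other more believable; that borrowing of strength from neighbours is what the plain p-values cannot do.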

Now they consider all the P(\theta_i=0 \,|\, \mathbf{z}), that is all the posterior probabilities of not being associated, order them from smallest to largest, and pick markers for as long as the running average of those posterior probabilities stays below the significance threshold.
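The selection step itself is short. The sketch below follows my reading of the procedure, where it is the running average of the ordered null posteriors that gets compared with the threshold; the function name and the numbers are mine.

```python
def lis_step_up(null_posteriors, alpha=0.05):
    """Pick markers by ordered posterior null probabilities.

    Markers are taken in order of increasing P(theta_i = 0 | z) for as long
    as the running average of those probabilities stays at or below alpha.
    Returns the indices of the selected markers.
    """
    order = sorted(range(len(null_posteriors)), key=lambda i: null_posteriors[i])
    running_sum = 0.0
    k = 0
    for rank, i in enumerate(order, start=1):
        running_sum += null_posteriors[i]
        if running_sum / rank <= alpha:
            k = rank
    return sorted(order[:k])


print(lis_step_up([0.9, 0.01, 0.02, 0.5, 0.08], alpha=0.05))
```

In the example the marker with null posterior 0.08 is selected even though 0.08 is above the threshold, because the average over the selected set (about 0.037) is what is being controlled.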

They say that this approach 1) guarantees that the false discovery rate is below the threshold and 2) is optimal in the sense that it is the method with that guarantee that has the fewest false negatives, but they refer to an appendix for the proof of that, and that appendix is not in the paper, so I cannot really check that.  I would have loved to, though, since I want to know which assumptions about the data underlie this proof, but no matter…

Anyway, on with the summary.

They now validate the method with two simulation setups: one based on data simulated by a hidden Markov model, so matching the inference method, and one with more realistic data.  For the first simulation study they show that the FDR guarantees are met and that the new method is more sensitive than those they compare it with.  For the second simulation study they essentially only show that it ranks true associations better than plain p-values.

They apply the method to a real data set and show that they are more successful in ranking markers that can be replicated in a replication cohort, again compared to plain p-values.

The good

First of all, I think it is an important problem to attack.  Doing so while taking the correlation between markers (and through that the correlation between their test statistics) into account is definitely the way to go.

Using hidden Markov models is also very sensible.  They are computationally efficient, usually easy to extend in various ways, and well founded in statistics so results are (relatively) easy to relate to.

The bad

I do have some problems with the method, though.  First some minor issues.

If the method really does compute the posterior probability of a marker being associated with the disease, then you would expect to consider all markers i where P(\theta_i=1\,|\,\mathbf{z})>P(\theta_i=0\,|\,\mathbf{z}) since those are the markers that are more likely to be associated than not associated!  Picking only some of them means that for the rest of them you are essentially betting on a hypothesis less likely than the one you reject.

The issue here is, of course, that the value computed is in fact not the posterior probability of being associated.  The prior probability of association versus non-association is probably not taken into account.  If it is, I couldn’t find any mention of it, at least.

If you include this prior belief you could just include the prior odds in your test and you would have a different approach to judging significance.  In practice it probably doesn’t matter much, so it is more an objection of aesthetics.

The ugly

Ok, now we come to the part about the paper I really didn’t like.  The simulation studies used to validate the method.

The first simulation study, I just can’t put much trust in. Not that I think there is anything dodgy in the results reported, but the simulations are from hidden Markov models matching the inference method, and there is no way that real data is generated that way.  There is nothing wrong with modelling the data as hidden Markov models – even if it is not generated by a process that even remotely resembles it – since it is just an analysis strategy anyway, but the simulation validation based on an unrealistic assumption is not particularly convincing…

The second simulation study is a more convincing setup, since here it is real LD data and a more realistic disease model.  However, here the FDR is not reported, only the sensitivity of seeing a marker in LD with a causal marker in top K of the ranking.  There is a better ranking with this approach than there is with just the p-values of the individual tests, but that says nothing about what the false discovery rate is.  So based on the results presented, I have no way of knowing what FDR to expect on realistic data (or how often I get a real hit below the FDR threshold for that matter).

I think this is a major problem with the validation of the method.  It is really only validated on data that is unlikely to resemble real GWAS data.  At least, the part of the method that has to do with FDR – the ranking results are okay.

Summary

Ok, I don’t want to end up sounding all negative.  It just looks that way since I ordered the criticism good -> bad -> ugly.

I stand by the criticism – I do think there are some problems with the validation – but all in all I like the method and I will definitely keep it in mind for my own future work.  I will look at that, as soon as they put up the source on CRAN (the paper just says that it will be made available there, but not when).

The main problem I have with the validation is really only the claims about false discovery rate.  Ok, since that is what the method is supposed to handle, that is a major problem, but as a method for ranking markers it looks pretty good.  Taking neighbouring markers into account in the analysis is what does this, I think, and is what we have also observed with our methods.

…and, you know, if you have a good ranking, maybe the false discovery rate isn’t all that important!  We only trust markers we validate in a replication data set anyway, and we will probably try to validate “top k” rather than all markers below a certain false discovery rate.  So in practical terms, the ranking is probably much more important than the multiple test correction.

Not so in the replication, of course, there you have to be strict about significance, but I’m not sure we need to be that strict in the initial discovery data set, as long as we don’t try to replicate thousands and thousands of markers, and we are probably not going to do that anyway.

I started my review by saying that correcting for multiple testing is very important in GWAS, and it is, and it is very worthwhile to develop methods for it, but improving the ranking of markers is a problem that is even more important.

Wei, Z., Sun, W., Wang, K., & Hakonarson, H. (2009). Multiple Testing in Genome-Wide Association Studies via Hidden Markov Models Bioinformatics DOI: 10.1093/bioinformatics/btp476

Ancestral Population Genomics: The Coalescent Hidden Markov Model Approach

Tuesday, July 7th, 2009

We just got a new paper out – in “Advanced Access” at least – on coalescent hidden Markov models:

Ancestral population genomics: the coalescent hidden Markov model approach

J. Dutheil et al.

Genetics

Abstract

With incomplete lineage sorting (ILS), the genealogy of closely related species differs along their genomes. The amount of ILS depends on population parameters such as the ancestral effective population sizes and the recombination rate, but also on the number of generations between speciation events. We use a hidden Markov model parametrized according to coalescent theory in order to infer the genealogy along a four-species genome alignment of closely related species, and estimate population parameters. We analyze a basic, panmictic demographic model and study its properties using an extensive set of coalescent simulations. We assess the effect of the model assumptions, and demonstrate that the Markov property provides a good approximation to the ancestral recombination graph. Using a too restricted set of possible genealogies, necessary to reduce the computational load, can bias parameter estimates. We propose a simple correction for this bias, and suggest directions for future extensions of the model. We show that the patterns of ILS along a sequence alignment can be recovered efficiently together with the ancestral recombination rate. Finally, we introduce an extension of the basic model that allows for mutation rate heterogeneity, and reanalyze Human-Chimpanzee-Gorilla-Orangutan alignments using the new models. We expect that this framework will prove useful for population genomics and provide exciting insights into genome evolution.

This paper has been a long time underway.  Pretty much Julien’s entire post doc, actually, but there are some upcoming application papers that still make all this work worthwhile.

There are two main results in the paper.

First, a new parameterisation of the hidden Markov model that directly parameterises the HMM in terms of population genetic parameters such as effective population size and recombination rate.  This is mainly the work of Ganesh and Marcy, who we collaborated with on this paper.  In our 2007 paper, we parameterised the model just like any hidden Markov model but then extracted population genetics parameters from estimated transition and emission probabilities; in this paper we can do maximum likelihood parameter estimation directly from the coalescent process.

Second, we have a much more detailed simulation validation of the model.  From extensive simulations we have validated the model and discovered its strengths and weaknesses.  Of the latter, the most important are various biases in parameter estimates.  We discovered some systematic biases in estimates of speciation time and, especially, recombination rate.  The latter we didn’t even consider in the first paper, but the bias in the former probably means that the speciation time estimate of human and chimp in our 2007 paper was biased and somewhat more recent than the real speciation time.

Julien came up with a simulation approach to alleviate the biases, and although this approach is somewhat time consuming it does seem to improve the estimates.

While the paper was in review, we identified some of the sources of the biases, and we now have a model that looks much less biased than the one in the paper.  It doesn’t completely remove the bias on the recombination rate, but is much better at estimating the other parameters. It is based on the continuous time Markov models I have described here and here, but results are still somewhat preliminary and the model can only deal with two genomes and not incomplete lineage sorting, so it is still a long way from handling data like that in this new Genetics paper.

We have a draft of a paper describing the new method, and some results for the orangutan genome project that will probably be out later this year or early next year, so I will not go into details about it here.  There are still a lot of details to work out on that model before we know exactly how well it performs compared to the old one.

Anyway, the work we did on this paper told us a lot about the coalescent hidden Markov model approach.  Mainly good stuff at that, the biases aside.  It is a very fast method (fast enough to analyze full genomes) and is pretty good at estimating speciation times.  Estimating those is somewhat problematic when the average genomic divergence time varies significantly from the speciation time due to large effective population sizes, so the new model should be much better at it than plain old “molecular clock” estimates.

  • Dutheil, J., Ganapathy, G., Hobolth, A., Mailund, T., Uyenoyama, M., & Schierup, M. (2009). Ancestral Population Genomics: The Coalescent Hidden Markov Model Approach Genetics DOI: 10.1534/genetics.109.103010
  • Hobolth, A., Christensen, O., Mailund, T., & Schierup, M. (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model PLoS Genetics, 3 (2) DOI: 10.1371/journal.pgen.0030007


Widespread genomic signatures of natural selection in hominid evolution

Tuesday, May 12th, 2009

Friday last week, PLoS Genetics published a paper I’ve been waiting to read for a few weeks, since I saw a reference to it in a draft of a review paper I got by email (that paper I’ll tell you all about when it comes out).

The PLoS Genetics paper is this:

Widespread Genomic Signatures of Natural Selection in Hominid Evolution

Graham McVicker, David Gordon, Colleen Davis, and Phil Green

Selection acting on genomic functional elements can be detected by its indirect effects on population diversity at linked neutral sites. To illuminate the selective forces that shaped hominid evolution, we analyzed the genomic distributions of human polymorphisms and sequence differences among five primate species relative to the locations of conserved sequence features. Neutral sequence diversity in human and ancestral hominid populations is substantially reduced near such features, resulting in a surprisingly large genome average diversity reduction due to selection of 19–26% on the autosomes and 12–40% on the X chromosome. The overall trends are broadly consistent with “background selection” or hitchhiking in ancestral populations acting to remove deleterious variants. Average selection is much stronger on exonic (both protein-coding and untranslated) conserved features than non-exonic features. Long term selection, rather than complex speciation scenarios, explains the large intragenomic variation in human/chimpanzee divergence. Our analyses reveal a dominant role for selection in shaping genomic diversity and divergence patterns, clarify hominid evolution, and provide a baseline for investigating specific selective events.

The reason I’ve been waiting for the paper is that it concerns something I am very interested in myself, and something we are working on in our CoalHMM group here at BiRC: detecting selection by detecting variation in effective population size along the genome.

Effective population size

Okay, the concept “effective population size” is a strange beast.  It doesn’t really have anything to do with population size, except in an idealised mathematical model, but is a single parameter that incorporates various different measures such as demographics and selection.

There’s a nice introduction to it in this John Hawks post: Did humans face extinction 70,000 years ago?

As described there, one way of looking at the effective population size is to define it from the average coalescence time of two random individuals in a population.  If we look at it that way, it is clear that selection will affect the effective population size.
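To see what "defining it from the coalescence time" means in the idealised case, here is a small simulation sketch. It assumes a haploid Wright-Fisher population of constant size N with no selection, where two lineages pick their parents uniformly at random each generation; in that model the mean pairwise coalescence time is N generations (2N for diploids). The function names are mine.

```python
import random


def pairwise_coalescence_time(n, rng):
    """Generations back until two lineages pick the same parent.

    Haploid Wright-Fisher population of constant size n, no selection.
    """
    t = 1
    while rng.randrange(n) != rng.randrange(n):
        t += 1
    return t


def mean_coalescence_time(n, replicates, rng):
    return sum(pairwise_coalescence_time(n, rng) for _ in range(replicates)) / replicates


rng = random.Random(42)
print(mean_coalescence_time(50, 5000, rng))   # should land close to 50
```

Turning this around, the effective population size of a real population is the N for which the idealised model gives the mean coalescence time you actually observe, so anything that shortens coalescence times lowers it.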

A site under selection, if it gets fixed, will do so much faster than a site that is neutral.  A neutral site that gets fixed does so (on average) in time linear in the effective population size, while a site under selection does so in logarithmic time (regardless of whether it is positive or negative selection, surprisingly, but of course if it is negative selection the probability of it getting fixed is smaller).

If we consider a site where mutations occur that are selected against, but these are not fixed, we still see a reduction in the coalescence time of two random individuals but for a different reason: those ancestors that were selected against do not have descendants in the present population, so the number of possible ancestors of two random individuals is smaller and when we trace their ancestry back in time, they will find a common ancestor faster.

So in any case, if a site is under selection, we expect the mean time back to a common ancestor — the effective population size — to be reduced.

To muddy the waters a little bit: effective population size also affects selection since selection is stronger if the population size is large but that is a complication best left for another day…

Recombination

Recombination has an effect on this as well.

A site under selection will have a smaller effective population size, but so will nearby sites.  The reason for this is that neighbouring nucleotides are likely to have the same most recent common ancestor, and thus the same divergence, with this probability depending on the recombination distance between them.

Consequently, we expect the effective population size to decrease as we move towards a site under selection, and increase again as we move away from it.

It is this kind of pattern that McVicker et al. analyse in this paper.

Results

First they identify conserved genomic regions.  These are the regions that are probably under selection, since selection is one of the forces that will conserve sequences.

They do this by running a phyloHMM on an alignment of mammals (excluding those they will analyse later on to avoid biasing the results).

They then split the genome into two classes: those nucleotides within the 10% of the genome closest to a conserved region, and the 50% furthest away.  In these two classes they look at the level of polymorphism in humans, the divergence between human and chimp, and the number of informative sites supporting a grouping of human with gorilla — with chimp as an outgroup — and those grouping chimp with gorilla — with human as an outgroup.  The latter are signs of deep coalescence resulting in incomplete lineage sorting, and signs of a large effective population size in the human/chimp ancestor.

For all measures, they find that the effective population size seems to be reduced for the 10% closest to conserved regions compared to the 50% farthest away.

Since the measures are essentially all just measures of conservation, really, that isn’t in itself much of an argument.  All it says is that there is a correlation of conservation-ness along the genome.  To compensate for this, they then normalise with the divergence to macaque and to dog.  If it is just a reduction in substitution rate that is correlated, then normalising this way — assuming that the substitution rate doesn’t change dramatically along the genome and along the phylogeny — will alleviate the effect from just the substitution rate.
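As I read it, the normalisation amounts to a ratio of ratios: the near/far ratio of the statistic of interest divided by the near/far ratio of the outgroup divergence. A minimal sketch, with function names and numbers that are made up for illustration and not taken from the paper:

```python
def near_far_ratio(near, far):
    """Ratio of a statistic in the sites nearest to vs. farthest from conserved regions."""
    return near / far


def normalised_ratio(stat_near, stat_far, outgroup_near, outgroup_far):
    """Near/far ratio of a statistic, divided by the same ratio for outgroup divergence."""
    return near_far_ratio(stat_near, stat_far) / near_far_ratio(outgroup_near, outgroup_far)


# Hypothetical values: human diversity near/far, human/macaque divergence near/far.
print(near_far_ratio(0.0008, 0.0010))
print(normalised_ratio(0.0008, 0.0010, 0.060, 0.062))
```

If the reduction near conserved regions were purely a lower substitution rate, the outgroup divergence would be reduced by the same factor and the normalised ratio would come out close to 1; a value clearly below 1 is what is left to explain.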

After normalising, the signal is still there: the polymorphism and divergence are still reduced close to conserved regions.

Again, this doesn’t prove that selection is the cause of this pattern, but the pattern certainly matches what we would expect to see if it was selection that caused it.  The normalisation should eliminate, or at least reduce, effects that are just caused by the substitution rate, so unless we invoke some more exotic explanation for conservation and the patterns along the genome, selection is a valid conclusion.

(A) Ratios calculated using the 10% of neutral sites which are nearest to and the 50% of neutral sites farthest away from conserved segments or exons. (B) The same ratios as (A) but normalized by human/macaque (H/M) divergence to account for mutation rate variation or undetected sites under purifying selection. The distance to the nearest conserved segment or exon was determined using four different measures: physical distance, pedigree-based recombination distance [26], polymorphism-based finescale recombination distance [25] and the background selection parameter, B. B (described in the main text) is not technically a distance measure but incorporates information about the recombination rate and local density of conserved segments. Autosomal human nucleotide diversity was calculated from gene-centric SeattleSNPs PGA/EGP [20], whole-genome Perlegen [19] data, and HapMap phase II data [67]. Divergence was estimated using autosomal human/chimp (H/C), human/macaque (H/M), or human/dog (H/D) genome sequence data. HG and CG sites (where human and gorilla or chimp and gorilla share a nucleotide that differs from the other three species) were calculated using a smaller set of 5-species autosomal data. Repetitive regions were omitted from the Perlegen and HapMap analyses; additional filtering steps are described in the methods. Whiskers are 95% confidence intervals.

Now that selection is concluded to be a plausible explanation for the pattern, they fit the data to a model that explains the variation by background selection. This model shows that selection is stronger near conserved regions than farther away, consistent with the assumption that the pattern is caused by selection.

Consequences

So what does all this tell us?

For one thing, it tells us that selection is a force we really should keep in mind when analysing genomes.  Yes, yes, we probably already knew that, but the neutrality assumption is so strong in genome analysis that we rarely consider non-neutrality except for the obligatory dN/dS tests on genes.  For anything that is not a gene, we usually analyse the sequences assuming neutrality.  It is a good null model, but completely ignoring selection when analysing genomic sequences should be reconsidered.

I know, I am exaggerating a bit here, ’cause people are not just blindly assuming neutrality, but it is a strong null assumption and we really do not like to invoke selection unless there is strong evidence against neutrality.

Another consequence is for sequence divergence.

We estimate species divergence (time of speciation events) from sequence divergence.  More often than not we equate sequence divergence with species divergence, but really we shouldn’t.  Even under neutrality this isn’t true, since the coalescence process of sequences is such that the sequences are further apart than the species, but under neutrality at least this pattern is random along the genome.

There is still some correlation along the sequence of divergence time, under a neutral coalescence model, but at least this correlation drops off rapidly with (recombination) distance and it is not correlated with other genomic features (except in the sense that the substitution rate depends on these features).

With selection working its magic on a genome scale, the patterns of sequence divergence get a lot more interesting.

All of this is not really a new insight.  People working with e.g. Drosophila have known this for decades, but it has been ignored in more papers than I care to mention, and perhaps it is time we stop doing this.


McVicker, G., Gordon, D., Davis, C., & Green, P. (2009). Widespread Genomic Signatures of Natural Selection in Hominid Evolution PLoS Genetics, 5 (5) DOI: 10.1371/journal.pgen.1000471
