I actually read this paper months ago, but I found a reference to it in my TODO list and just read it again...
Sabath, Landan and Graur. PLoS ONE
Inferring the intensity of positive selection in protein-coding genes is important since it is used to shed light on the process of adaptation. Recently, it has been reported that overlapping genes, which are ubiquitous in all domains of life, seem to exhibit inordinate degrees of positive selection. Here, we present a new method for the simultaneous estimation of selection intensities in overlapping genes. We show that the appearance of positive selection is caused by assuming that selection operates independently on each gene in an overlapping pair, thereby ignoring the unique evolutionary constraints on overlapping coding regions. Our method uses an exact evolutionary model, thereby voiding the need for approximation or intensive computation. We test the method by simulating the evolution of overlapping genes of different types as well as under diverse evolutionary scenarios. Our results indicate that the independent estimation approach leads to the false appearance of positive selection even though the gene is in reality subject to negative selection. Finally, we use our method to estimate selection in two influenza A genes for which positive selection was previously inferred. We find no evidence for positive selection in both cases.
The topic is an interesting one, and a problem I worked on myself while I was in Oxford: analysing overlapping genes to identify selection.
Identifying selection can be somewhat tricky. Usually, we do the following:
- We assume that the mutation rate is the same for sites under selection as for sites that evolve neutrally. This is probably a reasonable assumption, and in any case a necessary one, since we can rarely measure the mutation rate directly; what we observe is the substitution rate.
- With that assumption in mind, we try to estimate the neutral substitution rate (which should be the same as the mutation rate) and then look at the rate of substitution at sites we suspect are under selection. If the substitution rate there differs from the neutral rate, the difference must be caused by selection, since we assume that the mutation rate is the same.
- The tricky part is figuring out the neutral substitution rate, since we don't a priori know which sites are neutral. So for protein coding genes we just assume that synonymous substitutions are neutral while non-synonymous could be under selection. This is more of a dodgy assumption since we actually know it to be false. Stuff like codon bias, for example, means that synonymous substitutions are also under selection, but we just hope that it doesn't screw up the estimate of the neutral substitution rate too much.
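To make the synonymous versus non-synonymous distinction concrete, here is a minimal Python sketch that classifies codon differences between two aligned coding sequences. The function name `classify_differences` and the example sequences are made up for illustration. It is only a crude count: real estimators also normalise by the number of synonymous and non-synonymous sites and correct for multiple substitutions at the same codon, none of which is attempted here.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of bases.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify_differences(seq_a, seq_b):
    """Count codons that differ at exactly one nucleotide, split into
    synonymous and non-synonymous differences.

    Illustrative only: codons that differ at two or three positions are
    skipped, and no per-site normalisation is done.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue  # identical codons, or multiple hits (ignored here)
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Hypothetical example: GCT -> GCC keeps alanine, AAA -> AGA changes K to R.
syn_count, nonsyn_count = classify_differences("ATGGCTAAA", "ATGGCCAGA")
```

Under the assumptions in the list above, the synonymous count stands in for the neutral rate and the non-synonymous count is what gets compared against it.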
The problem with overlapping genes
For overlapping genes -- where there are different genes in different reading frames or on either strand, so the same nucleotides are part of more than one gene -- this approach is somewhat problematic.
The problem is that synonymous substitutions in one gene can be non-synonymous in another gene. So if selection is working on the other gene, you won't get an accurate estimate of the synonymous (neutral) substitution rate in the first gene. If you get the estimate of the neutral substitution rate wrong, and you compare this rate to the substitution rate of the sites you are interested in, you will tend to get false positives. If you underestimate the neutral substitution rate, you can end up classifying neutrally evolving sites as under adaptive selection, while if you overestimate the neutral substitution rate, you will end up classifying neutrally evolving sites as under purifying selection.
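A small Python sketch shows the effect. The seven-nucleotide sequence below is invented for illustration, with one gene read in frame 0 and a second, hypothetical gene read on the same strand in frame +1. A single change at the third position of the first codon is silent in the first gene but changes an amino acid in the second.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of bases.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame):
    """Translate every complete codon of seq, starting at offset frame."""
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3)
    )


# Made-up overlap region: gene 1 is read in frame 0, gene 2 in frame +1.
original = "GCTGAAG"
mutant = "GCCGAAG"  # T -> C at index 2, the third position of gene 1's codon

gene1_before, gene1_after = translate(original, 0), translate(mutant, 0)
gene2_before, gene2_after = translate(original, 1), translate(mutant, 1)

print("gene 1:", gene1_before, "->", gene1_after)  # gene 1: AE -> AE
print("gene 2:", gene2_before, "->", gene2_after)  # gene 2: LK -> PK
```

If the second gene is under purifying selection, changes like this one are removed, so the "synonymous" rate measured in the first gene is lower than the true neutral rate.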
To deal with this, you need to model all the overlapping genes in your substitution model, which typically means you have to deal with neighbour-dependencies between sites. This greatly complicates the model compared to models where you assume that each site (nucleotide or codon) evolves independently. You typically have to "hack" it in various ways (which is what we did in our work in Oxford) or you need to use sampling methods that can be very time-consuming (but see here for a recent efficient approach to that).
The method in this paper falls into the "hack" category; it doesn't model the full neighbour-dependency of sites but models the evolution of a "reference" codon taking into account flanking nucleotides in overlapping codons.
This gives a codon substitution model that can then be used to infer the substitution rate of each of the overlapping genes individually.
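The following Python sketch illustrates the general idea of looking at a reference codon together with its flanking nucleotides. It is not the paper's actual model: the seven-nucleotide window layout, the `shift` parameter and the function name `classify_changes` are assumptions made for this example, and it only handles two genes on the same strand. It enumerates the nine possible single-nucleotide changes in the reference codon and records whether each is synonymous or non-synonymous in each gene.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of bases.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

REF_START = 2  # assumed layout: 2 flanking nt, reference codon, 2 flanking nt


def classify_changes(window, shift):
    """Classify all single-nucleotide changes in the reference codon by
    their effect in gene 1 (the reference frame) and in gene 2, whose
    codons start `shift` (1 or 2) nucleotides after gene 1's.
    """
    if len(window) != 7 or shift not in (1, 2):
        raise ValueError("expected a 7-nt window and a shift of 1 or 2")
    counts = Counter()
    ref = slice(REF_START, REF_START + 3)
    for pos in range(REF_START, REF_START + 3):
        # Start of the gene 2 codon that contains this position.
        start2 = pos - ((pos - (REF_START + shift)) % 3)
        other = slice(start2, start2 + 3)
        for base in BASES:
            if base == window[pos]:
                continue
            mutant = window[:pos] + base + window[pos + 1:]
            syn1 = CODON_TABLE[window[ref]] == CODON_TABLE[mutant[ref]]
            syn2 = CODON_TABLE[window[other]] == CODON_TABLE[mutant[other]]
            key = ("syn" if syn1 else "nonsyn", "syn" if syn2 else "nonsyn")
            counts[key] += 1
    return counts


# Made-up window: reference codon GCT flanked by AA and GA, gene 2 in frame +1.
example_counts = classify_changes("AAGCTGA", 1)
```

For this particular window, none of the nine changes is synonymous in both genes, so there is nothing in it that could serve as a neutral reference if the two genes are analysed independently.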
They then apply the method to influenza A genes that had previously been reported to be under positive selection when the dependency between overlapping genes was not taken into account. With their method, which does take the dependency into account, there is no evidence for this selection.
An important result -- assuming that the new model is correct -- since it shows the danger of assuming independence between genes that clearly are not independent.
Sabath, N., Landan, G., & Graur, D. (2008). A Method for the Simultaneous Estimation of Selection Intensities in Overlapping Genes. PLoS ONE, 3(12). DOI: 10.1371/journal.pone.0003996