Posts Tagged ‘virus’

A Method for the Simultaneous Estimation of Selection Intensities in Overlapping Genes

Sunday, August 16th, 2009

I actually read this paper months ago, but I found a reference to it in my TODO list and just read it again…

A method for the simultaneous estimation of selection intensities in overlapping genes

Sabath, Landan and Graur. PLoS ONE

Abstract

Inferring the intensity of positive selection in protein-coding genes is important since it is used to shed light on the process of adaptation. Recently, it has been reported that overlapping genes, which are ubiquitous in all domains of life, seem to exhibit inordinate degrees of positive selection. Here, we present a new method for the simultaneous estimation of selection intensities in overlapping genes. We show that the appearance of positive selection is caused by assuming that selection operates independently on each gene in an overlapping pair, thereby ignoring the unique evolutionary constraints on overlapping coding regions. Our method uses an exact evolutionary model, thereby voiding the need for approximation or intensive computation. We test the method by simulating the evolution of overlapping genes of different types as well as under diverse evolutionary scenarios. Our results indicate that the independent estimation approach leads to the false appearance of positive selection even though the gene is in reality subject to negative selection. Finally, we use our method to estimate selection in two influenza A genes for which positive selection was previously inferred. We find no evidence for positive selection in both cases.

The topic is an interesting one, and a problem I worked on myself while I was in Oxford: analysing overlapping genes to identify selection.

Identifying selection

Identifying selection can be somewhat tricky.  Usually, we do the following:

  1. We assume that the mutation rate is the same for sites under selection as for sites that evolve neutrally.  This is probably a reasonable assumption, and in any case a necessary one, since we can rarely observe the mutation rate itself, only the substitution rate.
  2. With that assumption in mind, we try to estimate the neutral substitution rate (which should be the same as the mutation rate) and then look at the rate of substitution on sites we suspect are under selection.  If the substitution rate differs from the neutral rate, the difference must be caused by selection, since we assume that the mutation rate is the same.
  3. The tricky part is figuring out the neutral substitution rate, since we don’t a priori know which sites are neutral.  So for protein coding genes we just assume that synonymous substitutions are neutral while non-synonymous could be under selection.  This is more of a dodgy assumption since we actually know it to be false.  Stuff like codon bias, for example, means that synonymous substitutions are also under selection, but we just hope that it doesn’t screw up the estimate of the neutral substitution rate too much.
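
To make the synonymous versus non-synonymous distinction concrete, here is a minimal Python sketch that counts the two kinds of differences between two aligned coding sequences. It is only an illustration, not the method from the paper: the function name is my own, it assumes the standard genetic code and gap-free sequences in the same reading frame, and it skips codons that differ at more than one position. A real dN/dS estimate would also normalise by the number of synonymous and non-synonymous sites, which this does not do.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_differences(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of single-nucleotide codon differences.

    Assumes both sequences are gap-free, upper-case DNA in the same frame.
    Codons that are identical, or differ at more than one position, are skipped.
    """
    syn = nonsyn = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if sum(x != y for x, y in zip(codon_a, codon_b)) != 1:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# TTT -> TTC keeps Phe (synonymous); AAA -> AGA changes Lys to Arg (non-synonymous).
print(count_differences("TTTGCTAAA", "TTCGCTAGA"))  # (1, 1)
```

The raw counts are what the rate comparison in step 2 starts from; everything that makes the real estimators hard (site counting, multiple hits, codon bias) is left out here.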

The problem with overlapping genes

For overlapping genes — where there are different genes in different reading frames or on either strand, so the same nucleotides are part of more than one gene — this approach is somewhat problematic.

The problem is that synonymous substitutions in one gene can be non-synonymous in another gene.  So if selection is working on the other gene, you won’t get an accurate estimate of the synonymous (neutral) substitution rate in the first gene.  If you get the estimate of the neutral substitution rate wrong, and you compare this rate to the substitution rate of the sites you are interested in, you will tend to get false positives.  If you underestimate the neutral substitution rate, you can end up classifying neutrally evolving sites as under adaptive selection, while if you overestimate the neutral substitution rate, you will end up classifying neutrally evolving sites as under purifying selection.
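
A tiny Python sketch of that first point: the same nucleotide change is classified once per reading frame, and the two answers disagree. The sequence, the function name and the choice of two same-strand frames offset by one are all made up for illustration; the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify(seq, pos, new_base, frame):
    """Classify a substitution at 0-based position `pos` for the gene read in `frame` (0, 1 or 2)."""
    start = pos - ((pos - frame) % 3)  # start of the codon covering pos in this frame
    if start < 0 or start + 3 > len(seq):
        raise ValueError("position is not covered by a complete codon in this frame")
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    old_codon, new_codon = seq[start:start + 3], mutated[start:start + 3]
    same_amino_acid = CODON_TABLE[old_codon] == CODON_TABLE[new_codon]
    return "synonymous" if same_amino_acid else "non-synonymous"


seq = "ATGCTTGAA"
# Frame 0 reads ATG CTT GAA: CTT -> CTC keeps Leu.
print(classify(seq, 5, "C", frame=0))  # synonymous
# Frame 1 reads TGC TTG ...: TTG -> TCG changes Leu to Ser.
print(classify(seq, 5, "C", frame=1))  # non-synonymous
```

So a substitution that looks free in the first gene is visible to selection through the second, which is exactly why the "synonymous" rate in an overlapping region is not a clean estimate of the neutral rate.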

To deal with this, you need to model all the overlapping genes in your substitution model, which typically means you have to deal with neighbour-dependencies in your substitution model.  This greatly complicates the model compared to models where you assume that each site (nucleotide or codon) evolves independently.  You typically have to “hack” it in various ways (which is what we have done in our work in Oxford) or you need to use sampling methods that can be very time consuming (but see here for a recent efficient approach to that).

The method in this paper falls into the “hack” category; it doesn’t model the full neighbour-dependency of sites but models the evolution of a “reference” codon taking into account flanking nucleotides in overlapping codons.

The codon substitution model they define this way can then be used to infer the substitution rates of the overlapping genes individually.

They then apply the method to influenza genes that were previously reported to be under positive selection in analyses that ignored the dependency between overlapping genes, and show that their method, which does take the dependency into account, finds no evidence for this selection.

An important result — assuming that the new model is correct — since it shows the danger of assuming independence between genes that clearly are not independent.


Sabath, N., Landan, G., & Graur, D. (2008). A Method for the Simultaneous Estimation of Selection Intensities in Overlapping Genes. PLoS ONE, 3(12). DOI: 10.1371/journal.pone.0003996

Statistical alignment and virus selection paper now online

Monday, July 21st, 2008

The paper I described in a previous post, Investigating selection on viruses: a statistical alignment approach, was published online today.  Yay us!

Investigating Selection on Viruses: A Statistical Alignment Approach

Wednesday, June 18th, 2008

Woohoo, we just got a paper accepted. Although it is at BMC Bioinformatics, it isn’t one of the papers I’ve been bitching about — this one we got very helpful reviews on.

It is work from when I was in Oxford. Saskia de Groot analysed virus genomes for her PhD (see papers here and here), but for viruses that have diverged relatively far, getting good alignments is a bit of a problem, so I suggested we take a statistical alignment approach to integrate over the uncertainty. So we got together with Gerton Lunter, who works on this, and came up with this:

Investigating Selection on Viruses: A Statistical Alignment Approach

S. de Groot, T. Mailund, G.A. Lunter and J. Hein

To appear in BMC Bioinformatics

Abstract

Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.

I’ll add a link to the paper as soon as it is up at the journal.

What’s the problem?

We were trying to figure out selection in viruses where genes can have overlapping reading frames. In such cases, figuring out the neutral substitution rate is a bit of a problem, ’cause a synonymous substitution in one gene can be a non-synonymous substitution in an overlapping gene. Using dN/dS to figure out selection won’t work.

Instead we took and extended a method by Hein and Støvlbæk to explicitly model substitutions with selection in overlapping reading frames. We ought to consider the neighbour-dependent substitutions you get when you are modelling codon changes (which again is complicated by overlapping genes), but methods for that can be very slow and won’t scale to whole genomes. Even virus genomes. Pedersen and Jensen tried that in an MCMC approach. Hobolth’s recent approach might have worked (it is the paper I blogged about a little while back), but we didn’t know about it at the time.

Anyway, we essentially have a method for modelling the evolution of overlapping genes, but we cannot trust the alignment of viruses because they are too divergent, and if we infer an optimal alignment it is almost certainly wrong. An optimal alignment will often have too few substitutions compared to the real alignment.

What did we do?

Since we cannot trust a single alignment, we instead sum over all possible alignments. Using hidden Markov models, we can do that, and at the same time calculate the probability of any single one of them.

We can then consider the substitutions in each of the alignments and weight the observed substitutions by the probability of the alignment. That way, the more likely alignments weigh in more than the less likely ones when we consider substitutions.
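
The weighting idea can be sketched in a few lines of Python. This is a toy, not our implementation: in the paper the sum runs over all alignments via the HMM, while here I just enumerate two hand-picked candidate alignments, and the probabilities attached to them are invented for the example. The function names are hypothetical too.

```python
def count_substitutions(row_a, row_b):
    """Count mismatch columns in a pairwise alignment, ignoring gap columns ('-')."""
    return sum(1 for x, y in zip(row_a, row_b)
               if x != "-" and y != "-" and x != y)


def expected_substitutions(weighted_alignments):
    """Probability-weighted average of substitution counts.

    `weighted_alignments` is a list of (row_a, row_b, probability) tuples;
    the probabilities are renormalised in case they do not sum to one.
    """
    total = sum(p for _, _, p in weighted_alignments)
    return sum(p * count_substitutions(a, b)
               for a, b, p in weighted_alignments) / total


# Two candidate alignments of ACGT and ACCT (probabilities are made up):
candidates = [
    ("ACGT", "ACCT", 0.6),    # no gaps: one G/C substitution
    ("ACG-T", "AC-CT", 0.4),  # two gaps instead: no substitutions
]
print(expected_substitutions(candidates))
```

With these numbers the expected count is 0.6 substitutions, whereas picking just the single best alignment would report 1. The same two sequences give different substitution counts depending on the alignment, which is why committing to one alignment can bias the estimate.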

It is similar to what Rahul Satija, Lior Pachter and Jotun Hein were doing for phylogenetic footprinting in the neighbouring office at the same time…

Using this approach, we show that we alleviate a systematic bias in using optimal alignments and get better estimates of selection factors.

We only handle pair-wise alignments but hack our way into using more sequences to get better estimates still. It isn’t really the best approach and we should probably try a Gibbs sampler to handle multiple sequence alignments, but that is left for future work…


de Groot, S., Mailund, T., Lunter, G.A., Hein, J. (2008). Investigating Selection on Viruses: A Statistical Alignment Approach. BMC Bioinformatics