Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

Today, while preparing for a thesis meeting with Ricky, I read Gerton’s paper:

Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes

G. A. Lunter

Bioinformatics 2007; DOI: 10.1093/bioinformatics/btm185


Motivation: The two mutation processes that have the largest impact on genome evolution at small scales are substitutions, and sequence insertions and deletions (indels). While the former have been studied extensively, indels have received less attention, and in particular, the problem of inferring indel rates between pairs of divergent sequence remains unsolved. Here, I describe a novel and accurate method for estimating neutral indel rates between divergent pairs of genomes.

Results: Simulations suggest that new method for estimating indel rates is accurate to within 2%, at divergences corresponding to that of human and mouse. Applying the method to these species, I show that indel rates are up to twice higher than is apparent from alignments, and depend strongly on the local G + C content. These results indicate that at these evolutionary distances, the contribution of indels to sequence divergence is much larger than hitherto appreciated. In particular, the ratio of substitution to indel rates between human and mouse appears to be around gamma = 8, rather than the currently accepted value of about gamma = 14.

I knew the results before, from discussions with Gerton, but this is the first time I’ve actually read the paper. It concerns the biases in how alignment algorithms place gaps (whether probabilistic or parsimony-based), and how these biases lead us to underestimate the number of indels in the true alignment and thus the indel rate.

Gap errors

The problem with gaps is that it is almost always better to have a few extra substitutions than a few extra gaps, since indels are less frequent and so each occurrence is less likely. When maximising the likelihood of the alignment, we therefore tend to remove gaps that should be there (even unlikely events do occur from time to time) and instead add substitutions that should not be there.
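To make the trade-off concrete, here is a minimal sketch in Python. It is a toy scoring model of my own, not the model from the paper: every aligned column is a match with probability 1 - sub_rate or a mismatch with probability sub_rate, each gap costs gap_open_prob, and the transition probabilities for not opening a gap are ignored. The scenario is also made up: the true alignment has two single-base indels three columns apart (20 matching columns, 2 gaps), while the gapless alternative keeps the 17 matches outside that region and, in the worst case, turns the 4 shifted columns into mismatches.

```python
import math


def log_score(matches, mismatches, gap_opens, sub_rate, gap_open_prob):
    """Toy log-probability of an alignment (hypothetical model, see text)."""
    return (matches * math.log(1.0 - sub_rate)
            + mismatches * math.log(sub_rate)
            + gap_opens * math.log(gap_open_prob))


def compare(sub_rate, gap_open_prob=0.01):
    """Return (gapped, gapless) log-scores for the made-up scenario."""
    # True alignment: 20 matching columns and two single-base gaps.
    gapped = log_score(20, 0, 2, sub_rate, gap_open_prob)
    # Gapless alternative: 17 matches, 4 shifted columns scored as mismatches.
    gapless = log_score(17, 4, 0, sub_rate, gap_open_prob)
    return gapped, gapless


# A high substitution rate, chosen only for illustration.
gapped_hi, gapless_hi = compare(sub_rate=0.3)
print("gapped:", round(gapped_hi, 2), "gapless:", round(gapless_hi, 2))
print("gapless alignment preferred:", gapless_hi > gapped_hi)
```

With these illustrative numbers the gapless alignment scores higher even though the gapped one is the truth, so a maximum-likelihood aligner would drop both indels. In the same toy model, calling compare(sub_rate=0.02) reverses the ordering and the gapped alignment wins.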

Unbiased estimator

Using statistical alignment and posterior decoding, Gerton derives another estimator for the indel rate and shows that it essentially removes the bias. The essential idea is that when the alignment is derived through the statistical alignment algorithm, areas where gaps are misplaced will have a lower posterior certainty. The optimal alignment is then not significantly more likely than several others, so the posterior probability of that exact alignment is lower than it would be if the placement of the gaps were more certain.
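The following Python sketch illustrates only the intuition, not the estimator from the paper. It assumes a handful of hypothetical candidate alignments for one region, each with a made-up log-likelihood and gap count, treats them as the complete set of alignments with a uniform prior, and compares the gap count of the single best alignment with the posterior-weighted gap count. A real statistical aligner would sum over all alignments with the forward-backward algorithm instead of enumerating them.

```python
import math


def posterior_weights(log_likelihoods):
    """Normalise log-likelihoods into posterior probabilities (uniform prior)."""
    top = max(log_likelihoods)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(ll - top) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical candidates: (log-likelihood, number of gaps). Values are made up.
candidates = [(-10.9, 0), (-11.2, 2), (-11.5, 2), (-11.6, 1)]

posteriors = posterior_weights([ll for ll, _ in candidates])

# Gap count if we trust only the most likely alignment.
best_ll, map_gaps = max(candidates, key=lambda c: c[0])

# Gap count averaged over all candidates, weighted by posterior probability.
expected_gaps = sum(p * gaps for p, (_, gaps) in zip(posteriors, candidates))

print("posterior of best alignment:", round(posteriors[0], 3))
print("gaps in best alignment:", map_gaps)
print("posterior-expected gaps:", round(expected_gaps, 3))
```

In this made-up example the best alignment has no gaps, but it holds well under half of the posterior mass, and the alternatives that do contain gaps pull the expected gap count above zero. That low posterior certainty is the signal the paragraph above describes.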

The new estimator is the red line on the plot on the right. The blue line is what you would get if you just trusted the most likely alignment. The green line is what you get by fitting the neutral indel model from his earlier paper, "Genome-Wide Identification of Human Functional DNA Using a Neutral Indel Model" (Lunter, Ponting and Hein, PLoS Computational Biology 2006).

The reason the bias only shows when the substitution rate is rather high is, of course, that you are more likely to mistake non-homologous sequence for homologous when misplacing a gap if the true alignment has a low sequence identity than if it has a high sequence identity, i.e. a low substitution rate.

The citation, for Research Blogging:
Lunter, G. (2007). Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics, 23(13), i289-i296. DOI: 10.1093/bioinformatics/btm185