Archive for May 22nd, 2008

Recombination and substitution rates

Thursday, May 22nd, 2008

In a paper from PLoS Genetics earlier this month, Laurent Duret and Peter F. Arndt did a genome-wide analysis of the correlation between recombination rate and substitution rate (and bias).

The Impact of Recombination on Nucleotide Substitutions in the Human Genome

Duret, L., Arndt, P.F. PLoS Genetics, 4(5) 2008

Abstract

Unraveling the evolutionary forces responsible for variations of neutral substitution patterns among taxa or along genomes is a major issue for detecting selection within sequences. Mammalian genomes show large-scale regional variations of GC-content (the isochores), but the substitution processes at the origin of this structure are poorly understood. We analyzed the pattern of neutral substitutions in 1 Gb of primate non-coding regions. We show that the GC-content toward which sequences are evolving is strongly negatively correlated to the distance to telomeres and positively correlated to the rate of crossovers (R2 = 47%). This demonstrates that recombination has a major impact on substitution patterns in human, driving the evolution of GC-content. The evolution of GC-content correlates much more strongly with male than with female crossover rate, which rules out selectionist models for the evolution of isochores. This effect of recombination is most probably a consequence of the neutral process of biased gene conversion (BGC) occurring within recombination hotspots. We show that the predictions of this model fit very well with the observed substitution patterns in the human genome. This model notably explains the positive correlation between substitution rate and recombination rate. Theoretical calculations indicate that variations in population size or density in recombination hotspots can have a very strong impact on the evolution of base composition. Furthermore, recombination hotspots can create strong substitution hotspots. This molecular drive affects both coding and non-coding regions. We therefore conclude that along with mutation, selection and drift, BGC is one of the major factors driving genome evolution. Our results also shed light on variations in the rate of crossover relative to non-crossover events, along chromosomes and according to sex, and also on the conservation of hotspot density between human and chimp.

The main point of this paper is the evolution of the GC content of the human genome, which varies significantly between regions of the genome (the so-called isochore structure).

The evolution of isochores

The content of GC nucleotides varies along the genome, with some regions having very high fractions of GC and some having very low, and this variation is not what we would expect the sequence to look like if the entire genome were evolving under the same neutral process.

Why the genome has this structure has been debated, at times heatedly, for the last two decades. Different explanations have been suggested, including:

  1. The mutation rate is biased and varies along the genome.
  2. Selection prefers high GC content in some regions and not in others.
  3. Gene conversion is biased, preferring to replace AT alleles with GC alleles.

The last of these is a theory developed, among others, by the authors of this new paper.

A biased mutation rate is of course a possibility, but it doesn’t explain the correlation with the recombination rate, unless recombination is itself mutagenic or causes this bias.

Selection is the explanation of Bernardi, the discoverer of the isochore structure.

Biased gene conversion is a neutral process that looks a lot like selection. The idea is as follows: there is no particular need for a bias in the mutation process — the AT to GC and GC to AT substitutions are not necessarily occurring at different rates in GC rich and GC poor regions — but once a polymorphism exists, gene-conversion between a GC allele and an AT allele will replace the AT allele with the GC allele more often than the other way around.

A consequence of this is that, although the mutation rate might not vary along the genome, the substitution rate will, and this substitution rate will be correlated with the recombination rate.

Eyre-Walker and Hurst (2001) give more details on the three theories above.
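To make that concrete, here is a minimal sketch (mine, not code from the paper). It assumes the usual diffusion approximation in which biased gene conversion acts like weak selection with a scaled coefficient S, so that a new allele fixes at S / (1 - exp(-S)) times the neutral rate. The function names and the values of S are made up for illustration.

```python
import math

def relative_substitution_rate(S):
    """Substitution rate relative to the neutral rate for a scaled bias S.

    Assumption: the standard weak-selection approximation S / (1 - exp(-S)).
    S > 0 means the allele is favoured by gene conversion, S < 0 disfavoured.
    """
    if abs(S) < 1e-9:
        return 1.0  # the neutral limit as S goes to 0
    return S / (1.0 - math.exp(-S))

def bgc_rates(mu, S):
    """Return (AT->GC, GC->AT) substitution rates for one shared mutation rate mu."""
    return mu * relative_substitution_rate(S), mu * relative_substitution_rate(-S)

# Same mutation rate in both directions; only the conversion bias S changes.
for S in (0.0, 0.5, 1.0, 2.0):
    at_to_gc, gc_to_at = bgc_rates(1.0, S)
    print(f"S={S:.1f}  AT->GC={at_to_gc:.3f}  GC->AT={gc_to_at:.3f}")
```

With S = 0 the two rates are equal. As S grows, the AT to GC rate rises and the GC to AT rate falls, even though the mutation rate is the same in both directions, so the substitution pattern depends on the local recombination rate through S.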

The case for biased gene conversion

In the PLoS Genetics paper they argue for the biased gene conversion explanation (not surprisingly), and reasonably convincingly, in my opinion, but I am not an expert…

First, they construct a model of sequence evolution that assumes neither time-reversibility nor that the current sequences are at stationarity (both are usually assumed, but might not be true).

From this model, they estimate the substitution rate of the various types of substitutions, and they estimate the equilibrium GC content (called GC* in the paper). In the model, the equilibrium GC content can differ from the current GC content, as stationarity is not assumed, and in general GC* < GC, meaning that the GC content in our genome, especially in GC rich areas, is decreasing. Very slowly, though.
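As a rough illustration of what an equilibrium GC content means, here is a two-state simplification (the model in the paper has more substitution types than this). It assumes a single AT to GC rate u and a single GC to AT rate v, both hypothetical, so that GC* = u / (u + v) and the GC content relaxes towards it exponentially.

```python
import math

def equilibrium_gc(u, v):
    """Equilibrium GC content for AT->GC rate u and GC->AT rate v."""
    return u / (u + v)

def gc_after(gc_now, u, v, t):
    """GC content after time t, starting from gc_now.

    Solves d(GC)/dt = u * (1 - GC) - v * GC, whose solution decays
    towards the equilibrium at rate u + v.
    """
    gc_star = equilibrium_gc(u, v)
    return gc_star + (gc_now - gc_star) * math.exp(-(u + v) * t)

# Made-up rates, in arbitrary time units, with GC->AT twice as fast as AT->GC.
u, v = 1.0, 2.0
print("GC* =", round(equilibrium_gc(u, v), 3))
for t in (0.0, 0.1, 0.5, 2.0):
    print(f"t={t:.1f}  GC={gc_after(0.5, u, v, t):.3f}")
```

With these made-up rates GC* is one third, below the starting value of 0.5, so the GC content drifts downwards. With realistic per-site rates the time scale 1 / (u + v) is very long, which is why the decrease is so slow.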

This could suggest that whatever mechanism created the GC rich areas of our genome is either no longer in effect, or at least is weaker than it was when the GC rich areas were created.

They then consider the correlation between recombination rates and GC / GC* and notice a significant correlation, with a stronger correlation between recombination rate and GC* than between recombination and GC.

This is taken as evidence that it is recombination that drives the direction of mutations toward GC content, rather than base pair composition that determines recombination rate; if the recombination rate were determined by the base pair composition, then the present day GC content should be more correlated with the rate than some far future stationary GC content.

The biased gene conversion model suggests a preference for AT to GC substitutions in regions with high recombination rates, with the strength of this preference depending on the effective population size.

The positive correlation between GC* and the recombination rate supports this, and the present day effective population size (or the present day recombination rate) can explain why the GC structure in the genome is eroding towards a higher AT content in the present day GC rich regions. The GC rich regions of today could have appeared in an ancestor with either a larger effective population size or regionally higher recombination rates, and after the reduction in effective population size, the present day human population is just not large enough for the biased gene conversion mechanism to keep the GC content at a high level.
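The dependence on population size can be sketched in a few lines. This is my own toy calculation, not the one in the paper. It assumes a scaled bias S = 4 * Ne * b, where b is a hypothetical per-site conversion bias, and the standard result that a bias S changes the ratio of GC to AT over AT to GC substitution rates by a factor exp(-S). All numbers are invented.

```python
import math

def gc_star(ne, b, v_over_u=2.0):
    """Equilibrium GC content under biased gene conversion.

    ne       : effective population size
    b        : hypothetical conversion bias per site (scales with recombination)
    v_over_u : ratio of GC->AT to AT->GC mutation rates (assumed AT-biased)
    """
    S = 4.0 * ne * b  # scaled strength of the bias
    return 1.0 / (1.0 + v_over_u * math.exp(-S))

# Same bias b, different effective population sizes (illustrative values only).
for ne in (10_000, 50_000, 100_000):
    print(f"Ne={ne:>7}  GC*={gc_star(ne, b=1e-5):.3f}")
```

A smaller Ne (or a smaller b, that is, less recombination) gives a smaller S and pulls GC* down towards the value set by mutation alone, which is the erosion described above.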

The case against biased mutation and against selection

The biased mutation explanation is argued against based on the frequency patterns of polymorphisms. If the mutations are biased, but the resulting polymorphisms are selectively neutral, then the frequency of GC and AT derived polymorphisms should be the same.  However, GC alleles segregate at higher frequencies than AT alleles.

The first argument against selection is less convincing, I feel, but essentially says: it is hard to imagine why selection should prefer the occasional GC  in Mbp long regions with plenty of genes under selection, and even if it did, it probably wouldn’t be strong enough to drive the changes in GC content.  Well…

The second argument is that selection does not explain why GC content, and especially GC*, should be correlated with the recombination rate. One possible explanation is the Hill-Robertson effect, but then the correlation should be between GC* and the population recombination rate. Instead, GC* is more strongly correlated with the male recombination rate than with the female recombination rate, something Hill-Robertson does not explain.

Conclusion

I read this paper because I was reading up on the correlation between effective population size and recombination rate for a project I’m working on.  I knew about the debate about isochores — I’ve chatted with some of the biased gene conversion proponents who have visited BiRC — but I never really read up on it.

It turns out that several of my colleagues at BiRC are interested in this, so we’ve discussed the paper over the last two days, and I’ve had a lot of fun reading my way through some of the references in the paper.

I would recommend it as an introduction to the topic, though of course not as a neutral discussion of the three theories.


Duret, L., Arndt, P.F. (2008). The Impact of Recombination on Nucleotide Substitutions in the Human Genome. PLoS Genetics, 4(5), e1000071. DOI: 10.1371/journal.pgen.1000071

Eyre-Walker, A., Hurst, L.D. (2001). The evolution of isochores. Nature Reviews Genetics, 2(7), 549-555. DOI: 10.1038/35080577

Software decay and software repositories

Thursday, May 22nd, 2008

bbgm suggests:

In essence this expands on the issue that I have been raising lately; that academics should use code repositories like Google Code, Sourceforge or Github. That not only moves some of the issues with code maintenance infrastructure and utilities out onto the cloud, it also brings in the ability of a bigger user base, ability to access mode more easily, etc.

Will this solve the problem of URL decay mentioned in the latest issue of Bioinformatics?

URL decay in MEDLINE — a 4-year follow-up study

Jonathan D. Wren Bioinformatics 2008 24(11):1381-1385; doi:10.1093/bioinformatics/btn127

Abstract

Motivation: Internet-based electronic resources, as given by Uniform Resource Locators (URLs), are being increasingly used in scientific publications but are also becoming inaccessible in a time-dependant manner, a phenomenon documented across disciplines. Initial reports brought attention to the problem, spawning methods of effectively preserving URL content while some journals adopted policies regarding URL publication and begun storing supplementary information on journal websites. Thus, a reexamination of URL growth and decay in the literature is merited to see if the problem has grown or been mitigated by any of these changes.

Results: After the 2003 study, three follow-up studies were conducted in 2004, 2005 and 2007. Unfortunately, no significant change was found in the rate of URL decay among any of the studies. However, only 5% of URLs cited more than twice have decayed versus 20% of URLs cited once or twice. The most common types of lost content were computer programs (43%), followed by scholarly content (38%) and databases (19%). Compared to URLs still available, no lost content type was significantly over- or underrepresented. Searching for 30 of these websites using Google, 11 (37%) were found relocated to different URLs.

Conclusions: URL decay continues unabated, but URLs published by organizations tend to be more stable. Repeated citation of URLs suggests calculation of an electronic impact factor (eIF) would be an objective, quantitative way to measure the impact of Internet-based resources on scientific research.

It certainly seems like we are losing our data and programs, so some larger repositories might be the way to go…

Nuts!

Thursday, May 22nd, 2008

In this letter to Nature, Raghavendra Gadagkar argues that the open access model — that typically means “pay to publish, but read for free” — is doing more harm to research in the developing world than the traditional “publish for free, but pay to read” model.

The reasoning is that having to pay to publish means that publications are not a result of the quality of one’s research, but just as much a result of one’s funding, and in developing countries there is less funding.

This is, of course, a valid point, but to conclude from this that the open access model — even if it means you have to pay to publish — is doing more harm than good is, well, just nuts!

First of all, many top journals charge you both for publishing and for reading the articles. With open access, at least, you can read for free.

Secondly, even if the publishing charges are much higher than the reading charge, you only pay when you have a result worth publishing. I don’t know about you, but I personally read a lot more papers than I publish, and most papers I read are never cited in my own work, because they turn out not to be relevant for my own work.

Gadagkar ends his letter with:

A ‘publish for free, read for free’ model may one day prove to be viable. Meanwhile, if I have to choose between the two evils, I prefer the ‘publish for free and pay to read’ model over the ‘pay to publish and read for free’ one. Because if I must choose between publishing or reading, I would choose to publish. Who would not?

Of course we all prefer to publish our own papers, but you cannot, and should not, publish worthwhile research if you are not familiar with the work of other researchers and have read the literature. You cannot choose publishing over reading!

I’m not saying there isn’t a problem with publication charges, but I strongly disagree with the claim that it is worse than the charge for access to papers (and I remind you, once more, that in many cases you get both of the two evils…)