Posts Tagged ‘comparative genomics’

Investigating Selection on Viruses: A Statistical Alignment Approach

Wednesday, June 18th, 2008

ResearchBlogging.org
Woohoo, we just got a paper accepted. Although it is at BMC Bioinformatics, it isn't one of the papers I've been bitching about -- this one we got very helpful reviews on.

It is work from when I was in Oxford. Saskia de Groot did analysis of virus genomes for her PhD (see papers here and here) but for viruses that are relatively far divergent, getting good alignments is a bit of a problem, so I suggested we took a statistical alignment approach to integrate over the uncertainty. So we got together with Gerton Lunter -- who does work with this -- and came up with this:

Investigating Selection on Viruses: A Statistical Alignment Approach

S. de Groot, T. Mailund, G.A. Lunter and J. Hein

To appear in BMC Bioinformatics

Abstract

Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and improve in accuracy compared to the fixed alignment method.
Results: We run a series of simulation studies to gauge how well we do in comparison to other methods. We show that the standard practice of using a fixed ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We therefore propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Running our method on real data, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection. We also analyse Hepatitis B to understand the interaction of selection between two overlapping regions.

I'll add a link to the paper as soon as it is up at the journal.

What's the problem?

We were trying to figure out selection in viruses where genes can have overlapping reading frames. In such cases, figuring out the neutral substitution rate is a bit of a problem, 'cause a synonymous substitution in one gene can be a non-synonymous substitution in an overlapping gene. Using dN/dS to figure out selection won't work.

Instead we took and extended a method by Hein and Støvlbæk to explicitly model substitutions with selection in overlapping reading frames. We ought to consider the neighbour dependent substitutions you get when you are modelling codon changes (which again is complicated by overlapping genes), but methods for that can be very slow and won't scale to whole genomes. Even virus genomes. Pedersen and Jensen tried that in an MCMC approach. Hobolth's recent approach might have worked -- it is the paper I blogged about a little back -- but we didn't know about it at the time.

Anyway, we essentially have a method for modelling the evolution over overlapping genes, but we cannot trust the alignment of viruses because they are too divergent, and if we infer an optimal alignment it is almost certainly wrong. An optimal alignment will often have too few substitutions compared to the real alignment.

What did we do?

Since we cannot trust a single alignment, we instead sum over all possible alignments. Using hidden Markov models, we can do that, and at the same time calculate the probability of any single one of them.

We can then consider the substitutions in each of the alignments and weight the observed substitutions with the probability of the alignment. That way, the more likely alignment weigh in more when we consider substitutions than less likely.

It is similar to what Rahul Satija, Lior Pachter and Jotun Hein were doing for phylogenetic footprinting in the neighbour office at the samme time...

Using this approach, we show that we alleviate a systematic bias in using optimal alignments and get better estimates of selection factors.

We only handle pair-wise alignments but hack our way out of using more sequences to get better estimates still. It isn't really the best approach and we should probably try a Gibbs sampler to handle multiple sequence alignments, but that is left for future work...


de Groot, S., Mailund, T., Lunter, G.A., Hein, J. (2008). Investigating Selection on Viruses: A Statistical Alignment Approach . BMC Bioinformatics

Keynote talk on coalescent hidden Markov models and great ape speciation

Monday, June 9th, 2008

Wednesday this week I'm giving a talk at the Danish Society for Computer Science. I was asked to give the keynote talk at this years general assembly before they hand out this year's Best Thesis Award.

I'm tired of talking about my main project -- association mapping -- so I decided to give a talk on our CoalHMM work (although I am only working on that a small percentage of my time).

The slides file is too big to upload to slideshare, but if my flash hack works you should see them below:

You can also get it as a PDF file, a PowerPoint file or a Keynote file, if you want.

The last couple of presentations I've made I have been using the Keynote program on my Mac. I'm really happy with the program. It makes it very easy to put together pretty presentations (if I say so myself) with very little work. Especially its handling of graphics impresses me. Compared to using OpenOffice, as I usually did before getting the Mac, there's essentially no work involved in including graphics. Automatic masking and alpha channels beats modifying images in Gimp any day!

I couldn't really figure out how to put presentations on my homepage, though. Keynote files are really directories, so you cannot just copy them to the homepage directory. I found out, though, that if a Mac downloads a zip'ed Keynote file it will just open it in Keynote, so I guess that is the way to do that.

It must be hell, sequencing the Neanderthal

Monday, June 9th, 2008

Reading through a page at Nature about metagenomics (probably requires subscription...) I saw this sentence:

"The biggest metagenomic project on Earth might be our Neanderthal genome project," says Egholm. They are using 454 to sequence the complete genome of a Neanderthal, which Egholm says they hope to release by the end of the year. But 95–98% of the DNA in the Neanderthal sample comes from the environment rather than from a Neanderthal. This means that to get the 1 coverage, or roughly 3 billion base pairs, of the genome, the team must sequence somewhere between 70 billion to 100 billion base pairs of these environmental samples.

Sequencing the Neanderthal must be quite some challenge! Of course, contamination by bacteria should be fairly easy to discover and get rid of compared to contamination by the humans doing the sequences. We are just too closely related to the Neanderthal for that to be a simple task.

Of course, the Neanderthal specimens are handled carefully, but some contamination is unavoidable.  How much of a problem it is, I do not know, though.  I tried googling for it, but didn't really find any consistent answers.

I look forward to getting my hands on the Neanderthal sequence, though.  I would love running it through our CoalHMM analysis!

How do you calibrate the molecular clock?

Thursday, May 29th, 2008

How do you calibrate the molecular clock -- where you need a few known sequence divergence times -- when you only know a few speciation times?

Yesterday at a meeting (I'm not sure I can tell you which meeting; I'm not sure how open it is supposed to be :-/) we discussed the divergence time of human-orangutan and human-macaque. We need the sequence divergence time to calibrate a CoalHMM model for figuring out some speciation and population genetics parameters of ancestral species.

No definitive answer came up at the meeting, but there was a short discussion by email after the meeting. This paper was sent around, where the divergence times were estimated to 25MYA and 13MYA, respectively, although the last of those numbers is actually the calibration point used in the analysis, so it is an assumption more than an estimate.

The problem is, the 13MYA used for the calibration is based on fossil evidence, and as far as I can see, that would make it an estimate for the speciation time between human and orangutan. We need the sequence divergence time. Speciation time and divergence time can vary with millions of years (if the effective population size is large enough).

If 13MYA is the divergence time between human and orangutan, we get a speciation time that is unrealistically recent.  If the divergence time is 18MYA instead, as we assumed in this paper, we would get a speciation time around 12MYA which would match the MBE paper.

But how do you figure out the divergence time needed to calibrate the clock?  Is there any way to get it, rather than the speciation time, from fossil evidence?

For our purposes, I suppose we can just as well work with speciation times for our calibration, but not everyone is using CoalHMMs for their analysis, so how do you deal with this problem?

Recombination and substitution rates

Thursday, May 22nd, 2008

ResearchBlogging.orgIn a paper from PLoS Genetics earlier this month, Laurent Duret and Peter F. Arndt did a genome wide analysis of the correlation between recombination rate and substitution rate (and bias).

The Impact of Recombination on Nucleotide Substitutions in the Human Genome

Duret, L., Arndt, P.F. PLoS Genetics, 4(5) 2008

Abstract

Unraveling the evolutionary forces responsible for variations of neutral substitution patterns among taxa or along genomes is a major issue for detecting selection within sequences. Mammalian genomes show large-scale regional variations of GC-content (the isochores), but the substitution processes at the origin of this structure are poorly understood. We analyzed the pattern of neutral substitutions in 1 Gb of primate non-coding regions. We show that the GC-content toward which sequences are evolving is strongly negatively correlated to the distance to telomeres and positively correlated to the rate of crossovers (R2 = 47%). This demonstrates that recombination has a major impact on substitution patterns in human, driving the evolution of GC-content. The evolution of GC-content correlates much more strongly with male than with female crossover rate, which rules out selectionist models for the evolution of isochores. This effect of recombination is most probably a consequence of the neutral process of biased gene conversion (BGC) occurring within recombination hotspots. We show that the predictions of this model fit very well with the observed substitution patterns in the human genome. This model notably explains the positive correlation between substitution rate and recombination rate. Theoretical calculations indicate that variations in population size or density in recombination hotspots can have a very strong impact on the evolution of base composition. Furthermore, recombination hotspots can create strong substitution hotspots. This molecular drive affects both coding and non-coding regions. We therefore conclude that along with mutation, selection and drift, BGC is one of the major factors driving genome evolution. Our results also shed light on variations in the rate of crossover relative to non-crossover events, along chromosomes and according to sex, and also on the conservation of hotspot density between human and chimp.

The main point of this paper is the evolution of the GC content of the human genome, that varies significantly in various regions of the genome -- the so-called isochore structure.

The evolution of isochores

The content of GC nucleotides vary along the genome, with some regions having very high fractions of GC and some having very low, and this variation is not what we would expect the sequence to look like if the entire genome was evolving under the same neutral process.

Why the genome has this structure has been debated (at time heated debates) the last two decades. Different explanations have been suggested, including:

  1. The mutation rate is biased and varies along the genome.
  2. Selection prefers high GC content in some regions and not in others.
  3. Gene conversion is biased, preferring to replace AT alleles with GC alleles.

where the later is a theory developed, among others, by the authors of this new paper.

Biased mutation rates is of course a possibility, but doesn't explain the correlation with the recombination rate, unless the latter is mutagenic or causes this bias.

Selection is the explanation of Bernardi, the discoverer of the isochore structure.

Biased gene conversion is a neutral process that looks a lot like selection. The idea is as follows: there is no particular need for a bias in the mutation process -- the AT to GC and GC to AT substitutions are not necessarily occurring at different rates in GC rich and GC poor regions -- but once a polymorphism exists, gene-conversion between a GC allele and an AT allele will replace the AT allele with the GC allele more often than the other way around.

A consequence of this is, that although the mutation rate might not vary along the genome, the substitution rate will, and this substitution rate will be correlated with the recombination rate.

Eyre-Walker and Hurst (2001) gives more details on the three theories above.

The case for biased gene conversion

In the PLoS Genetics paper they argue for the biased gene conversion explanation (not surprisingly), and reasonably convincingly, in my opinion, but I am not an expert...

First, they construct a model of sequence evolution that does not assume time-reversibility and that the current sequences are at stationarity (which is usually assumed, but might not be true).

From this model, they estimate the substitution rate of the various types of substitutions, and they estimate the equilibrium GC content (called GC* in the paper). In the model, the equilibrium GC content can be different than the current GC content, as stationarity is not assumed, and in general GC* < GC meaning that the GC content in our genome -- and this especially in GC rich areas -- is decreasing. Very slowly, though.

This could suggest that whatever mechanism created the GC rich areas of our genome is either no longer in effect, or at least is weaker than it was when the GC rich areas were created.

They then consider the correlation between recombination rates and GC / GC* and notice a significant correlation, with a stronger correlation between recombintion rate and GC* than between recombination and GC.

This is take as evidence that it is recombination that drives the direction of mutations toward GC content, rather than base pair composition that determines recombination rate; if the recombination rate was determined by the base pair composition, then the present day GC content should be more correlated with the rate than some far future stationary GC content.

The biased gene conversion model suggest a preference for AT to GC substitutions in regions with high recombination rates, but where the strength of this preference depends on the effective population size.

The positive correlation between GC* and the recombination rate supports this, and the present day effective population size (or the present day recombination rate) can explain why the GC structure in the genome is eroding towards a higher AT content in the present day GC rich regions. The GC rich regions of today could have appeared in an ancestor with either a larger effective population size, or regional larger recombination rates, and the reduction in the effective population size in the present day humans is just not large enough that the biased gene conversion mechanism can keep the GC content at a high level.

The case against biased mutation and against selection

The biased mutation explanation is argued against based on the frequency patterns of polymorphisms. If the mutations are biased, but the resulting polymorphisms are selectively neutral, then the frequency of GC and AT derived polymorphisms should be the same.  However, GC alleles segregate at higher frequencies than AT alleles.

The first argument against selection is less convincing, I feel, but essentially says: it is hard to imagine why selection should prefer the occasional GC  in Mbp long regions with plenty of genes under selection, and even if it did, it probably wouldn't be strong enough to drive the changes in GC content.  Well...

The second argument is that selection does not explain why GC content, and especially GC*, should be correlated with the recombination rate.  One possible explanation is the Hill-Robertson effect, but then the correlation should be between GC* and the population recombination, but GC* is stronger correlated with male recombination rate than with female recombination rate, something Hill-Robertson does not explain.

Conclusion

I read this paper because I was reading up on the correlation between effective population size and recombination rate for a project I'm working on.  I knew about the debate about isochores -- I've chatted with some of the biased gene conversion proponents who have visited BiRC -- but I never really read up on it.

It turns out that several of my colleagues at BiRC are interested in this, so we've discussed the paper over the last two days, and I've had a lot of fun reading my way through some of the references in the paper.

I would recommend it as an introduction to this, but of course not a neutral discussion of the three theories.


Duret, L., Arndt, P.F. (2008). The Impact of Recombination on Nucleotide Substitutions in the Human Genome. PLoS Genetics, 4(5), e1000071. DOI: 10.1371/journal.pgen.1000071

Eyre-Walker, A., Hurst, L.D. (2001). The evolution of isochores. Nature Reviews Genetics, 2(7), 549-555. DOI: 10.1038/35080577