Posts Tagged ‘Paper review’

Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation

Wednesday, August 26th, 2009

A few days ago I wrote about the hypothesis of complex speciation between humans and chimps, and today I’ll briefly discuss another paper on human/chimp speciation:

Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation

Matthew T. Webster, Gene 443 70-75, 2009

Abstract

There is a large variation in divergence times across genomic regions between human and chimpanzee. It has been suggested that this could partly result from selection against ancestral gene flow between incipient species in regions of the genome containing genetic incompatibilities. It is possible that such barriers to gene flow could arise in specific genes or in chromosomal inversions. I analysed patterns of lineage sorting that occur between human, chimpanzee and gorilla genomic sequences by examining divergent site patterns in > 18 Mb genomic alignments. I develop a method to normalise site patterns by the mutational spectrum to minimise errors caused by misinference caused by recurrent mutation. Here I show that divergence times appear to be uniform between coding and noncoding sequences and between inverted and non-rearranged portions of chromosomes. I therefore find no evidence to support the large-scale accumulation of genetic incompatibilities at speciation genes or chromosomal inversions in the ancestral population of humans and chimpanzees. In addition, site patterns that are discordant with the species tree occur more frequently in regions with high human recombination rates. This could indicate the action of selective sweeps in the ancestral population, but could also be indicative of increased rates of homoplasy in these regions. I argue that these observations are compatible with a neutral allopatric model of speciation.

Models of speciation

Speciation happens when gene flow stops between one group of a species and another (and doesn’t start again later, or we get something like the hybridization scenario I wrote about in my earlier post).

There are different ways this can happen.  For instance, one group might somehow find itself geographically isolated from the other – e.g. find itself on the other side of a large river – effectively isolating the group from the rest of the species.  This is known as allopatric speciation (or, depending on exactly how this plays out, peripatric speciation).

In this scenario, the speciation happens at the time when the groups are isolated.  From that point onwards the groups are essentially different species, since gene flow has stopped.  It will take some time before the groups are incapable of inter-breeding, but unless they actually merge again at some time before then, the time of the speciation event is the time the groups get separated.

That doesn’t mean that the genomic divergence time between the two species matches the time back to the speciation event.  Some individuals in one of the groups might be more closely related to individuals in the second group than to the other individuals in the first group for a few generations.  So the genetic distance between the two species is a bit larger than the “species distance”.  Add in recombination and the picture gets a bit more complex.

Still, we can talk about a specific point in time where the speciation occurred, and we have a mathematical model – the coalescent model – of the genomic distance between the two species that depends on this time and the population genetics in the ancestral species before then.
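To put rough numbers on the gap between the speciation time and the genomic divergence time, here is a minimal sketch under the standard neutral coalescent. It assumes a constant, diploid ancestral population of size N, in which two lineages take on average 2N generations to coalesce once they are back in the ancestral species. The function names and the numbers in the usage lines are mine, purely for illustration.

```python
import random


def expected_divergence(t_split, n_anc):
    """Expected time (in generations) back to the common ancestor of one
    lineage from each species: the split time plus the expected
    coalescence time of two lineages in the ancestral population (2N).
    Assumes a constant-size diploid ancestral population."""
    return t_split + 2 * n_anc


def simulate_divergence(t_split, n_anc, replicates=10000, seed=1):
    """Monte Carlo check of the expectation above. The waiting time to
    coalescence for two lineages is exponential with mean 2N."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        total += t_split + rng.expovariate(1.0 / (2 * n_anc))
    return total / replicates


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to make the arithmetic easy.
    print(expected_divergence(1000, 500))
    print(simulate_divergence(1000, 500))
```

The point of the sketch is only that the genomic divergence is always older than the split, and that it varies from locus to locus because the coalescence time in the ancestor is random.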

The speciation can also be caused by “genetic isolation”.

If a new mutation enters the group, where homozygotes for either the wildtype or the mutant allele are fitter than the heterozygotes, then the group will tend to split into two: the mutants and the wildtypes.

Without recombination, there wouldn’t be much difference in the genomic distance between the two resulting species.  The heterozygotes would be selected against and the two homozygotes would diverge.

With recombination, again the situation gets a bit more complicated.  The heterozygotes would still be selected against, but assuming heterozygotes still manage to mate from time to time, you would get homozygote offspring of heterozygotes who are just as fit as other homozygotes.

Because there is selection against heterozygotes you will tend to split the species into two – the two homozygotes – but the divergence will be deeper at the locus of the mutation than it will in the rest of the genome.

We call such a locus a “speciation gene” and candidates for such genes are functional genes (where we expect some selection) or structural variations such as inversions.

Back to the paper…

What Webster looks at in this paper is the patterns of divergence – especially deep coalescence events with incomplete lineage sorting where we observe sites grouping human and gorilla or chimp and gorilla – in the genome.
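To make the idea of site patterns concrete, here is a small sketch of how one could count them in an alignment. It is not the method from the paper (which also normalises by the mutational spectrum); it simply polarises each site against an outgroup sequence, which is my own assumption, and counts which two of the three species share the derived base. Patterns HG and CG are the ones that disagree with the species tree.

```python
from collections import Counter


def count_site_patterns(human, chimp, gorilla, outgroup):
    """Count informative site patterns in four aligned sequences.

    'HC' = human and chimp share the derived base (species tree),
    'HG' and 'CG' = patterns discordant with the species tree.
    Sites with gaps or more than two bases are skipped. The outgroup
    is an assumption of this sketch, used to decide which base is
    ancestral.
    """
    counts = Counter()
    for h, c, g, o in zip(human, chimp, gorilla, outgroup):
        if "-" in (h, c, g, o):
            continue
        if len({h, c, g, o}) != 2:
            continue  # keep only sites with exactly two different bases
        derived = (h != o, c != o, g != o)
        if derived == (True, True, False):
            counts["HC"] += 1
        elif derived == (True, False, True):
            counts["HG"] += 1
        elif derived == (False, True, True):
            counts["CG"] += 1
    return counts


if __name__ == "__main__":
    # Toy sequences, made up for illustration.
    print(count_site_patterns("TGCAT-", "TATATA", "AGTAAA", "AACAAA"))
```

On real data a single discordant site can also come from recurrent mutation rather than incomplete lineage sorting, which is exactly why the paper corrects for the mutational spectrum.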

He then looks at these patterns in genes, introns, inversions … the candidates for speciation genes, to see if these look like they are more divergent than the rest of the genome.  If so, then the speciation between humans and chimps could be caused by speciation genes.  If not, then the speciation could be allopatric (the same “species divergence” throughout the genome, but of course not the exact same sequence divergence since the coalescence times will still vary along the genome).

Long story short, he doesn’t find any evidence for deeper divergence in these places, so we cannot rule out an allopatric speciation here.

He does find a correlation between recombination rate and deep divergence, which can be explained by either increased mutability in regions of high recombination or selective sweeps in the ancestral species.  The latter is much more interesting, really, but we cannot rule out the former so I won’t comment much on this here…

Criticism

I do have a slight problem with the analysis in the paper, though.

It seems to me that just looking at differences in divergence time between genes and the rest of the genome – or between inversions and the rest of the genome or whatnot – is not a particularly powerful way of detecting speciation genes.

When comparing general groups like this, it seems to me that a few speciation genes would simply be drowned out by the larger number of “plain old genes”.  So all the analysis is really saying is that there isn’t a large number of speciation genes between humans and chimps, not that there are none.

The paper doesn’t claim any more than this either, but it would be interesting to work out just how large a fraction of the genes would have to be speciation genes – and how large a difference between the divergence of speciation genes and the rest of the genome there has to be – to be able to distinguish between the two scenarios with this analysis.

I haven’t done the math yet, but I plan to when I get the time…
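As a placeholder for that calculation, here is a very crude back-of-the-envelope sketch, not the real thing. It assumes a one-sided z-test on the mean divergence of a group of genes against a known background mean, that a fraction of the genes are shifted by a fixed amount delta, and that all genes share the same standard deviation sigma. All names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def detection_power(fraction, delta, sigma, n_genes, alpha=0.05):
    """Power of a one-sided z-test to see that the group mean is shifted.

    If a fraction f of the genes is shifted by delta, the group mean is
    shifted by f * delta, with standard error of roughly
    sigma / sqrt(n_genes) (extra variance from the mixture is ignored).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)
    shift = fraction * delta * math.sqrt(n_genes) / sigma
    return 1 - z.cdf(z_crit - shift)


def minimal_fraction(delta, sigma, n_genes, alpha=0.05, power=0.8):
    """Smallest fraction of shifted genes detectable with the given power
    under the same assumptions (capped at 1)."""
    z = NormalDist()
    needed = z.inv_cdf(1 - alpha) + z.inv_cdf(power)
    return min(1.0, needed * sigma / (delta * math.sqrt(n_genes)))


if __name__ == "__main__":
    # Hypothetical setting: shift equal to one standard deviation.
    print(minimal_fraction(delta=1.0, sigma=1.0, n_genes=10000))
    print(detection_power(0.001, delta=1.0, sigma=1.0, n_genes=10000))
```

Even in this toy setting the message is the same as above: a handful of speciation genes among thousands of ordinary genes shifts the group mean far too little to be seen.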


Webster, M. (2009). Patterns of autosomal divergence between the human and chimpanzee genomes support an allopatric model of speciation. Gene, 443 (1-2), 70-75. DOI: 10.1016/j.gene.2009.05.006

New paper out

Wednesday, March 4th, 2009

We just got a new paper out yesterday in BMC Medical Genetics:

Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups

Schierup et al.

BMC Medical Genetics 2009, 10:20 doi:10.1186/1471-2350-10-20

Abstract

Background

A small region of about 70 kb on human chromosome 19q13.3 encompasses 4 genes of which 3, ERCC1, ERCC2, and PPP1R13L (aka RAI) are related to DNA repair and cell survival, and one, CD3EAP, aka ASE1, may be related to cell proliferation. The whole region seems related to the cellular response to external damaging agents and markers in it are associated with risk of several cancers.

Methods

We downloaded the genotypes of all markers typed in the 19q13.3 region in the HapMap populations of European, Asian and African descent and inferred haplotypes. We combined the European HapMap individuals with a Danish breast cancer case-control data set and inferred the association between HapMap haplotypes and disease risk.

Results

We found that the susceptibility haplotype in our European sample had increased from 2 to 50 percent very recently in the European population, and to almost the same extent in the Asian population. The cause of this increase is unknown. The maximal proportion of overall genetic variation due to differences between groups for Europeans versus Africans and Europeans versus Asians (the Fst value) closely matched the putative location of the susceptibility variant as judged from haplotype-based association mapping.

Conclusions

The combined observation that a common haplotype causing an increased risk of cancer in Europeans and a high differentiation between human populations is highly unusual and suggests a causal relationship with a recent increase in Europeans caused either by genetic drift overruling selection against the susceptibility variant or a positive selection for the same haplotype. The data does not allow us to distinguish between these two scenarios. The analysis suggests that the region is not involved in cancer risk in Africans and that the susceptibility variants may be more finely mapped in Asian populations.

Mikkel and I got involved in the project to try to use our haplotype based association mapping methods to analyse data where a single marker analysis had already shown an association with several kinds of cancer.

We didn’t really discover anything new when running our tools on the data, so to try something else we combined the case/control data with HapMap data to try to increase the number of markers through imputation.

That is when we discovered that a haplotype in the region, that is found in about 50% of Europeans (CEU and our case/control data) is only found in ~1% of Africans (YRI).  Furthermore, this haplotype was the at-risk haplotype in our case/control data and looks to be the derived haplotype when compared with the chimp genome.
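To get a feeling for how large a difference this is, here is a minimal sketch of a two-population Fst computed from haplotype frequencies. It treats the haplotype as one allele of a biallelic locus and weights the two populations equally. This is a textbook heterozygosity-based definition, not necessarily the estimator used in the paper, and the function name is mine.

```python
def fst_two_populations(p1, p2):
    """Fst = (H_T - H_S) / H_T for one biallelic locus in two equally
    weighted populations with allele frequencies p1 and p2.

    H_S is the mean expected heterozygosity within populations and
    H_T the expected heterozygosity in the pooled population.
    """
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # locus is fixed for the same allele everywhere
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    # Frequencies from the text: about 50% in Europeans, about 1% in Africans.
    print(fst_two_populations(0.50, 0.01))
```

With these frequencies the value comes out at roughly 0.3, far above what is typical between human populations.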

Reference

Mikkel H Schierup, Thomas Mailund, Heng Li, Jun Wang, Anne Tjonneland, Ulla Vogel, Lars Bolund, Bjorn A Nexo (2009). Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups. BMC Medical Genetics, 10 (1). DOI: 10.1186/1471-2350-10-20


It is not all bad news

Thursday, June 19th, 2008

Okay, yeah, so I broke my iMac today, but there is also good news.  We just got another paper accepted, this time a conference paper at this year’s WABI.

Since it is on neighbour-joining, we weren’t that optimistic.  We’ve had problems publishing on this before, but this time it was very well received.

Accelerated neighbour-joining

M. Simonsen, T. Mailund and C.N.S. Pedersen

Abstract

The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The criteria for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criteria used for selecting the next pair to merge and we show empirically that the new search strategy is superior to other state-of-the-art neighbour-joining implementations.

It’s really Martin Simonsen’s work. He is a Master’s student at BiRC and in one of our algorithmics courses the students were asked to implement neighbour-joining and try to speed it up.  Usually, they come up with some clever ideas, but they never before managed to beat my own version, QuickJoin.

Martin did come up with a faster approach.  Well, pretty close to, anyway.  With my and Christian Storm’s help, we managed to fix a few things here and there, and speed his approach up to one that not only beats QuickJoin but also all other methods we could get our hands on.

QuickJoin uses a lot of tricks to speed up the search for nodes to join in the algorithm, but the data structures make it slow on small data sets and also rather memory hungry.  Martin’s approach is much simpler, and this helps it a lot on the small data sets and doesn’t seem to hurt it on the larger data sets.

As with QuickJoin, the trick is to look only at pairs of nodes that can potentially be joined and to avoid looking at pairs that we can rule out as the next pair to be joined.

Instead of using quad-trees and various functions to rule out pairs, Martin simply sorts nodes in a way where the most likely pairs are considered first, and such that we can recognize when new pairs will not be better than those we have already seen.  Read the actual paper (it is quite easy to understand the algorithm from there) if you want the details.
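For readers who want the flavour of this without opening the paper, here is a rough sketch of the general idea, not the implementation from the paper. It uses the standard neighbour-joining criterion Q(i,j) = (n-2)·d(i,j) - r(i) - r(j), where r(i) is the sum of distances from node i. The pruning bound (replace r(j) by the largest row sum) is my own illustration of how a sorted search can stop early; the function names are made up.

```python
def row_sums(d):
    """r(i): sum of distances from node i to all other nodes."""
    return [sum(row) for row in d]


def naive_search(d):
    """Scan all pairs for the minimal neighbour-joining criterion
    Q(i, j) = (n - 2) * d[i][j] - r[i] - r[j]."""
    n, r = len(d), row_sums(d)
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best


def sorted_row_search(d):
    """Same result, but each row is visited in order of increasing
    distance and abandoned once even the most favourable row sum could
    not beat the best pair seen so far. A real implementation would keep
    the rows sorted between iterations; re-sorting here keeps the sketch
    short."""
    n, r = len(d), row_sums(d)
    r_max = max(r)
    best = None
    for i in range(n):
        for dist, j in sorted((d[i][j], j) for j in range(n) if j != i):
            # Lower bound on Q for this and all later entries in the row.
            if best is not None and (n - 2) * dist - r[i] - r_max >= best[0]:
                break
            q = (n - 2) * dist - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, min(i, j), max(i, j))
    return best


if __name__ == "__main__":
    # Small made-up distance matrix over five taxa.
    d = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(naive_search(d), sorted_row_search(d))
```

The early exit is safe because entries later in a sorted row have larger distances, so their Q values can only be larger than the bound.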


Simonsen, M., Mailund, T., Pedersen, C.N. Accelerated neighbour-joining. Proceedings of WABI 2008

Heads or tails and reliable alignments

Sunday, March 23rd, 2008

I have on several occasions written about the uncertainty inherent in inferred alignments and how this is a potential problem. I hadn’t really thought it would be quite so serious as the results in the paper I just read suggest:

Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments
Giddy Landan and Dan Graur
Molecular Biology and Evolution 2007 24(6):1380-1383; doi:10.1093/molbev/msm060

Abstract

The question of multiple sequence alignment quality has received much attention from developers of alignment methods. Less forthcoming, however, are practical measures for addressing alignment quality issues in real life settings. Here, we present a simple methodology to help identify and quantify the uncertainties in multiple sequence alignments and their effects on subsequent analyses. The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such “ideal” alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol. We demonstrate the utility of HoT for phylogenetic reconstruction for the case of 130 sequences belonging to the chemoreceptor superfamily in Drosophila melanogaster, and by analysis of the BaliBASE alignment database. Surprisingly, Neighbor-Joining methods of phylogenetic reconstruction turned out to be less affected by alignment errors than maximum likelihood and Bayesian methods.

In this paper they analyse the quality of multiple sequence alignments in an extremely simple manner: They first align the sequences left to right, then reverse them to essentially align them right to left. Unless the alignment algorithm has a preferred order of symbols, you’d expect to get the same alignment going left to right as right to left.

Not always, of course: if the algorithm is based on oligonucleotides or such, then the order matters, but in many cases it doesn’t.

Comparing head and tail alignments

When the order shouldn’t matter, the left-to-right and right-to-left alignments (head and tail alignments in the paper) should be similar, so comparing them should give an indication of how much faith you can have in the inferred alignment.

They try this out on a family of 130 amino acid sequences of length around 400 using three different alignment tools. This is the result:

               ClustalW  MUSCLE  ProbCons
Columns        18.0%     8.7%    6.7%
Residue pairs  52.1%     53.7%   60.8%
Shared splits  64.6%     65.4%   59.1%

Here Columns denotes the fraction of identical columns in the two alignments, Residue pairs denotes the fraction of aligned residue pairs (in a “sum of pairs” kind of way) that are identical, and Shared splits denotes the fraction of identical splits (edges) in the BioNJ trees inferred from the two alignments.

Very few alignment columns are shared between the two alignments, but that is not that much of a problem. With 130 sequences you wouldn’t expect to match many columns exactly. I’m more surprised that the resulting pairwise alignments (the pairwise alignments you get by extracting two rows from the alignment, the identity given in Residue pairs) were so different.

It is also a bit shocking that inferred trees from the two alignments were so different.
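The first two measures are easy to compute yourself. Below is a minimal sketch of how I read them; the paper’s exact definitions may differ in details such as what the fraction is taken relative to. It assumes both alignments are given as lists of gapped strings over the same sequences in the same order, with the tails alignment already reversed back to the original orientation. The function names are mine.

```python
from itertools import combinations


def column_signatures(alignment):
    """Describe each column by the index of the residue each row puts
    there (None for a gap), so columns can be compared across alignments."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        sig = []
        for row, ch in enumerate(col):
            if ch == "-":
                sig.append(None)
            else:
                sig.append(counters[row])
                counters[row] += 1
        columns.append(tuple(sig))
    return columns


def column_score(heads, tails):
    """Fraction of columns of the heads alignment also found in tails."""
    head_cols = column_signatures(heads)
    tail_cols = set(column_signatures(tails))
    return sum(c in tail_cols for c in head_cols) / len(head_cols)


def residue_pairs(alignment):
    """All pairs of residues placed in the same column."""
    pairs = set()
    for sig in column_signatures(alignment):
        for (r1, i1), (r2, i2) in combinations(enumerate(sig), 2):
            if i1 is not None and i2 is not None:
                pairs.add((r1, i1, r2, i2))
    return pairs


def residue_pair_score(heads, tails):
    """Fraction of aligned residue pairs in heads also aligned in tails."""
    head_pairs, tail_pairs = residue_pairs(heads), residue_pairs(tails)
    return len(head_pairs & tail_pairs) / len(head_pairs) if head_pairs else 1.0


if __name__ == "__main__":
    # Toy alignments, made up for illustration.
    heads = ["C-AT", "CAAT"]
    tails = ["CA-T", "CAAT"]
    print(column_score(heads, tails), residue_pair_score(heads, tails))
```

Columns are compared by residue index rather than by letter, so two columns count as identical only if they align exactly the same positions of every sequence.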

What is causing this?

There is uncertainty in inferring alignments, but why would the same algorithm give different results when running left-to-right compared to right-to-left?

As far as I can see, there are two different things going on here. One having to do with there being more than one optimal alignment (also discussed in the paper), and one having to do with heuristics in searching for optimal alignments.

When there is more than one optimal alignment (which is often the case), even algorithms guaranteed to find an optimal alignment will give you an arbitrary one (though usually a deterministic arbitrary choice). The arbitrary choice can easily differ between running left-to-right or right-to-left.

For multiple sequence alignments, it is computationally infeasible to guarantee to compute an optimal alignment, and heuristics are used to search for (near or locally) optimal alignments. This is often some variation on a greedy strategy, and each choice there will potentially lead to a different alignment. It is easy to see how left-to-right and right-to-left alignments can be different with such a strategy.
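The tie-breaking effect is easy to reproduce with a plain pairwise global alignment. The sketch below is a textbook Needleman-Wunsch with a linear gap cost, not any of the tools from the paper. The scoring values and the tie-breaking order in the traceback (diagonal, then up, then left) are arbitrary choices of mine.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a deterministic traceback that prefers a
    diagonal move, then a gap in b, then a gap in a."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


def heads_and_tails(a, b):
    """Align as given (heads) and reversed (tails); the tails alignment
    is flipped back so the two can be compared column by column."""
    heads = needleman_wunsch(a, b)
    ta, tb, tscore = needleman_wunsch(a[::-1], b[::-1])
    return heads, (ta[::-1], tb[::-1], tscore)


if __name__ == "__main__":
    heads, tails = heads_and_tails("CAT", "CAAT")
    print(heads)
    print(tails)
```

For these two toy sequences the heads and tails alignments get the same optimal score but put the gap in different columns, purely because of the tie-breaking rule.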

In any case, the take home message is, once again: don’t trust alignments!


Landan, G., Graur, D. (2007). Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments. Molecular Biology and Evolution, 24(6), 1380-1383. DOI: 10.1093/molbev/msm060