Posts Tagged ‘population genetics’

On segment lengths, going back in time, in the coalescence process. Part 3: Segment lengths of MRCAs of two species

Sunday, December 27th, 2009

A while ago I wrote the first two posts in this “series” on the coalescent process and how genomic segments behave as a dynamic system back in time.  Read Part 1 (about a single sequence) here and Part 2 (about two sequences) here.  I wanted to write quite a bit more about it, but what happened was that I managed to translate it into a method for population genomics and I was busy working on that and writing a paper about it so I left the blogging for a while.  The paper is now under review, and I have time to write about some of the ideas here again.

As you may recall, we can think of the ancestry of a genomic sequence as a dynamic system that, when run back in time, splits up into different lineages through recombination and then merge again through coalescence events.  This is just the classical coalescence process with recombination, and the result of running it is the so-called Ancestral Recombination Graph, or ARG.

We can use this model both at the population level, but also at the inter-species level (where the ancestries just go back a lot longer in time).

The mathematics is fairly easy to deal with when simulating — it is a relatively simple continuous time Markov chain (CMTC) — but because the state space is rather large (infinite if you consider a continuous segment as the genomic segment), inference is rather difficult as is working out some of the properties of the system.  For example the different properties I wrote about in Part 2.

Markov approximations

In the work we are doing here at BiRC on “ancestral population genomics” we work around this problem by making a simplifying assumption.  We assume that the process is not only a Markov chain in time, but also along the genomic segment.  What this means is that instead of dealing with the enormous — or infinite — state space of the coalescent with recombination, we can consider only neighbouring pairs of nucleotides, work out their dynamics, and extrapolate from there using the Markov assumption.

In Part 1, I described a simple example of this, when considering the dynamics of a single genome going back in time, and described how the Markov assumption is incorrect, but still reasonable in approximating the coalescent process.

In this post I am going to approximate the ancestry of two (haploid) genomes through a CTMC modeling two neighbouring nucleotides from each of the genomes.

The states of the system is shown below:

CTMC statesEssentially the states captures all combinations of a left nucleotide being linked with a right nucleotide (a black line or no line) between the two sequences (top black dots and bottom black dots) and the nucleotides having found a most common recent ancestor (MRCA, the white dots) or not (the black dots).

You can move between the states through recombination events — that remove the links between right and left nucleotides — or coalescence events — that either adds a link between a left and right nucleotide or joins two left of two right nucleotides in their MRCA.

The rate matrix for the system is shown below, where the blank cells are zero entries and the “-” cells contain -1 times the row sums, as usual.

Rate matrix

Here C is the coalescent rate and R the recombination rate.  Typically, we use a time scale in coalescent theory of 2 N_e generations per time unit in which case C would be 1.  If N_e is 10,000 as for humans, then a recombination rate of 1 cM/Mb would be a rate of R=0.004 on that time scale.

When considering two genomes back in time, we start in one of the states in \Omega_B — the “beginning states”.  If we have two haploid genomes and we start the system at time zero, we would probably start in state 4 where both the genomes are linked, but if we consider an inter-species system we would get some probability distribution over the states in \Omega_B from running the single sequence system from Part 1.

The states in \Omega_L (or \Omega_R) corresponds to states where only the left (or only the right) nucleotides have found their MRCA while the right (or left) have not.  The “end states”, \Omega_E contains the two states where both left and right nucleotides have found their MRCA, and is essentially just the single sequence system from Part 1.

Running the system back in time, the probability of being in one of these four classes of states evolves as this:

Dynamics of the CTMCBeing in either \Omega_L or \Omega_R is always somewhat unlikely, since it requires that a recombination has decoupled a left and right nucleotide before finding a MRCA, and it is more likely that the left and right nucleotides find their MRCA in the same coalescent event, moving directly from \Omega_B to \Omega_E.

Deep and shallow coalescent segments

With a finite state CTMC as the one above, we can compute pretty much everything of interest, and there is an extensive theory for how to manipulate these systems.  If we want to say anything about the ancestry of genomes, though, we have to remember that the system only deals with two nucleotides, which in anyone’s book is quite a bit shorter than a genome.

To get from a pair of nucleotides to a full genome, we use the Markov assumption.  When modeling n nucleotides, we assume that we can model the ancestry of number n conditional on the previous n-1 by only conditioning on number n-1, and then we can compute the probabilities of those ancestries using our two-nucleotide CTMC.

For the rest of this post, I’ll give an example of this, and then give other examples in following posts.  When the paper I have under review now is out, I’ll also explain how this CTMC can be used to construct a CoalHMM for population genomics.

Anyway, the example: Let us consider the human-chimp-gorilla trio.  We know that these three species are very closely related, to the point where there is extensive incomplete lineage sorting.  This happens when we have deep coalescences, compared to the speciation times.  If the human and chimp lineages have not found their MRCA before the gorilla speciation (when considering the process back in time), then it is possible that the human coalesces with the gorilla before the chimp, giving us a ((human,gorilla),chimp) genealogy, and similarly for chimp and gorilla coalescing before reaching the human.

Some things about this system are easy to work out from the coalescence process. Working out the probabilities for deep coalescences and incomplete lineage sorting is pretty simple.  The coalescence process has a simple exponential distribution, so figuring out the probability that the human and chimp have not coalesced before the gorilla speciation (as a function of effective population sizes and speciation times) is straightforward.

If the lineages have not coalesced before the gorilla speciation, then there are three possible topologies for the genealogy, ((human,chimp),gorilla), ((human,gorilla),chimp) and ((chimp,gorilla),human), each equally likely, and only one of which, ((human,chimp),gorilla), is congruent with the species tree.

Congruence graphSo, e.g. to work out the probability of seeing a gene tree equal to the species tree, you just need to work out the probability that human and chimp coalesce before or after the speciation time.  The gene tree is congruent if they do, or 1/3 of the times when they do not.

Congruence equationAll in all pretty simple, but it is only telling us how many nucleotides we expect to see congruent with the species tree, not anything about how they are distributed along the genome.

Here, I am going to consider the slightly simpler question of segments above or below the speciation time.  It is straightforward to work out the probability of seeing a nucleotide above \Pr(T>\tau) or below \Pr(T\leq\tau) the speciation time, but what is the pattern of segments above or below along the genome?

Let’s call a contiguous segment of nucleotides that all coalesce further back than the speciation gorilla time a “deep coalescence segment” and a contiguous segment of nucleotides that all coalesce before the gorilla speciation a “shallow coalescence segment”.

Oh, by the way, before above means closer to the present … it is confusing, I know, but time runs backwards in coalescence theory.

We can use the Markov approximation to work out the mean length of deep and shallow coalescence segments as a function of the speciation time.

Under the Markov approximation, we can argue as follows: For a deep coalescence, consider a left nucleotide in a deep state.  The nucleotide to the right of it is either also deep, or it is shallow.  If it is deep, then the next one can be deep or shallow, and so on.  Under the Markov approximation, the probability of moving from deep to shallow is the same in each such step, and the length of a deep segment is given by the waiting time for moving from deep to shallow, a waiting time that is geometric distributed.

For the CTMC states, we have the left nucleotide in a deep state if we are in one of the states in the classes \Omega_L or \Omega_B at the speciation time, while we have the left nucleotide in a shallow state if we are in one of the states int he classes \Omega_R or \Omega_E.

If the left nucleotide is in a deep state, then the right is also in a deep state with probability \frac{\Pr(\Omega_E)}{\Pr(\Omega_R)+\Pr(\Omega_E)} while the right is in a shallow state with probability \frac{\Pr(\Omega_R)}{\Pr(\Omega_R)+\Pr(\Omega_E)}.  Similarly for when the left nucleotide is in a shallow state.

The mean length of a deep coalescence segment is then 1 / \frac{\Pr(\Omega_R)}{\Pr(\Omega_R)+\Pr(\Omega_E)} while the mean length of a shallow coalescence segment is  1 / \frac{\Pr(\Omega_L)}{\Pr(\Omega_L)+\Pr(\Omega_B)}.

Very easy to compute with the CTMC, but pretty hard from the coalescence process.

Of course, it is only an approximation, so we should worry a bit about how accurate it is.  I’ve simulated some data sets using the coalescence process, using a human-chimp divergence of 4.5 mya, a present day effective population size of humans and chimps of 10,000 or 50,000 and an effective population size of the human-chimp ancestor of 50,000 or 100,000.  The plots below shows the distribution of deep and shallow segments as the boxplots, the mean length of these as the green bullet, and the expected mean lengths as computed above as the blue line.

10K-50K10K-100K50K-50K50K-100KAll in all, it is a pretty good fit, so the approximation looks okay.

If we assume that the present day Ne of humans and chimps is 20,000 (a bit high for humans and a bit low for chimps, but on average okay) and that the human-chimp ancestral Ne is 50,000 (as often estimated) then with a gorilla branching off 1 milion years before the human-chimp speciation, we expect that segments that coalesce deeper than the speciation are ~116bp while segments that coalesce before are ~76bp.

Of course, this is only on average, and the true picture is more complex, but it does tell us that the genomic relationship between these three species is somewhat complex with a lot of short genomic fragments with MRCA in the human-chimp ancestral species or with MRCA in the shared African ape ancestor.

Coolest paper in a while, and I feel left out…

Wednesday, March 25th, 2009

This paper seems to be the coolest population genetics out in a while:

I have downloaded it, but I just haven’t had time to read it yet, and probably won’t have time to read it until the weekend.  I’m swamped with work, and I have three journal clubs Thursday and Friday, so the reading I do have time for goes to the papers for those.

I just can’t believe that I am postponing reading this paper to read about fungi genomics, of all things…

84-102=-18

New paper out

Wednesday, March 4th, 2009

We just got a new paper out yesterday in BMC Medical Genetics:

Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups

Schierup et al.

BMC Medical Genetics 2009, 10:20 doi:10.1186/1471-2350-10-20

Abstract

Background

A small region of about 70 kb on human chromosome 19q13.3 encompasses 4 genes of which 3, ERCC1, ERCC2, and PPP1R13L (aka RAI) are related to DNA repair and cell survival, and one, CD3EAP, aka ASE1, may be related to cell proliferation. The whole region seems related to the cellular response to external damaging agents and markers in it are associated with risk of several cancers.

Methods

We downloaded the genotypes of all markers typed in the 19q13.3 region in the HapMap populations of European, Asian and African descent and inferred haplotypes. We combined the European HapMap individuals with a Danish breast cancer case-control data set and inferred the association between HapMap haplotypes and disease risk.

Results

We found that the susceptibility haplotype in our European sample had increased from 2 to 50 percent very recently in the European population, and to almost the same extent in the Asian population. The cause of this increase is unknown. The maximal proportion of overall genetic variation due to differences between groups for Europeans versus Africans and Europeans versus Asians (the Fst value) closely matched the putative location of the susceptibility variant as judged from haplotype-based association mapping.

Conclusions

The combined observation that a common haplotype causing an increased risk of cancer in Europeans and a high differentiation between human populations is highly unusual and suggests a causal relationship with a recent increase in Europeans caused either by genetic drift overruling selection against the susceptibility variant or a positive selection for the same haplotype. The data does not allow us to distinguish between these two scenarios. The analysis suggests that the region is not involved in cancer risk in Africans and that the susceptibility variants may be more finely mapped in Asian populations.

Mikkel and I got involved in the project to try to use our haplotype based association mapping methods to analyse data where a single marker analysis had already shown an association with several kinds of cancer.

We didn’t really discover anything new when running our tools on the data, so to try something else we combined the case/control data with HapMap data to try to increase the number of markers through imputation.

That is when we discovered that a haplotype in the region, that is found in about 50% of Europeans (CEU and our case/control data) is only found in ~1% of Africans (YRI).  Furthermore, this haplotype was the at-risk haplotype in our case/control data and looks to be the derived haplotype when compared with the chimp genome.

Reference

Mikkel H Schierup, Thomas Mailund, Heng Li, Jun Wang, Anne Tjonneland, Ulla Vogel, Lars Bolund, Bjorn A Nexo (2009). Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups BMC Medical Genetics, 10 (1) DOI: 10.1186/1471-2350-10-20

63-82=-19

How much selection is going on in humans?

Saturday, January 17th, 2009

A priori we expect that most mutations, by far, have no consequence on fitness, while some have a negative effect and very few have a positive effect.  Consequently, we can generally ignore selection when analysing genomic sequences.

However, over the last few years a number of papers have suggested that adaptive (positive) selection has played a major role in shaping the human genome. That is, genome-wide there are signals that shows patterns of selection. So perhaps we shouldn’t be so quick to ignore it.

Yesterday in PLoS Genetics there is another paper arguing this:

Pervasive Hitchhiking at coding and regulatory sites in humans

Cai et al. PLoS Genetics 5(1)

Abstract

Much effort and interest have focused on assessing the importance of natural selection, particularly positive natural selection, in shaping the human genome. Although scans for positive selection have identified candidate loci that may be associated with positive selection in humans, such scans do not indicate whether adaptation is frequent in general in humans. Studies based on the reasoning of the MacDonald–Kreitman test, which, in principle, can be used to evaluate the extent of positive selection, suggested that adaptation is detectable in the human genome but that it is less common than in Drosophila or Escherichia coli. Both positive and purifying natural selection at functional sites should affect levels and patterns of polymorphism at linked nonfunctional sites. Here, we search for these effects by analyzing patterns of neutral polymorphism in humans in relation to the rates of recombination, functional density, and functional divergence with chimpanzees. We find that the levels of neutral polymorphism are lower in the regions of lower recombination and in the regions of higher functional density or divergence. These correlations persist after controlling for the variation in GC content, density of simple repeats, selective constraint, mutation rate, and depth of sequencing coverage. We argue that these results are most plausibly explained by the effects of natural selection at functional sites—either recurrent selective sweeps or background selection—on the levels of linked neutral polymorphism. Natural selection at both coding and regulatory sites appears to affect linked neutral polymorphism, reducing neutral polymorphism by 6% genome-wide and by 11% in the gene-rich half of the human genome. These findings suggest that the effects of natural selection at linked sites cannot be ignored in the study of neutral human polymorphism.

Selection and variation

Neutral mutations are expected to behave differently from non-neutral mutations mainly in their chance to get fixed and the time it takes them to get fixed in a population.  Neutral mutations that gets fixed are expected to have taken a number of generations linear in the effective population size, while mutants under selection that gets fixed are expected to have taken a logarithmic number of generations. That goes for both positive and negative selection, but for different reasons.

If we consider a region and assume that there is no recombination going on, and we assume that a new mutation appears here destined to get fixed in the population.  When it gets fixed, all individuals in the population will be descendent from the individual that first carried the mutation.  They will not be identical at the region, though, ’cause new mutations will have accumulated in the time it took the mutation to get fixed.

The amount of variation in the population will depend on how quickly the mutation got fixed.  If it happened very slow, we expect much variation, and if it happened very rapidly, we expect little variation.

That, combined with the expected time to fixation for neutral and selected mutants gives us a pattern to look for that distinguishes between neutral evolution and selection.

Variation, selection and recombination

When recombination is going on, we expect a slightly different pattern.

If there is selection on several mutations in the region, it gets pretty complicated.  At least I haven’t managed to quite get my head around the details yet, but I’ll refer you to this book: Population Genetics of Multiple Loci by Freddy Christiansen.

I will just assume that there is a single mutation under selection.

In that case, the pattern really is very similar.  We don’t expect to see reduced variation in the entire region around the mutation, but instead we expect reduced variation close to the mutation site — where few recombination events have occurred while the mutation got fixed — and increased variation up to the neutral level as we move away from the mutation site — where more and more recombinations have uncoupled the mutant from sites further away.

Selection in humans

It is this kind of patterns they look for in the PLoS Genetics paper.

A consequence of all of the above is that, assuming lots of selection is going on, we expect a positive correlation between variation and recombination sites, and a negative correlation between variation and sites we a priori expect to be functional (like genes).

This is exactly what they find.

There are a few more details to it, of course.  Density of functional sites, recombination and mutation rates are not independent, so we could see exactly the same pattern just from the correlation with neutral mutation rates, so they need to correct for this.

Essentially, though, it is these patterns — expected assuming selection but not assuming neutrality — that they find.


James J. Cai, J. Michael Macpherson, Guy Sella, Dmitri A. Petrov (2009). Pervasive Hitchhiking at Coding and Regulatory Sites in Humans PLoS Genetics, 5 (1) DOI: 10.1371/journal.pgen.1000336
17-30=-13

Ancient DNA analysis of the Icelandic settlers

Friday, January 16th, 2009

I’ve just finished reading this paper in PLoS Genetics:

Sequences From First Settlers Reveal Rapid Evolution in Icelandic mtDNA Pool Helgason et al. PLoS Genetics, 5 (1) DOI: 10.1371/journal.pgen.1000343

Abstract

A major task in human genetics is to understand the nature of the evolutionary processes that have shaped the gene pools of contemporary populations. Ancient DNA studies have great potential to shed light on the evolution of populations because they provide the opportunity to sample from the same population at different points in time. Here, we show that a sample of mitochondrial DNA (mtDNA) control region sequences from 68 early medieval Icelandic skeletal remains is more closely related to sequences from contemporary inhabitants of Scotland, Ireland, and Scandinavia than to those from the modern Icelandic population. Due to a faster rate of genetic drift in the Icelandic mtDNA pool during the last 1,100 years, the sequences carried by the first settlers were better preserved in their ancestral gene pools than among their descendants in Iceland. These results demonstrate the inferential power gained in ancient DNA studies through the application of population genetics analyses to relatively large samples.

The paper has already been discussed by MacArthur at Genetic Future and Razib at Gene Expression so I will only give a very short review here.

Iceland was settled in the late first millennium by Vikings (replacing settlements by Irish monks) bringing with them Irish and Scottish slaves.

Iceland has been pretty isolated since the early settlement. Consequently, it is a very homogeneous population — at least genetically speaking — and is one of the best studied populations, not least by deCODE Genetics.

Analysis of contemporary DNA shows that the mitrocondrial DNA (mtDNA) is primarily of Scottish/Irish decent while Y chromosomes are primarily Scandinavian. You can probably draw your own conclusions from that.

What is new in the PLoS paper is that they have sequenced ancient mtDNA from medieval skeletons and compared them with contemporary samples from Iceland, Scandinavia and Scotland and Ireland.

The data doesn’t change much with respect to the genetic origin of the Icelanders, but an interesting finding is that the medieval Icelanders are genetically closer related to present day samples from the source populations than they are to the present day Icelanders.

In other words, the Icelanders have evolved faster.

No, this doesn’t mean that they are mutating faster or that selection has had a hand in this.  It can quite easily be explained by the isolation of the relatively small population.

Drift, one of the driving forces of evolution, simply works much faster on small (effective) population sizes than it does on larger sizes.

Drift is essentially a question of random sampling, and with smaller effective population sizes the sampling is more random than in larger effective populations.  This is very well explained in Razib’s post, so I will direct you there instead of repeating the arguments here.


Agnar Helgason, Carles Lalueza-Fox, Shyamali Ghosh, Sigrún Sigurðardóttir, Maria Lourdes Sampietro, Elena Gigli, Adam Baker, Jaume Bertranpetit, Lilja Árnadóttir, Unnur Þorsteinsdottir, Kári Stefánsson (2009). Sequences From First Settlers Reveal Rapid Evolution in Icelandic mtDNA Pool PLoS Genetics, 5 (1) DOI: 10.1371/journal.pgen.1000343

16-27=-11