A map of recent selection in humans


I am currently involved in a study where we have a gene showing both disease association and high differentiation between Africans and Europeans/Asians (as far as we can see from HapMap data). Sorry, I cannot give more details right now.

Anyway, because of this study I finally got around to reading this paper:

A Map of Recent Positive Selection in the Human Genome

Voight BF, Kudaravalli S, Wen X, Pritchard JK.
PLoS Biology 2006 4(3): e72 doi:10.1371/journal.pbio.0040072


The identification of signals of very recent positive selection provides information about the adaptation of modern humans to local conditions. We report here on a genome-wide scan for signals of very recent positive selection in favor of variants that have not yet reached fixation. We describe a new analytical method for scanning single nucleotide polymorphism (SNP) data for signals of recent selection, and apply this to data from the International HapMap Project. In all three continental groups we find widespread signals of recent positive selection. Most signals are region-specific, though a significant excess are shared across groups. Contrary to some earlier low resolution studies that suggested a paucity of recent selection in sub-Saharan Africans, we find that by some measures our strongest signals of selection are from the Yoruba population. Finally, since these signals indicate the existence of genetic variants that have substantially different fitnesses, they must indicate loci that are the source of significant phenotypic variation. Though the relevant phenotypes are generally not known, such loci should be of particular interest in mapping studies of complex traits. For this purpose we have developed a set of SNPs that can be used to tag the strongest ~250 signals of recent selection in each population.

I knew of the results already from a talk by Jonathan Pritchard that I attended this summer, but I hadn’t read the paper until now.

The idea is pretty neat: by looking at the haplotypes around a SNP, and how they break down with distance from the SNP, you can spot alleles that have risen rapidly from low to higher frequency. Such an allele sits on unusually long, unbroken haplotypes because recombination has not had time to break them up, and these SNPs are candidates for being under selection.

This is illustrated nicely in Figure 1 from the paper:

Break-down of haplotypes around a SNP


A) Decay of haplotypes in a single region in which a new selected allele (red, center column) is sweeping to fixation, replacing the ancestral allele (blue). Horizontal lines are haplotypes; SNP positions are marked below the haplotype plot using blue for SNPs with intermediate allele frequencies (minor allele >0.2), and red otherwise. For a given SNP, adjacent haplotypes with the same color carry identical genotypes everywhere between that SNP and the central (selected) site. The left- and right-hand sides are sorted separately. Haplotypes are no longer plotted beyond the points at which they become unique.

B) Decay of haplotype homozygosity for ten replicate simulations. When the core SNP is neutral (σ = 0; left side) the haplotype homozygosity decays at similar rates for both ancestral and derived alleles. When the derived alleles are favored (σ = 2Ns = 250; right side), the haplotype homozygosity decays much slower for the derived alleles than for the ancestral alleles. The discrepancy in the overall areas spanned by these two curves forms the basis of our test for selection (iHS).
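To make the haplotype homozygosity idea concrete, here is a minimal Python sketch. It is not the authors' code: the function names (`ehh`, `ihh`), the toy haplotypes, and the unit spacing between SNPs are all assumptions made for illustration. The paper integrates over genetic distance, truncates the curve when the homozygosity gets low, and standardizes the score by allele frequency; none of that is done here.

```python
from collections import Counter
from math import comb, log


def ehh(haplotypes, core, allele, upto):
    """Haplotype homozygosity among carriers of `allele` at index `core`.

    Returns the probability that two randomly chosen carriers are
    identical at every SNP between `core` and `upto` (inclusive).
    """
    lo, hi = min(core, upto), max(core, upto)
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return 0.0
    counts = Counter(carriers)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


def ihh(haplotypes, core, allele, positions):
    """Trapezoid-rule area under the homozygosity curve, both sides of the core."""
    total = 0.0
    for step in (-1, 1):
        i = core
        while 0 <= i + step < len(positions):
            a = ehh(haplotypes, core, allele, i)
            b = ehh(haplotypes, core, allele, i + step)
            total += 0.5 * (a + b) * abs(positions[i + step] - positions[i])
            i += step
    return total


# Toy data (hypothetical): '1' is the derived allele, core SNP at index 2.
haps = ["11111", "11111", "11111", "11110",
        "01001", "10000", "00010", "11001"]
positions = [0, 1, 2, 3, 4]  # assumed unit spacing

ihh_derived = ihh(haps, 2, "1", positions)
ihh_ancestral = ihh(haps, 2, "0", positions)

# Unstandardized score: negative when the derived allele sits on
# longer haplotypes than the ancestral allele.
unstandardized_ihs = log(ihh_ancestral / ihh_derived)
print(ihh_derived, ihh_ancestral, unstandardized_ihs)
```

In the toy data the derived haplotypes are nearly identical while the ancestral ones differ almost everywhere, so the derived area is larger and the log ratio comes out negative.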

The citation was (for the benefit of Research Blogging):
Voight, B., Kudaravalli, S., Wen, X., Pritchard, J. (2006). A Map of Recent Positive Selection in the Human Genome. PLoS Biology, 4(3).

A deCODEme ad

I just saw this deCODEme ad on Eye on DNA:

It was a bit funny seeing the place I’ve visited so often on YouTube, and to see people I’ve met up there describe deCODEme. (I’ve had some discussions with Agnar and talked to Hakon a few times, but my work when I’m visiting DeCODE does not involve them, so I do not know them that well, but still).

My work there has nothing to do with deCODEme, though; it is about genome-wide association mapping. You can read about it on the PolyGene homepage.

“Identical” twins

Now there’s a study that shows that identical (monozygotic) twins do not have identical genomes (I spotted it here at DNA Direct talk — I’m getting a lot of science news now that I follow the DNA network).

The genomes are pretty close, but not identical. There seems to be a lot of structural variation between them.

I guess it doesn’t surprise me all that much, even if it looks like a major discovery. Considering that the cells within an individual have almost but not quite identical genomes, I would be very surprised if twins’ genomes were identical.

For reading about the somatic cell differences, this is an excellent paper:

Genomic Variability within an Organism Exposes Its Cell Lineage Tree

Frumkin D, Wasserstrom A, Kaplan S, Feige U, Shapiro E

Genomic Variability within an Organism Exposes Its Cell Lineage Tree. PLoS Comput Biol 1(5): e50 doi:10.1371/journal.pcbi.0010050


What is the lineage relation among the cells of an organism? The answer is sought by developmental biology, immunology, stem cell research, brain research, and cancer research, yet complete cell lineage trees have been reconstructed only for simple organisms such as Caenorhabditis elegans. We discovered that somatic mutations accumulated during normal development of a higher organism implicitly encode its entire cell lineage tree with very high precision. Our mathematical analysis of known mutation rates in microsatellites (MSs) shows that the entire cell lineage tree of a human embryo, or a mouse, in which no cell is a descendent of more than 40 divisions, can be reconstructed from information on somatic MS mutations alone with no errors, with probability greater than 99.95%. Analyzing all ~1.5 million MSs of each cell of an organism may not be practical at present, but we also show that in a genetically unstable organism, analyzing only a few hundred MSs may suffice to reconstruct portions of its cell lineage tree. We demonstrate the utility of the approach by reconstructing cell lineage trees from DNA samples of a human cell line displaying MS instability. Our discovery and its associated procedure, which we have automated, may point the way to a future “Human Cell Lineage Project” that would aim to resolve fundamental open questions in biology and medicine by reconstructing ever larger portions of the human cell lineage tree.

The applications for analysing genetic diseases that the researchers mention still make this an interesting result, if only you can find sufficient twin pairs with one affected and one unaffected twin…

Major association mapping software release

Today I am releasing new versions of about half my association software. It’s been a while since I released new versions of any of these tools, and in the meantime they’ve become more and more integrated, making it harder to release them independently. Since we needed to use them all up in Iceland when we last visited DeCODE (myself in December and three others from my group in January), we needed to get all the software synchronized anyway, so I wanted to take that opportunity to make a major release.

I had planned to make the release close to New Year, so I code-named it the New Year release. I think I should re-name it to the Chinese New Year release. That is close enough that I can defend it.

Of course, it is even closer to the next planned release, the Happy Birthday release, which was supposed to come out tomorrow (on my birthday, of course). That release is likely to be delayed a couple more weeks, but I am sticking to the code name.

By the way, you can see the road-map for that here.

The software release consists of the following:

SNPFile: a library and API for manipulating large SNP datasets with associated meta-data, such as marker names, marker locations, individuals’ phenotypes, etc. in an I/O efficient binary file format. Version 2.0 adds a completely new serialization framework for storing meta-data. The previous one, based on Boost serialization, wasn’t binary compatible across platforms; the new one is. We also add a Python module for manipulation of SNPFiles, version 1.0 of that.

SMA: tools for single marker association tests. Currently there are three tools, two for case/control data and one for quantitative traits. Version 1.2 extends the tools with options for doing both genotype and allelic (additive) tests.

Blossoc (BLOck aSSOCiation): Blossoc is a linkage disequilibrium association mapping tool that attempts to build (perfect) genealogies for each site in the input, scores these according to non-random clustering of affected individuals, and judges high-scoring areas as likely candidates for containing disease-affecting variation. Building the local genealogy trees is based on a number of heuristics that are not guaranteed to build true trees, but have the advantage over more sophisticated methods of being extremely fast. Blossoc can therefore handle much larger datasets than more sophisticated tools, at the cost of some accuracy. Version 1.3 adds methods for scanning for quantitative traits and is tightly integrated with SNPFile.

HapCluster: a Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium mapping, described in detail in:

Fine Mapping of Disease Genes via Haplotype Clustering. E.R.B. Waldron, J.C. Whitaker and D.J. Balding. Genetic Epidemiology. 30: 170–179. (2006)

HapCluster is a tool I develop in collaboration with David Balding’s group at Imperial College London. Version 2.2 is basically just integration with SNPFile 2.0. The next major development of HapCluster is what I have planned for the Happy Birthday release.
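As an illustration of the allelic case/control test mentioned under SMA above, here is a minimal Python sketch of the textbook version: a Pearson chi-square test on the 2x2 table of allele counts. This is not SMA's code or interface; the function name `allelic_test` and the genotype counts are hypothetical, and the sketch assumes both alleles are observed (a monomorphic marker would give a zero expected count).

```python
from math import erfc, sqrt


def allelic_test(case_genotypes, control_genotypes):
    """Chi-square test (1 df) on allele counts in cases versus controls.

    Each argument is a tuple of genotype counts (AA, Aa, aa).
    Returns (chi-square statistic, p-value).
    """
    def allele_counts(genotypes):
        hom_major, het, hom_minor = genotypes
        return 2 * hom_major + het, 2 * hom_minor + het

    table = [allele_counts(case_genotypes), allele_counts(control_genotypes)]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical genotype counts (AA, Aa, aa).
chi2, p = allelic_test((30, 50, 20), (20, 50, 30))
print(chi2, p)
```

A genotype test would instead use the 2x3 table of genotype counts directly, with two degrees of freedom.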

MCMC diagnostics for phylogenies

I often use Markov Chain Monte Carlo (MCMC) methods in my research, but I still treat it a bit like magic. Sometimes it works great, and sometimes getting a chain to mix or converge in reasonable time is near impossible. The papers and textbooks I’ve read on the topic more often than not teach me tricks that work on real numbers or vectors in Euclidean space, but the typical setting for me is a discrete state space (or a mix of continuous and discrete parameters), and I cannot find much in the literature to help me with that.

Just something as simple as checking convergence of a chain, or estimating the effective sample size, is giving me problems.
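For a scalar trace from the chain (say the log-likelihood, or an indicator for some feature of the discrete state), a rough effective sample size can be estimated from the autocorrelations. The Python sketch below is only an illustration under simple assumptions: it sums autocorrelations until the first non-positive one, which is cruder than the estimators used in dedicated packages, and it says nothing about mixing in parts of the state space that the scalar summary does not see. The function names are made up for the example.

```python
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return 0.0
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def effective_sample_size(x):
    """n / (1 + 2 * sum of autocorrelations), truncated at the first
    non-positive autocorrelation."""
    n = len(x)
    tau = 1.0
    for lag in range(1, n):
        rho = autocorrelation(x, lag)
        if rho <= 0:
            break
        tau += 2 * rho
    return n / tau


# A sticky AR(1)-style trace versus independent draws (simulated data).
rng = random.Random(1)
sticky = [0.0]
for _ in range(999):
    sticky.append(0.95 * sticky[-1] + rng.gauss(0, 1))
independent = [rng.gauss(0, 1) for _ in range(1000)]

print(effective_sample_size(sticky), effective_sample_size(independent))
```

The sticky trace should report far fewer effective samples than its 1000 iterations, which is the warning sign to look for.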

Just now I saw a tool that could have helped me earlier, had it existed at the time:

AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics
Nylander et al.
Bioinformatics 2008 24(4):581-583; doi:10.1093/bioinformatics/btm388

Of course, I would be happier with an R package or Python module than what looks like an unholy mix of scripts, but beggars can’t be choosers.