Large scale phasing and imputing in Iceland

There is a really cool paper in the latest issue of Nature Genetics by people from deCODE:

Detection of sharing by descent, long-range phasing and haplotype imputation
Kong et al.

Nature Genetics 40, 1068 – 1075 (2008); doi: 10.1038/ng.216


Uncertainty about the phase of strings of SNPs creates complications in genetic analysis, although methods have been developed for phasing population-based samples. However, these methods can only phase a small number of SNPs effectively and become unreliable when applied to SNPs spanning many linkage disequilibrium (LD) blocks. Here we show how to phase more than 1,000 SNPs simultaneously for a large fraction of the 35,528 Icelanders genotyped by Illumina chips. Moreover, haplotypes that are identical by descent (IBD) between close and distant relatives, for example, those separated by ten meioses or more, can often be reliably detected. This method is particularly powerful in studies of the inheritance of recurrent mutations and fine-scale recombinations in large sample sets. A further extension of the method allows us to impute long haplotypes for individuals who are not genotyped.

As the abstract says, it concerns haplotype phasing and imputation, but the setup is really cool!

The case of Iceland

Iceland is a bit special. The Icelandic population is relatively small (about 300,000) and about 10% of the population has been “genome wide” genotyped at deCODE.

This is a very large fraction of the population, by any standard.

Further, the pedigree of the population is fairly well known from historical records and estimated to be both reasonably complete and reasonably accurate for the last few centuries.

Again, this is rather unique.

Now, this paper introduces a method that exploits these two facts both to infer haplotype phase and to impute genotype information for untyped individuals (yes, individuals, not just missing markers!).

Trios and trio proxies

Inferring the haplotype phase of an individual is much simplified if you know the genotypes of his parents.

For a parent-child trio, the homozygotic sites in the parents can be used to infer the phase of the heterozygotic sites in the child. If the child is heterozygotic Aa but the father is homozygotic AA, then clearly the A allele comes from the father.

This simple observation can be used to infer haplotype phase.

It won’t resolve all sites, of course, since it doesn’t help at sites that are heterozygotic in all three individuals, but it does resolve a lot of sites.
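The trio rule above can be written down in a few lines. This is only a minimal sketch of the single-site logic, not the method from the paper: the function name `phase_child_site` and the representation of a genotype as an unordered pair of allele strings are my own assumptions, and it assumes there are no genotyping errors.

```python
def phase_child_site(child, father, mother):
    """Return (paternal_allele, maternal_allele) for the child at one site,
    or None if the trio does not resolve the phase.

    Each genotype is an unordered pair of alleles, e.g. ("A", "a").
    Hypothetical helper for illustration; assumes error-free genotypes.
    """
    c0, c1 = child
    if c0 == c1:
        # Homozygous child: both haplotypes carry the same allele.
        return (c0, c0)

    # A homozygous parent can only have transmitted one allele.
    if father[0] == father[1] and father[0] in child:
        paternal = father[0]
        maternal = c1 if c0 == paternal else c0
        return (paternal, maternal)
    if mother[0] == mother[1] and mother[0] in child:
        maternal = mother[0]
        paternal = c1 if c0 == maternal else c0
        return (paternal, maternal)

    # Child and both parents heterozygous: unresolved.
    return None


if __name__ == "__main__":
    print(phase_child_site(("A", "a"), ("A", "A"), ("A", "a")))
    print(phase_child_site(("A", "a"), ("A", "a"), ("A", "a")))
```

The second call is the case mentioned above where all three are heterozygotic, so the function gives up and returns `None`.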

Now, typically you do not have trios in an association mapping study. Population-based association mapping studies require, to a large degree, that the individuals are unrelated, so you would only be able to use the parents anyway, and those are not the ones you can phase this way.

The concept of surrogate parenthood

However, if you have a genealogy for the entire population plus genotyped a large fraction of it, you have a lot of proxies for the parents.

Based on the pedigree you can figure out which typed individuals could possibly be identical by descent (IBD). By also considering which are identical by state (IBS) you can figure out which almost certainly share a haplotype.
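To make the IBS part concrete, here is a small sketch of one plausible check. I am assuming that "could share a haplotype" means the two individuals have at least one allele in common at every site in the region, i.e. they are never opposite homozygotes. The function name `could_share_haplotype` and the data layout are made up for illustration, and a real implementation would also have to tolerate genotyping errors.

```python
def could_share_haplotype(genotypes_1, genotypes_2):
    """Return True if two individuals share at least one allele (IBS >= 1)
    at every site in a region.

    Each argument is a list of genotypes, one per site, where a genotype
    is an unordered pair of alleles such as ("A", "a").
    Hypothetical helper; assumes error-free genotypes.
    """
    return all(
        set(g1) & set(g2)  # non-empty intersection: some allele in common
        for g1, g2 in zip(genotypes_1, genotypes_2)
    )


if __name__ == "__main__":
    x = [("A", "A"), ("B", "b")]
    y = [("A", "a"), ("b", "b")]
    z = [("a", "a"), ("B", "b")]
    print(could_share_haplotype(x, y))  # compatible at both sites
    print(could_share_haplotype(x, z))  # opposite homozygotes at site 0
```

A single site where one individual is AA and the other is aa is enough to rule out a shared haplotype across the region, which is why long runs of compatible sites are informative.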

Now these individuals can function as surrogate parents for each other. If any surrogate father is homozygotic AA at a site, then the haplotype inherited from the real father has the allele A.

With several surrogate parents, the real parents need not be typed, and heterozygotic sites in the parents aren’t a major problem as long as some surrogate parent is homozygotic at the site.
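The surrogate idea extends the trio rule from one parent to a set of relatives. The sketch below is my own simplification and not the algorithm from the paper: it assumes we already know which surrogates share the paternal haplotype and which share the maternal one, that genotypes are error free, and all names (`shared_allele`, `phase_with_surrogates`) are hypothetical.

```python
def shared_allele(surrogates, site):
    """Allele carried by the shared haplotype at `site`, taken from the
    first surrogate that is homozygous there, or None if none is."""
    for genotypes in surrogates:
        a, b = genotypes[site]
        if a == b:
            return a
    return None


def phase_with_surrogates(proband, paternal, maternal):
    """Phase the proband site by site.

    `proband` is a list of genotypes (unordered allele pairs); `paternal`
    and `maternal` are lists of surrogate genotype lists of the same length.
    Returns a list of (paternal_allele, maternal_allele), with None at
    sites that no surrogate resolves.
    """
    phased = []
    for site, (a, b) in enumerate(proband):
        if a == b:
            phased.append((a, a))  # homozygous: nothing to resolve
            continue
        pat = shared_allele(paternal, site)
        mat = shared_allele(maternal, site)
        if pat in (a, b):
            phased.append((pat, b if a == pat else a))
        elif mat in (a, b):
            phased.append((b if a == mat else a, mat))
        else:
            phased.append(None)  # every surrogate is heterozygous here
    return phased


if __name__ == "__main__":
    proband = [("A", "a"), ("B", "b"), ("C", "c")]
    paternal = [
        [("A", "a"), ("B", "B"), ("C", "c")],
        [("A", "A"), ("B", "b"), ("C", "c")],
    ]
    maternal = [[("A", "a"), ("B", "b"), ("C", "c")]]
    print(phase_with_surrogates(proband, paternal, maternal))
```

In the example, neither paternal surrogate resolves every site alone, but together they resolve the first two. The third site stays unphased because everyone is heterozygotic there.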

The relationship between sample size and the yield of LRP

You do need a large fraction of the population genotyped for this to work, though. Perhaps not as much as 10% but a few percent seems to be necessary.

You probably do not need the pedigree to go back several centuries, but a few generations are probably necessary. I do not know how much of the pedigree you can infer directly from the data or if that defeats the purpose…

Inferring missing individuals

A really cool thing they can do based on this method is to impute the haplotypes for individuals not even typed at all.

This is different from imputing missing genotypes, something that has gotten very popular in association mapping in the last couple of years. There, the idea is that you infer missing markers and test those for association, as an alternative to haplotype association tests.

The idea here is that individuals not typed at all, but present in the pedigree, can have their genotypes inferred.

Now, if you have phenotype information (e.g. disease status) for individuals in the pedigree that you haven’t typed, you would still be able to use them in an association mapping project.

Even if you do not, you could still use them. You would just have to consider your controls as population controls rather than “disease free” controls.

With this approach you might be able to work on data sets with hundreds of thousands of individuals rather than a “mere” tens of thousands.

Augustine Kong, Gisli Masson, Michael L Frigge, Arnaldur Gylfason, Pasha Zusmanovich, Gudmar Thorleifsson, Pall I Olason, Andres Ingason, Stacy Steinberg, Thorunn Rafnar, Patrick Sulem, Magali Mouy, Frosti Jonsson, Unnur Thorsteinsdottir, Daniel F Gudbjartsson, Hreinn Stefansson, Kari Stefansson (2008). Detection of sharing by descent, long-range phasing and haplotype imputation. Nature Genetics, 40 (9), 1068-1075. DOI: 10.1038/ng.216

Google Chrome

I was just told about this post.  Apparently, Google is developing its own browser.

It is described in this cartoon that features people I know here from Google in Aarhus such as Lars Bak and Kasper Verdich.

I didn’t know what they were working on. A lot of them were working on virtual machines for mobile phones and such before going to Google, so I thought it was something like that, but apparently not.

It’s still virtual machines, though, from what I get from the cartoon.