Posts Tagged ‘DeCODE’

Bad news for deCODE?

Wednesday, August 12th, 2009

Oh this looks pretty bad.  I knew the economic situation was bad at deCODE but I thought it was improving.

I refuse to believe that deCODE will just close down.  If you follow the link above you will find a press release from Kari Stefansson saying that they want to focus on diagnostics (personal genetics, probably).  I'm not sure how realistic that is, though.  I have no idea how well deCODEme is running.

If all else fails they could probably keep running as a private research institution if they can pull in the grants.

If they do close down it would be such a pity.  They have been insanely productive in genetics research and probably have much much more to offer.

Plus, I just submitted a grant proposal together with deCODE, so on a personal level I want them to survive as well ;)


Last week in the blogs

Monday, April 6th, 2009

Another week, another list...

Blogging

DeCODE

Evolution

Open science

Phylogenetic inference

Programming


New genetic variant discovered, associated with bladder cancer

Monday, September 15th, 2008

I just saw this press release: deCODE and Radboud University Discover Common Variants in the Human Genome Conferring Risk of Bladder Cancer

We, here at BiRC, actually collaborate with both deCODE and Radboud Uni in the EU PolyGene project. The bladder cancer analysis is not part of PolyGene, but through the collaboration we have access to it, and we have just started analysing it.

We weren't in on the initial analysis, though, so we are not part of this discovery. We only get access to data after they have already mined what they can find themselves. A bit annoying, but perfectly reasonable. Our contribution to the collaboration is methods development, and anything they can find with the methods they already have, they do not really need us for.

Still, it would have been nice to be in on the analysis from the beginning. Whenever we get our hands on the data, we always get excited about hits only to discover that they are already submitted for publication.

Anyway, nice to see that they get something out of the data.

Large scale phasing and imputing in Iceland

Tuesday, September 2nd, 2008

There is a really cool paper in the latest issue of Nature Genetics by people from deCODE:

Detection of sharing by descent, long-range phasing and haplotype imputation
Kong et al.

Nature Genetics 40, 1068 - 1075 (2008); doi: 10.1038/ng.216

Abstract

Uncertainty about the phase of strings of SNPs creates complications in genetic analysis, although methods have been developed for phasing population-based samples. However, these methods can only phase a small number of SNPs effectively and become unreliable when applied to SNPs spanning many linkage disequilibrium (LD) blocks. Here we show how to phase more than 1,000 SNPs simultaneously for a large fraction of the 35,528 Icelanders genotyped by Illumina chips. Moreover, haplotypes that are identical by descent (IBD) between close and distant relatives, for example, those separated by ten meioses or more, can often be reliably detected. This method is particularly powerful in studies of the inheritance of recurrent mutations and fine-scale recombinations in large sample sets. A further extension of the method allows us to impute long haplotypes for individuals who are not genotyped.

As the abstract says, it concerns haplotype phasing and imputation, but the setup is really cool!

The case of Iceland

Iceland is a bit special. The Icelandic population is relatively small (about 300,000) and about 10% of the population has been "genome wide" genotyped at deCODE.

This is a very large fraction of the population, by any standard.

Further, the pedigree of the population is fairly well known from historical records and is estimated to be both reasonably complete and reasonably accurate for the last few centuries.

Again, this is rather unique.

Now, this paper introduces a method that exploits these two facts both to infer haplotype phase and to impute genotype information for untyped individuals (yes, individuals, not just missing markers!).

Trios and trio proxies

Inferring the haplotype phase of an individual is much simplified if you know the genotypes of his parents.

For a parent-child trio, the homozygotic sites in the parents can be used to infer the phase of the heterozygotic sites in the child. If the child is heterozygotic Aa but the father is homozygotic AA, then clearly the A allele comes from the father.

This simple observation can be used to infer haplotype phase.

It won't resolve all sites, of course, since it doesn't help at sites that are heterozygotic in all three, but it does resolve a lot of sites.
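To make the trio rule concrete, here is a minimal sketch in Python. The representation (a genotype as an unordered pair of allele labels) and the function names are my own assumptions for illustration, not anything from the paper, and Mendelian errors are simply left unresolved.

```python
def phase_child_site(child, father, mother):
    """Phase one site of a child using homozygotic parents.

    Each genotype is an unordered pair of alleles, e.g. ("A", "a").
    Returns (paternal_allele, maternal_allele), or None when the
    site cannot be resolved by this rule.
    """
    c1, c2 = child
    if c1 == c2:
        return (c1, c2)  # homozygotic child: nothing to phase
    for parent, is_father in ((father, True), (mother, False)):
        # A homozygotic parent can only have transmitted one allele.
        if parent[0] == parent[1] and parent[0] in child:
            from_parent = parent[0]
            other = c2 if c1 == from_parent else c1
            return (from_parent, other) if is_father else (other, from_parent)
    return None  # no informative homozygotic parent at this site


def phase_trio(child, father, mother):
    """Apply the rule site by site; unresolved sites come back as None."""
    return [phase_child_site(c, f, m) for c, f, m in zip(child, father, mother)]
```

The first site below is resolved through the father, the second through the mother, and the third stays unresolved because all three are heterozygotic there: `phase_trio([("A","a"),("B","b"),("C","c")], [("A","A"),("B","b"),("C","c")], [("A","a"),("b","b"),("C","c")])`.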

Now, typically you do not have trios in an association mapping study. Population based association mapping studies require, to a large degree, that the individuals are unrelated, so you would only be able to use the parents anyway, and those are not the ones you can phase this way.

The concept of surrogate parenthood

However, if you have a genealogy for the entire population and have genotyped a large fraction of it, you have a lot of proxies for the parents.

Based on the pedigree you can figure out which typed individuals could possibly be identical by descent (IBD). By also considering which are identical by state (IBS) you can figure out which almost certainly share a haplotype.
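As a rough sketch of the IBS part: two individuals can only share a haplotype over a region if they have at least one allele in common at every site there, i.e. they are never opposite homozygotes. The check below assumes genotypes are given as pairs of allele labels; the function name is hypothetical, and a real implementation would also need to handle genotyping errors and use the pedigree.

```python
def ibs_compatible(genotypes_a, genotypes_b):
    """True if the two individuals share at least one allele at every site.

    Each argument is a list of genotypes, one per site, where a genotype
    is an unordered pair of alleles such as ("A", "a").
    """
    return all(set(ga) & set(gb) for ga, gb in zip(genotypes_a, genotypes_b))
```

A pair that fails this check somewhere in the region cannot share a haplotype across it (ignoring typing errors); a pair that passes over a long stretch of markers is a good candidate for sharing one.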

Now these individuals can function as surrogate parents for each other. If any surrogate father is homozygotic AA at a site, then the haplotype inherited from the real father has the allele A.

With several surrogate parents, the real parents need not be typed, and heterozygotic sites in the parents are not a major problem as long as some surrogate parent is homozygotic at the site.
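The surrogate version of the rule can be sketched in the same way. This is only an illustration under my own assumptions: the surrogates are taken as already sorted into those sharing the paternal haplotype and those sharing the maternal one, genotypes are pairs of allele labels, and homozygotic surrogates that disagree are treated as uninformative. All names are hypothetical.

```python
def shared_allele(surrogates):
    """Allele implied by the homozygotic surrogates, or None.

    Returns None when no surrogate is homozygotic at the site or when
    the homozygotic surrogates disagree with each other.
    """
    alleles = {g[0] for g in surrogates if g[0] == g[1]}
    return alleles.pop() if len(alleles) == 1 else None


def phase_site_with_surrogates(child, surrogate_fathers, surrogate_mothers):
    """Return (paternal_allele, maternal_allele) for one site, or None."""
    c1, c2 = child
    if c1 == c2:
        return (c1, c2)  # homozygotic: nothing to phase
    paternal = shared_allele(surrogate_fathers)
    maternal = shared_allele(surrogate_mothers)
    if paternal in child:
        return (paternal, c2 if c1 == paternal else c1)
    if maternal in child:
        return (c2 if c1 == maternal else c1, maternal)
    return None  # every surrogate is heterozygotic or uninformative
```

With more surrogates, the chance that at least one of them is homozygotic at a given site goes up, which is why this resolves more sites than a single pair of real parents would.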

The relationship between sample size and the yield of LRP

You do need a large fraction of the population genotyped for this to work, though. Perhaps not as much as 10%, but a few percent seem to be necessary.

You probably do not need the pedigree to go back several centuries, but a few generations are probably necessary. I do not know how much of the pedigree you can infer directly from the data or if that defeats the purpose...

Inferring missing individuals

A really cool thing they can do based on this method is to impute the haplotypes for individuals not even typed at all.

This is different from imputing missing genotypes, something that has become very popular in association mapping over the last couple of years. There, the idea is that you infer missing markers and test those for association, as an alternative to haplotype association tests.

The idea here is that individuals not typed at all, but present in the pedigree, can have their genotypes inferred.

Now, if you have phenotype information (e.g. disease status) for individuals in the pedigree that you haven't typed, you would still be able to use them in an association mapping project.

Even if you do not, you could still use them. You would just have to consider your controls as population controls rather than "disease free" controls.

With this approach you might be able to work on data sets with hundreds of thousands of individuals rather than a "mere" tens of thousands.


Augustine Kong, Gisli Masson, Michael L Frigge, Arnaldur Gylfason, Pasha Zusmanovich, Gudmar Thorleifsson, Pall I Olason, Andres Ingason, Stacy Steinberg, Thorunn Rafnar, Patrick Sulem, Magali Mouy, Frosti Jonsson, Unnur Thorsteinsdottir, Daniel F Gudbjartsson, Hreinn Stefansson, Kari Stefansson (2008). Detection of sharing by descent, long-range phasing and haplotype imputation. Nature Genetics, 40 (9), 1068-1075. DOI: 10.1038/ng.216

Replicating haplotype findings

Tuesday, August 26th, 2008

I have a small problem.

We have analysed some cancer data from deCODE as part of the association mapping project PolyGene. We used Blossoc for this and we found some candidate regions worth examining further.

We have access to samples from Spain and the Netherlands, and we want to try to replicate the findings there. Now the problem is how to choose a strategy for replication.

Blossoc is a haplotype method that tries to infer the local genealogy in a region and then examines the clustering of phenotypes on this genealogy. The problem with such an approach is that you really need to type an entire region in the replication population to do the same trick there. This means typing a lot of markers in the replication sample (expensive) and potentially correcting for a lot of tests (reducing power). It is not really the way to go.

We extended Blossoc to output what it considers the most important SNPs in the genealogy inference in each interesting region. These should be the most important SNPs in the regions for the replication, and this gave us 2-6 SNPs per candidate region (with only 43 SNPs all in all for three diseases, so not a small reduction).

We have typed these SNPs in the replication population, but now we need to figure out how to try to replicate the findings with only that.

It goes without saying that we need to decide exactly what to test for based on the original data. If we start searching for significant signals in the new data we are no longer replicating but data trawling, and the risk of false positives increases drastically.

I have a program for listing all haplotype patterns in a data set and testing them for association, and I can run that on the old data to pick the patterns to test for in the new data.  There is a tradeoff, though, between association scores and the complexity of the pattern.  There is bound to be some overfitting in the old data, and we want to avoid that in the patterns to replicate.
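To give an idea of what I mean by listing patterns and testing them, here is a small sketch. It is not my actual program: the string encoding of haplotypes, the "*" wildcard, the function names and the plain Pearson chi-square on a 2x2 table are all assumptions made for the example.

```python
from itertools import product


def patterns(haplotypes):
    """Yield every pattern over the observed alleles, with '*' as wildcard.

    Haplotypes are equal-length strings with one character per SNP.
    The all-wildcard pattern is skipped since it matches everything.
    """
    n_snps = len(haplotypes[0])
    choices = [sorted({h[i] for h in haplotypes}) + ["*"] for i in range(n_snps)]
    for combo in product(*choices):
        if any(allele != "*" for allele in combo):
            yield "".join(combo)


def matches(pattern, haplotype):
    """True if the haplotype agrees with the pattern at every fixed SNP."""
    return all(p == "*" or p == a for p, a in zip(pattern, haplotype))


def chi_square(pattern, cases, controls):
    """Pearson chi-square for carrying the pattern vs. case/control status."""
    a = sum(matches(pattern, h) for h in cases)
    b = len(cases) - a
    c = sum(matches(pattern, h) for h in controls)
    d = len(controls) - c
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

The number of fixed (non-wildcard) positions in a pattern is one simple measure of its complexity, so picking patterns to replicate could mean preferring a pattern with fewer fixed positions over a slightly better scoring one with more.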

It is a tricky problem...