Posts Tagged ‘association mapping’

SNPFile

Thursday, December 11th, 2008

Just breaking my silence again to tell you that this paper of mine is out:

SNPFile – A software library and file format for large scale association mapping and population genetics studies

J. Nielsen and T. Mailund

BMC Bioinformatics 2008, 9:526 doi:10.1186/1471-2105-9-526

Abstract

Background
High-throughput genotyping technology has enabled cost effective typing of thousands of individuals in hundred of thousands of markers for use in genome wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to efficiently store and manipulate the data. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions.
Results
We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods.
Conclusions
The new file format has been very useful for our own studies where it has significantly reduced the informatics burden in keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation with the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. This is alleviated by a scripting interfaces that makes it easy to write converters to and from the format.

In the genome wide association studies I’ve been involved with, mainly in the PolyGene project with DeCODE, we needed a file format to handle the massive data we had to analyse.

We needed something that was fast to load and fast to scan through, and we needed to be able to store various meta-data (various co-variates, different phenotypes, gender, …) associated with each individual or each marker (rs#, genomic position, …).

The file format described in the paper is what we came up with.

The primary data, the genotypes, is just a memory-mapped matrix stored column-wise to make analysis local in memory, which reduces seek time when the data is on disk and reduces cache misses when it is in main memory.
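To make the layout idea concrete, here is a minimal sketch using only Python's standard `mmap` module. It is not the SNPFile format or its API; the one-byte-per-genotype encoding, the file name and the helper functions `write_column_major` and `read_marker` are all assumptions made for illustration. The point is only that with markers stored one after another, all genotypes for a marker form one contiguous slice of the mapping.

```python
import mmap
import os
import tempfile

def write_column_major(path, matrix):
    """Write a matrix of small ints (0, 1, 2) as bytes, one marker (column) after another."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    with open(path, "wb") as f:
        for col in range(n_cols):
            f.write(bytes(matrix[row][col] for row in range(n_rows)))
    return n_rows, n_cols

def read_marker(mm, n_rows, col):
    """Return all genotypes for one marker: a single contiguous slice of the mapping."""
    start = col * n_rows
    return list(mm[start:start + n_rows])

# Toy data: 3 individuals (rows) x 4 markers (columns), genotypes coded 0/1/2.
genotypes = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 0, 1],
]

path = os.path.join(tempfile.mkdtemp(), "toy_genotypes.bin")  # hypothetical file name
n_rows, n_cols = write_column_major(path, genotypes)

with open(path, "rb") as f:
    # Map the file read-only; pages are loaded by the OS only when touched.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    marker_2 = read_marker(mm, n_rows, 2)
    mm.close()

print(marker_2)
```

A scan that visits markers in order then walks the file front to back, which is the access pattern the column-wise layout is meant to favour.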

For meta-data we have a serialisation framework so we can store any kind of C++ type.

We also wrote a Python API for the file format, which we have used whenever we need some quick conversion of the format or a quick-and-dirty scan of some sort.

We even use it to prototype our new methods now. At APBC’09 next month, Søren Besenbacher, Christian N.S. Pedersen and I have a paper on a method that was implemented in Python and uses this file format for a genome wide scan.

Of course, the serialisation framework for meta data is rather C++ specific, so it was a bit of a hassle to make it play well with Python.  It doesn’t fully do that yet, actually.

What we had to do was to pick the most common types we use and explicitly build Python interfaces for those.  This means that when we are building the Python API we are generating code for all the various combinations of basic types and basic containers.  It makes the compilation really slow, but after it is compiled it works beautifully.

We hide all the different type handles away through a polymorphic interface, of course.

Oh, I could go on about this Python hack, and maybe I will later, but right now I have a cold to nurse so I am off to get a hot cup of tea…

CD/CV and Goldstein

Wednesday, September 17th, 2008

Everyone seems to be talking about this NY Times interview with David B. Goldstein (Gene Sherpas, biomarker-driven mental health, Adaptive Complexity, John Hawks …)

In the proud tradition of blogging, I will add my voice to the noise.

The common disease / common variant hypothesis

The arguments concern association mapping and the so-called Common Disease / Common Variant (CD/CV) hypothesis.  The CD/CV hypothesis goes like this: a lot of common diseases are late-onset, so we do not expect selection to be strong against the genetic factors underlying them. This, combined with the recent expansion in the human population, leads us to expect that a lot of common diseases are caused by relatively common variants.

If the hypothesis is true, then we should be able to locate these common variants since we can tag all common variants in the genome with relatively few markers, and we can type these using SNP chips.

If the hypothesis is false, then we are screwed. We probably need complete re-sequencing and some heavy duty statistics to get anywhere.

Out of convenience more than anything, people chose to believe the CD/CV to be true, and started projects such as HapMap to map the common variation in the genome.  Based on this map, companies developed chips to tag all variation genome wide, and disease studies used these chips to do genome wide scans.

Goldstein argues:

It takes large, expensive trials with hundreds of patients in different countries to find even common variants behind a disease. Rare variants lie beyond present reach. “It’s an astounding thing,” Dr. Goldstein said, “that we have cracked open the human genome and can look at the entire complement of common genetic variants, and what do we find? Almost nothing. That is absolutely beyond belief.”

If rare variants account for most of the genetic burden of disease, then the idea of decoding everyone’s genome to see to what diseases they are vulnerable to will not work, at least not in the form envisaged. “I don’t believe we should do more and more genomewide association studies for common diseases,” Dr. Goldstein said. Instead, he suggested, the “missing heritability” might be tracked by thoroughly studying the genome of specific patients.

I would say the jury is still out on this one, but it is clear that the CD/CV scenario isn’t as common as it was hyped to be.  We can only explain a small percentage of the heritability of diseases with the variants found so far.  Still, we have discovered more variants that we can replicate within the last year or two than in all the time up to genome wide scans, so writing off genome wide association studies completely is a bit extreme, in my view.

No, CD/CV is not the full story, but some common variants exist, because we have found them!

The real question is, of course, how much heritability is explained by common variants and how much by rare variants.  Right now, we simply do not know.  The power to detect even common variants is limited, so there might be more out there to find.  On the other hand, it is hard to believe that the vast majority of the heritability is caused by common variants since we still can only explain very little of it, so some rare variants must be involved.

In the coming few years we will probably figure this out, and that is exciting indeed.

Common disease and selection

Now as for variants behind common diseases being selectively (near) neutral — part of why they can be common in the first place — that is an interesting question.

I personally think that selection is playing a larger part in the story of common diseases than we think, and I look forward to learning this story.

Are we seeing common variants because bottlenecks have reduced selection strength, so that rare variants that would otherwise be selected against have managed to increase in frequency by drift? Are we seeing common variants because they are maintained by some form of balancing selection? Are they hitch-hiking on beneficial variants?

We are already hearing about interesting findings here (Helgason et al. 2007, Blekhman et al. 2008) and we will learn much more in the future.

We live in interesting times indeed, and now is not the time to abandon genome wide association studies.

Simultaneous analysis of all SNPs in a genome-wide association study

Monday, September 15th, 2008

In our association mapping journal club a few weeks back, we discussed this paper (I just never got around to writing down my thoughts on it until now):

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies

Hoggart, Whittaker, De Iorio and Balding, PLoS Genetics 2008

Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.

I already heard about the method when I was visiting Imperial College to give a seminar last year, so I am happy that I can finally talk about it.

It is a pretty neat idea.

Regression analysis in association mapping

If you want to figure out which parameters are important for predicting some property, a good old statistical approach is regression analysis.

For a binary property, such as case or control in an association study, you could use logistic regression, but in general you construct some linear function of your parameters and transform it into the “property space” through a link function.

This setup gives you a “model” and depending on the link and the setup you have different ways of interpreting this as a statistical model with a corresponding likelihood function.

The coefficients in the linear combination of parameters are the parameters in the model, and you typically maximize the likelihood with respect to them to get your estimate for them.

In some cases you can directly interpret the parameters, but more often than not you are only interested in knowing whether there is strong evidence in the data that they should be non-zero, i.e. that the parameter in question actually has an effect on the property.

In an association study, you would use your SNPs as your parameters and consider those SNPs with a non-zero coefficient to be associated with the disease.
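As a small illustration of this setup, the sketch below writes out the likelihood for a logistic regression on SNP genotypes in plain Python. The toy genotypes, the 0/1/2 minor-allele coding and the coefficient values are made up for the example; nothing here comes from the paper. The linear predictor is the intercept plus a weighted sum of genotypes, and the inverse logit link turns it into a probability of being a case.

```python
import math

def logistic(eta):
    """Inverse logit link: maps the linear predictor to a probability of being a case."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(intercept, coefs, genotypes, status):
    """Bernoulli log-likelihood of case/control status given genotypes."""
    total = 0.0
    for row, y in zip(genotypes, status):
        # Linear function of the parameters (here: SNP genotypes).
        eta = intercept + sum(b * g for b, g in zip(coefs, row))
        p = logistic(eta)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Toy data: 6 individuals, 2 SNPs coded as minor-allele counts (0/1/2).
genotypes = [[0, 1], [0, 0], [1, 2], [2, 1], [2, 0], [1, 1]]
status = [0, 0, 0, 1, 1, 1]  # 1 = case, 0 = control

# All coefficients zero versus a non-zero coefficient on the first SNP only.
null_fit = log_likelihood(0.0, [0.0, 0.0], genotypes, status)
snp1_fit = log_likelihood(-1.5, [1.5, 0.0], genotypes, status)

print(null_fit, snp1_fit)
```

In this toy data the first SNP tracks case status, so giving it a non-zero coefficient raises the likelihood; maximising over the coefficients is what produces the estimates discussed above.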

Of course, it is never as simple as that.

Two things complicate matters. First, your best estimate of a coefficient will never actually be zero, so you want to test whether it is significantly different from zero. Second, you have many more parameters (SNPs) than you have outcomes (individuals), so you will overfit like hell.

Strong “zero” priors

What they do in this paper is both simple and very clever.

They consider the problem in a Bayesian setting and put strong priors on the coefficients that will tend to keep them at zero unless the signal in the data is strong enough to pull them away from there.

They then test for association by checking whether the posterior modes for these parameters have moved away from zero.
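To see how a prior with a sharp mode at zero can give a posterior mode that is exactly zero, here is a deliberately simplified sketch. It assumes a single coefficient with a Gaussian likelihood and a Laplace (double-exponential) prior, chosen because the mode then has a closed form (soft-thresholding). It does not implement the normal-exponential-gamma prior favoured in the abstract, nor the paper's search procedure, and the estimates, variance and rate below are invented numbers.

```python
def posterior_mode(estimate, variance, rate):
    """Posterior mode for one coefficient under a Laplace prior (assumed stand-in).

    Minimises (beta - estimate)**2 / (2 * variance) + rate * abs(beta),
    which amounts to soft-thresholding the estimate at rate * variance.
    """
    threshold = rate * variance
    if estimate > threshold:
        return estimate - threshold
    if estimate < -threshold:
        return estimate + threshold
    return 0.0  # signal too weak to pull the mode away from zero

# Hypothetical per-SNP maximum likelihood estimates (log odds ratios).
estimates = {"snp_a": 0.90, "snp_b": 0.08, "snp_c": -0.05, "snp_d": -0.60}
variance = 0.01  # assumed sampling variance of each estimate
rate = 20.0      # assumed prior rate; larger means a sharper peak at zero

modes = {name: posterior_mode(b, variance, rate) for name, b in estimates.items()}
selected = [name for name, mode in modes.items() if mode != 0.0]

print(modes)
print(selected)
```

Weak estimates end up with a mode of exactly zero, while strong ones survive in shrunken form, so "is the mode non-zero?" becomes the selection rule.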

A very nice consequence of this is that you can analyse the entire data at the same time, rather than testing markers individually, which means that if several markers are in LD with a causal marker, you will tend to only pick one of them and recognize that the signal in the others is essentially the same signal.

It also seems quite computationally feasible.  A few hours on a desktop computer to analyse a GWA data set.


Clive J. Hoggart, John C. Whittaker, Maria De Iorio, David J. Balding (2008). Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies PLoS Genetics, 4 (7) DOI: 10.1371/journal.pgen.1000130

New genetic variant discovered, associated with bladder cancer

Monday, September 15th, 2008

I just saw this press release: deCODE and Radboud University Discover Common Variants in the Human Genome Conferring Risk of Bladder Cancer

We, here at BiRC, actually collaborate with both deCODE and Radboud Uni in the EU PolyGene project. The bladder cancer analysis is not part of PolyGene, but through the collaboration we have access to it, and we have just started analysing it.

We weren’t in on the initial analysis, though, so we are not part of this discovery. We only get access to data after they have already mined what they can find themselves. A bit annoying, but perfectly reasonable. Our contribution to the collaboration is methods development, and anything they can find with the methods they already have, they do not really need us for.

Still, it would have been nice to be in on the analysis from the beginning. Whenever we get our hands on the data, we always get excited about hits only to discover that they are already submitted for publication.

Anyway, nice to see that they get something out of the data.

Large scale phasing and imputing in Iceland

Tuesday, September 2nd, 2008

There is a really cool paper in the latest issue of Nature Genetics by people from deCODE:

Detection of sharing by descent, long-range phasing and haplotype imputation
Kong et al.

Nature Genetics 40, 1068 – 1075 (2008); doi: 10.1038/ng.216

Abstract

Uncertainty about the phase of strings of SNPs creates complications in genetic analysis, although methods have been developed for phasing population-based samples. However, these methods can only phase a small number of SNPs effectively and become unreliable when applied to SNPs spanning many linkage disequilibrium (LD) blocks. Here we show how to phase more than 1,000 SNPs simultaneously for a large fraction of the 35,528 Icelanders genotyped by Illumina chips. Moreover, haplotypes that are identical by descent (IBD) between close and distant relatives, for example, those separated by ten meioses or more, can often be reliably detected. This method is particularly powerful in studies of the inheritance of recurrent mutations and fine-scale recombinations in large sample sets. A further extension of the method allows us to impute long haplotypes for individuals who are not genotyped.

As the abstract says, it concerns haplotype phasing and imputation, but the setup is really cool!

The case of Iceland

Iceland is a bit special. The Icelandic population is relatively small (about 300,000) and about 10% of the population has been “genome wide” genotyped at deCODE.

This is a very large fraction of the population, by any standard.

Further, the pedigree of the population is fairly well known from historical records and estimated to be both reasonably complete and reasonably accurate for the last few centuries.

Again, this is rather unique.

Now, this paper introduces a method that exploits these two facts to both infer haplotype phase and impute genotype information for untyped individuals (yes, individuals, not just missing markers!).

Trios and trio proxies

Inferring the haplotype phase of an individual is much simplified if you know the genotypes of his parents.

For a parent-child trio, the homozygotic sites in the parents can be used to infer the phase of the heterozygotic sites in the child. If the child is heterozygotic Aa but the father is homozygotic AA, then clearly the A allele comes from the father.

This simple observation can be used to infer haplotype phase.

It won’t resolve all sites, of course, since it doesn’t help at sites heterozygotic in all three, but it does resolve a lot of sites.
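The trio rule above can be sketched in a few lines of Python. The representation (each genotype as a pair of allele labels) and the function name `phase_site` are choices made for this example, and the sketch assumes no genotyping errors or new mutations.

```python
def phase_site(child, father, mother):
    """Return (paternal_allele, maternal_allele) for the child, or None if unresolved.

    Each genotype is an unordered pair of alleles, e.g. ("A", "a").
    Assumes no genotyping errors and no de novo mutations.
    """
    a, b = child
    if a == b:
        return (a, b)  # homozygous child: nothing to phase
    for parent, is_father in ((father, True), (mother, False)):
        # A homozygous parent can only have transmitted one allele.
        if parent[0] == parent[1] and parent[0] in child:
            from_parent = parent[0]
            other = b if from_parent == a else a
            return (from_parent, other) if is_father else (other, from_parent)
    return None  # heterozygous in all three: this rule cannot resolve the site

# Four toy sites for one trio.
child = [("A", "a"), ("A", "a"), ("a", "a"), ("A", "a")]
father = [("A", "A"), ("A", "a"), ("A", "a"), ("A", "a")]
mother = [("A", "a"), ("a", "a"), ("a", "a"), ("A", "a")]

phased = [phase_site(c, f, m) for c, f, m in zip(child, father, mother)]
print(phased)
```

The last site is heterozygous in child, father and mother, so it stays unresolved, exactly the case described above.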

Now, typically you do not have trios in an association mapping study. Population-based association mapping studies require, to a large degree, that the individuals are unrelated, so you would only be able to use the parents anyway, and those are not the ones you can phase this way.

The concept of surrogate parenthood

However, if you have a genealogy for the entire population and have genotyped a large fraction of it, you have a lot of proxies for the parents.

Based on the pedigree you can figure out which typed individuals could possibly be identical by descent (IBD). By also considering which are identical by state (IBS) you can figure out which almost certainly share a haplotype.

Now these individuals can function as surrogate parents for each other. If any surrogate father is homozygotic AA at a site, then the haplotype inherited from the real father has the allele A.

With several surrogate parents, the real parents need not be typed, and heterozygotic sites in the parents are not a major problem as long as some surrogate parent is homozygotic at the site.
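Here is a toy sketch of the surrogate idea, not the procedure from the paper. It assumes that "sharing a haplotype" can be screened for by checking that two individuals are never opposite homozygotes over a region, and that the surrogates have already been assigned to the paternal side (for example from the pedigree). The function names and data are invented, and a real region would span far more than three sites.

```python
def shares_allele_everywhere(ind1, ind2):
    """True if two individuals are never opposite homozygotes over a region.

    This is an identical-by-state screen (an assumption of this sketch):
    at every site the two genotypes must have at least one allele in common.
    """
    return all(set(g1) & set(g2) for g1, g2 in zip(ind1, ind2))

def paternal_allele(child_site, surrogate_sites):
    """Infer the paternally inherited allele at one site from surrogate fathers.

    If any surrogate is homozygous, the shared haplotype must carry that allele.
    Returns None when every surrogate is heterozygous at the site.
    """
    for genotype in surrogate_sites:
        if genotype[0] == genotype[1] and genotype[0] in child_site:
            return genotype[0]
    return None

# Toy region of three sites; the child is heterozygous at all of them.
child = [("A", "a"), ("B", "b"), ("C", "c")]
surrogates = [
    [("A", "a"), ("B", "B"), ("C", "c")],
    [("A", "A"), ("B", "b"), ("C", "c")],
]
unrelated = [("a", "a"), ("B", "b"), ("C", "C")]

# Surrogates pass the screen against the child; opposite homozygotes fail it.
passes = [shares_allele_everywhere(child, s) for s in surrogates]
rejected = not shares_allele_everywhere(surrogates[1], unrelated)

paternal = [
    paternal_allele(site, [s[i] for s in surrogates])
    for i, site in enumerate(child)
]
print(passes, rejected, paternal)
```

Neither surrogate alone resolves both of the first two sites, but together they do, which is why having many surrogates matters. The third site stays open because no surrogate is homozygous there.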

The relationship between sample size and the yield of LRP

You do need a large fraction of the population genotyped for this to work, though. Perhaps not as much as 10%, but a few percent seem to be necessary.

You probably do not need the pedigree to go back several centuries, but a few generations are probably necessary. I do not know how much of the pedigree you can infer directly from the data or if that defeats the purpose…

Inferring missing individuals

A really cool thing they can do based on this method is to impute the haplotypes for individuals not even typed at all.

This is different from imputing missing genotypes, something that has gotten very popular in association mapping the last couple of years and where the idea is that you infer missing markers to test those for association, as an alternative to haplotype association tests.

The idea here is that individuals not typed at all, but present in the pedigree, can have their genotypes inferred.

Now, if you have phenotype information (e.g. disease status) for individuals in the pedigree that you haven’t typed, you would still be able to use them in an association mapping project.

Even if you do not, you could still use them; you would just have to consider your controls as population controls rather than “disease free” controls.

With this approach you might be able to work on data sets with hundreds of thousands of individuals rather than a “mere” tens of thousands.


Augustine Kong, Gisli Masson, Michael L Frigge, Arnaldur Gylfason, Pasha Zusmanovich, Gudmar Thorleifsson, Pall I Olason, Andres Ingason, Stacy Steinberg, Thorunn Rafnar, Patrick Sulem, Magali Mouy, Frosti Jonsson, Unnur Thorsteinsdottir, Daniel F Gudbjartsson, Hreinn Stefansson, Kari Stefansson (2008). Detection of sharing by descent, long-range phasing and haplotype imputation Nature Genetics, 40 (9), 1068-1075 DOI: 10.1038/ng.216