Replicating haplotype findings
Tuesday, August 26th, 2008
I have a small problem.
We have analysed some cancer data from DeCODE as part of the association mapping project PolyGene. We used Blossoc for this and we found some candidate regions worth examining further.
We have access to samples from Spain and the Netherlands, and we want to try to replicate the findings there. Now the problem is how to choose a strategy for replication.
Blossoc is a haplotype method that tries to infer the local genealogy in a region and then examines the clustering of phenotypes on this genealogy. The problem with such an approach is that, to do the same trick in the replication population, you really need to replicate an entire region. This means typing a lot of markers in the replication sample (expensive) and potentially correcting for a lot of tests (reducing power). It is not really the way to go.
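To make the power cost of correcting for many tests concrete, here is a rough sketch using a Bonferroni correction and a normal approximation for a two-sided test. The effect size (`ncp = 3.0`) and the test counts other than 43 are made-up numbers for illustration only, and nothing here reflects how Blossoc itself corrects for multiple testing.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


def approx_power(ncp, alpha):
    """Approximate power of a two-sided z-test.

    ncp is the expected z-score of the true effect (a hypothetical
    value here); alpha is the per-test significance level.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability that the observed z-score falls outside +/- z_crit.
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)


if __name__ == "__main__":
    # 1 test, the 43 selected SNPs, and a hypothetical 1000 markers.
    for n_tests in (1, 43, 1000):
        alpha = bonferroni_alpha(0.05, n_tests)
        print(n_tests, alpha, round(approx_power(3.0, alpha), 3))
```

The same true effect becomes harder to detect as the number of tests grows, which is the argument for typing only a handful of SNPs per region.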
We extended Blossoc to output what it considers the most important SNPs in the genealogy inference in each interesting region. This should capture the SNPs that matter most for the replication, and it gave us 2-6 SNPs per candidate region (only 43 SNPs in all for three diseases, so a substantial reduction).
We have typed these SNPs in the replication population, but now we need to figure out how to replicate the findings with only those SNPs.
It goes without saying that we need to decide exactly what to test for based on the original data. If we start searching for significant signals in the new data, we are no longer replicating but data trawling, and the risk of false positives increases drastically.
I have a program for listing all haplotype patterns in a data set and testing them for association, and I can run that on the old data to pick the patterns to test for in the new data. There is a tradeoff, though, between association scores and the complexity of the pattern. There is bound to be some overfitting in the old data, and we want to avoid that in the patterns to replicate.
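As an illustration of the idea (not the actual program mentioned above), here is a minimal sketch that enumerates haplotype patterns over a few biallelic SNPs and scores each with a 2x2 chi-square test. It assumes phased haplotypes coded as tuples of 0/1 alleles; the wildcard `None`, the `max_specified` cap on pattern complexity, and all function names are assumptions made for this example.

```python
from itertools import product
from math import erfc, sqrt


def matches(haplotype, pattern):
    """True if the haplotype agrees with every non-wildcard position."""
    return all(p is None or p == a for a, p in zip(haplotype, pattern))


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1df(chi2):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2.0))


def score_patterns(cases, controls, max_specified=2):
    """Score all patterns with at most max_specified fixed alleles.

    Returns (pattern, n_specified, chi2, p) tuples, strongest association
    first; ties are broken in favour of the simpler pattern.
    """
    n_snps = len(cases[0])
    results = []
    for pattern in product((None, 0, 1), repeat=n_snps):
        k = sum(p is not None for p in pattern)
        if k == 0 or k > max_specified:
            continue
        a = sum(matches(h, pattern) for h in cases)
        c = sum(matches(h, pattern) for h in controls)
        chi2 = chi2_2x2(a, len(cases) - a, c, len(controls) - c)
        results.append((pattern, k, chi2, p_value_1df(chi2)))
    results.sort(key=lambda r: (-r[2], r[1]))
    return results
```

Capping the number of specified alleles is one crude way to trade association score against pattern complexity. The p-values from the old data are still optimistic, because the patterns were selected on that same data, so only the few patterns chosen this way would be carried forward and tested once in the new samples.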
It is a tricky problem…