Yesterday, Bioinformatics accepted this paper of mine:
Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genotype Data
Z. Ding, T. Mailund and Y.S. Song
Motivation: Recent advances in genotyping technology have made data acquisition for whole-genome association study cost effective, and a current active area of research is developing efficient methods to analyze such large-scale data sets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this paper, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological data sets.
It is an extension of our Blossoc method. Blossoc is an association mapping method that constructs perfect phylogenies (trees compatible with the genotypes) along the genome and tests whether these trees tend to cluster cases and controls. If they do, we take it as evidence that local genotypes are associated with the disease.
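To make the "cluster cases and controls" idea concrete, here is a minimal sketch of one way to score a single local tree. It is not the scoring function Blossoc actually uses; it simply assumes each tree is given as a list of clades (the set of sample indices below each edge) and takes the largest chi-square statistic over all clades. The names `chi_square_2x2` and `score_tree` are made up for this illustration.

```python
from typing import Sequence, Set


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def score_tree(clades: Sequence[Set[int]], is_case: Sequence[bool]) -> float:
    """Hypothetical tree score: the best chi-square over all clades.

    Each clade splits the samples into "inside" and "outside"; the 2x2
    table counts cases and controls on each side of that split.
    """
    n_cases = sum(is_case)
    n_controls = len(is_case) - n_cases
    best = 0.0
    for clade in clades:
        cases_in = sum(1 for i in clade if is_case[i])
        controls_in = len(clade) - cases_in
        stat = chi_square_2x2(
            cases_in,
            controls_in,
            n_cases - cases_in,
            n_controls - controls_in,
        )
        best = max(best, stat)
    return best


# Six samples: the first three are cases, the last three are controls.
status = [True, True, True, False, False, False]
# A tree with one clade that separates cases from controls perfectly.
tree = [{0, 1, 2}, {0, 1}]
score = score_tree(tree, status)
```

A tree whose clades mix cases and controls evenly gets a score near zero, while a tree with a clade that is mostly cases gets a high score. Scanning the genome then amounts to computing such a score for the tree at each position and looking for peaks.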
In the original Blossoc paper we used a very simple algorithm for inferring the local phylogenies. This algorithm only works for phased data, however, and real data is never phased. The phase has to be inferred, and that is more time-consuming than the actual association test.
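A small example shows why phase is a problem in the first place. The sketch below assumes genotypes are coded 0/1/2 (the number of copies of one allele at each site), which is a common convention but not something taken from the paper, and `compatible_phasings` is a name invented here. It enumerates every pair of haplotypes that could have produced a given genotype.

```python
from itertools import product
from typing import List, Sequence, Tuple

Haplotype = Tuple[int, ...]


def compatible_phasings(genotype: Sequence[int]) -> List[Tuple[Haplotype, Haplotype]]:
    """List all unordered haplotype pairs that sum to the genotype.

    Assumes a 0/1/2 coding: 0 and 2 are homozygous, 1 is heterozygous.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> (0, 0) and 2 -> (1, 1).
    base = [g // 2 for g in genotype]
    if not het_sites:
        return [(tuple(base), tuple(base))]

    first, rest = het_sites[0], het_sites[1:]
    phasings = []
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(base), list(base)
        # Fix the first heterozygous site so that (h1, h2) and (h2, h1)
        # are not counted as two different phasings.
        h1[first], h2[first] = 0, 1
        for site, bit in zip(rest, bits):
            h1[site], h2[site] = bit, 1 - bit
        phasings.append((tuple(h1), tuple(h2)))
    return phasings


# Three heterozygous sites give 2**(3 - 1) = 4 possible phasings.
pairs = compatible_phasings([1, 2, 1, 0, 1])
```

The number of candidates doubles with every additional heterozygous site, so a single individual over a long region already has far too many phasings to enumerate, and the phase has to be inferred jointly across individuals instead.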
In the new paper we use an algorithm that can construct the phylogenies directly from unphased data.
One problem that we haven't quite solved yet is that we are working with perfect phylogenies (trees that exactly match the genotypes), and such trees can only be constructed for very short intervals of genotype data, at least for genome-wide association study data with thousands of individuals.
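For phased, binary data there is a standard way to see how short these intervals get: a perfect phylogeny exists for a set of sites exactly when every pair of sites passes the four-gamete test, that is, when no pair shows all four of the combinations 00, 01, 10 and 11. The sketch below grows an interval around a site for as long as that holds. It is an illustration on phased haplotypes only, not the algorithm from the paper, and the function names are made up.

```python
from typing import Sequence, Tuple


def four_gamete_compatible(haps: Sequence[Sequence[int]], i: int, j: int) -> bool:
    """True if sites i and j show at most three of the four gametes."""
    gametes = {(h[i], h[j]) for h in haps}
    return len(gametes) < 4


def compatible_interval(haps: Sequence[Sequence[int]], center: int) -> Tuple[int, int]:
    """Greedily extend [lo, hi] around `center` while all pairs stay compatible."""
    n_sites = len(haps[0])
    lo = hi = center
    grew = True
    while grew:
        grew = False
        for cand in (lo - 1, hi + 1):
            if not 0 <= cand < n_sites:
                continue
            # The candidate must be compatible with every site already included.
            if all(four_gamete_compatible(haps, cand, s) for s in range(lo, hi + 1)):
                lo, hi = min(lo, cand), max(hi, cand)
                grew = True
    return lo, hi


# Five haplotypes over four sites; site 3 conflicts with sites 0 and 1.
haps = [
    (0, 0, 0, 0),
    (1, 0, 0, 1),
    (1, 1, 0, 0),
    (1, 1, 1, 1),
    (0, 0, 0, 1),
]
around_site_1 = compatible_interval(haps, 1)
around_site_3 = compatible_interval(haps, 3)
```

With more haplotypes in the sample, it becomes more likely that some pair of nearby sites shows all four gametes, which is why the intervals shrink as the number of individuals grows.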
We still have some hacks to get around that, but those work only with phased data, so even with our new method we fall back on inferring the phase (although only locally) before we build some of the trees.
This is, by far, the most time-consuming part of the algorithm right now, so I hope we can improve on it in the future. Maybe we could construct "nearly perfect phylogenies" or something similar directly from unphased data. I'm not sure how to get there, but I think it is worth researching.