Major association mapping software release

Today I am releasing new versions of about half of my association software. It’s been a while since I released new versions of any of these tools, and in the meantime they have become more and more integrated, making it harder to release them independently. Since we needed to use them all in Iceland the last time we visited DeCODE (myself in December and three others from my group in January), we had to get all the software synchronized anyway, so I took that opportunity to make a major release.

I had planned to make the release close to New Year, so I code-named it the New Year release. I think I should re-name it to the Chinese New Year release. That is close enough that I can defend it.

Of course, it is even closer to the next planned release, the Happy Birthday release, which was supposed to come out tomorrow (on my birthday, of course). That release is likely to be delayed a couple more weeks, but I am sticking to the code name.

By the way, you can see the road-map for that here.

The software release consists of the following:

SNPFile: a library and API for manipulating large SNP datasets with associated meta-data (marker names, marker locations, individuals’ phenotypes, etc.) in an I/O-efficient binary file format. Version 2.0 adds a completely new serialization framework for storing meta-data. The previous one, based on Boost serialization, wasn’t binary compatible across platforms; the new one is. We also add a Python module for manipulating SNPFiles; this is version 1.0 of that module.

SMA: tools for single marker association tests. Currently there are three tools, two for case/control data and one for quantitative traits. Version 1.2 extends the tools with options for doing both genotype and allelic (additive) tests.

Blossoc: BLOck aSSOCiation. Blossoc is a linkage disequilibrium association mapping tool that attempts to build (perfect) genealogies for each site in the input and scores these according to non-random clustering of affected individuals, judging high-scoring areas as likely candidates for containing disease-affecting variation. The local genealogy trees are built using a number of heuristics that are not guaranteed to build true trees, but that have the advantage over more sophisticated methods of being extremely fast. Blossoc can therefore handle much larger datasets than more sophisticated tools, but at the cost of sacrificing some accuracy. Version 1.3 adds methods for scanning for quantitative traits and is tightly integrated with SNPFile.

HapCluster: a Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium mapping, described in detail in:

Fine Mapping of Disease Genes via Haplotype Clustering. E.R.B. Waldron, J.C. Whitaker and D.J. Balding. Genetic Epidemiology. 30: 170–179. (2006)

a tool I develop in collaboration with David Balding’s group at Imperial College London. Version 2.2 is basically just integration with SNPFile 2.0. The next major development of HapCluster is what I have planned for the Happy Birthday release.
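To make the genotype versus allelic distinction mentioned for SMA a bit more concrete, here is a minimal sketch of the two tests on case/control data. It is not SMA's implementation and does not use SNPFile; the function names and the genotype counts are made up for illustration. The genotype test is a Pearson chi-square test on the 2x3 table of genotype counts (2 degrees of freedom), while the allelic test collapses the genotypes to allele counts and tests the resulting 2x2 table (1 degree of freedom). Note that the allelic test is closely related to, but not identical to, the Cochran-Armitage trend test.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def allele_counts(genotype_counts):
    """Collapse (AA, AB, BB) genotype counts to (A, B) allele counts."""
    aa, ab, bb = genotype_counts
    return [2 * aa + ab, 2 * bb + ab]


# Hypothetical genotype counts (AA, AB, BB); not real data.
cases = [30, 50, 20]
controls = [50, 40, 10]

# Genotype test: 2x3 table, 2 degrees of freedom.
genotype_stat = chi_square([cases, controls])
genotype_p = math.exp(-genotype_stat / 2)  # chi-square tail, df = 2

# Allelic test: 2x2 table of allele counts, 1 degree of freedom.
allelic_stat = chi_square([allele_counts(cases), allele_counts(controls)])
allelic_p = math.erfc(math.sqrt(allelic_stat / 2))  # chi-square tail, df = 1

print("genotype test: chi2 = %.3f, p = %.4g" % (genotype_stat, genotype_p))
print("allelic test:  chi2 = %.3f, p = %.4g" % (allelic_stat, allelic_p))
```

The closed-form tail probabilities only hold for 1 and 2 degrees of freedom, which is why the sketch gets away without a statistics library.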

MCMC diagnostics for phylogenies

I often use Markov Chain Monte Carlo (MCMC) methods in my research, but I still treat them a bit like magic. Sometimes they work great, and sometimes getting a chain to mix or converge in reasonable time is nearly impossible. The papers and textbooks I’ve read on the topic more often than not teach me tricks that work on real numbers or vectors in Euclidean space, but the typical setting for me is a discrete state space (or a mix of continuous and discrete parameters), and I cannot find much in the literature to help me out with that.

Just something as simple as checking convergence of a chain, or estimating the effective sample size, is giving me problems.
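As a rough illustration of the effective sample size problem, here is a minimal sketch of one common estimator: divide the chain length by one plus twice the sum of the autocorrelations, truncating the sum at the first non-positive lag. For a discrete state space you first have to pick a scalar summary of the state to trace (an indicator, a log-likelihood, and so on); which summary to use is an assumption you have to make, and the sticky two-state chain below is just a toy stand-in for a poorly mixing sampler.

```python
import random


def effective_sample_size(trace):
    """Estimate ESS as n / (1 + 2 * sum of positive-lag autocorrelations).

    The sum is truncated at the first non-positive autocorrelation, which
    is a simple heuristic rather than the only possible choice.
    """
    n = len(trace)
    mean = sum(trace) / n
    centred = [x - mean for x in trace]
    var = sum(c * c for c in centred) / n
    if var == 0.0:
        # Convention used here: a chain that never moved counts as one sample.
        return 1.0
    rho_sum = 0.0
    for lag in range(1, n // 2):
        cov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def sticky_chain(n, stay_prob, rng):
    """Toy two-state chain that keeps its state with probability stay_prob."""
    state = 0
    trace = []
    for _ in range(n):
        if rng.random() > stay_prob:
            state = 1 - state
        trace.append(state)
    return trace


rng = random.Random(1)
n = 5000
ess_sticky = effective_sample_size(sticky_chain(n, 0.95, rng))
ess_iid = effective_sample_size([rng.randint(0, 1) for _ in range(n)])
print("sticky chain ESS: %.0f of %d samples" % (ess_sticky, n))
print("independent draws ESS: %.0f of %d samples" % (ess_iid, n))
```

The sticky chain should report an effective sample size far below its length, while independent draws should come out close to the full length.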

Just now I saw a tool that could have helped me earlier, had it existed at the time:

AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics
Nylander et al.
Bioinformatics 2008 24(4):581-583; doi:10.1093/bioinformatics/btm388

Of course, I would be happier with an R package or Python module than what looks like an unholy mix of scripts, but beggars can’t be choosers.

The SBML discrete stochastic models test suite

Good test suites (and benchmark suites) are important for software (and model) development, but can be pretty hard to come up with for stochastic models or software relying on probabilistic algorithms. In this issue of Bioinformatics there is an application note describing such a test suite:

The SBML discrete stochastic models test suite
Evans, Gillespie and Wilkinson
Bioinformatics 2008 24(2):285-286.

Their approach is pretty simple: compare the mean of your simulations with the expected value and see if it falls outside the expected range (given a known or previously simulated standard deviation). As such, there is not really that much to it. The models are also pretty simple, so while they will be useful for catching obvious bugs in a general simulator, I am not sure they will help catch more complex bugs. Of course, you do need to have the simple stuff working before you can tackle the harder problems, so it could still be useful.
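The idea can be sketched in a few lines. This is not code from the test suite: it is a minimal example, assuming a linear birth-death process (for which the mean and variance at time t are known in closed form) simulated with Gillespie's direct method. The parameter values, the number of runs and the function names are all chosen for illustration, and the cut-off of 3 on the Z score is likewise an assumption.

```python
import math
import random


def simulate_birth_death(x0, birth, death, t_end, rng):
    """Gillespie simulation of a linear birth-death process; returns X(t_end)."""
    x, t = x0, 0.0
    while x > 0:
        total_rate = (birth + death) * x
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_end:
            break
        if rng.random() < birth / (birth + death):
            x += 1
        else:
            x -= 1
    return x


def birth_death_moments(x0, birth, death, t):
    """Exact mean and standard deviation of X(t), valid for birth != death."""
    r = birth - death
    growth = math.exp(r * t)
    mean = x0 * growth
    var = x0 * (birth + death) / r * growth * (growth - 1.0)
    return mean, math.sqrt(var)


def z_score(samples, expected_mean, expected_sd):
    """Distance of the sample mean from the expected mean, in standard errors."""
    n = len(samples)
    sample_mean = sum(samples) / n
    return (sample_mean - expected_mean) / (expected_sd / math.sqrt(n))


# Illustrative parameters (assumptions, not taken from the suite).
x0, birth, death, t_end, n_runs = 100, 0.1, 0.11, 10.0, 500
rng = random.Random(42)
samples = [simulate_birth_death(x0, birth, death, t_end, rng)
           for _ in range(n_runs)]
mean, sd = birth_death_moments(x0, birth, death, t_end)
z = z_score(samples, mean, sd)
print("expected mean %.2f, sd %.2f, Z = %.2f" % (mean, sd, z))
print("simulator looks OK" if abs(z) < 3 else "simulator looks suspicious")
```

A correct simulator will still fail such a check now and then purely by chance, so a single failure is a reason to rerun with more simulations rather than proof of a bug.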

Anyway, I am not planning on implementing a general SBML simulator, so it is not that much use to me, except that I am teaching a Systems Biology class where we are using Wilkinson’s book, so the models in the test suite match the exercises I am giving my students, and I can use the test suite to test their programming. Neat.


Citation, for Research Blogging: Evans, T.W., Gillespie, C.S., Wilkinson, D.J. (2007). The SBML discrete stochastic models test suite. Bioinformatics, 24(2), 285-286. DOI: 10.1093/bioinformatics/btm566

And it just keeps getting cheaper…

Yesterday I mentioned how the price keeps dropping on genome re-sequencing, and already today I spot yet another post (on CLC Bio’s “Next Generation Sequencing” blog) on the very topic. Now VisiGen will start selling genome re-sequencing for $1000 — what I would consider the price at which you will start using re-sequencing rather than SNP typing — at the end of 2009.

Will we see the first re-sequencing association mapping studies in 2010?

Joining the DNA Network

Today I received an email from Hsien-Hsien Lei from Eye on DNA, inviting me to join the DNA Network (read a description of this network here). So now I have.

The DNA Headlines

I wasn’t even aware that this existed, but I am happy to learn about it. There are lots of interesting blogs there to follow (and I am a bit addicted to blog reading).

I am just a bit nervous about whether I have much to contribute to all this, but hey, if you don’t like my blog just stop reading it, right?