Just breaking my silence again to tell you that this paper of mine got out:
J. Nielsen and T. Mailund
BMC Bioinformatics 2008, 9:526 doi:10.1186/1471-2105-9-526
High-throughput genotyping technology has enabled cost-effective typing of thousands of individuals at hundreds of thousands of markers for use in genome-wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to efficiently store and manipulate the data. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions.
We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods.
The new file format has been very useful for our own studies, where it has significantly reduced the informatics burden in keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation of the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. This is alleviated by a scripting interface that makes it easy to write converters to and from the format.
In the genome-wide association studies I've been involved with, mainly in the PolyGene project with DeCODE, we needed a file format to handle the massive data we had to analyse.
We needed something that was fast to load and fast to scan through, and we needed to be able to store various meta-data associated with each individual (various covariates, different phenotypes, gender, ...) or each marker (rs#, genomic position, ...).
The file format described in the paper is what we came up with.
The primary data -- the genotypes -- is just a memory-mapped matrix stored column-wise to keep analyses local in memory, which reduces seek time when the data is on disk and reduces cache misses when it is in main memory.
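This is not our C++ implementation, but a minimal Python sketch (using numpy's memmap, my choice for illustration) of why the column-major layout matters: a per-marker scan then reads one contiguous block per marker instead of striding across the whole file.

```python
import numpy as np

n_individuals, n_markers = 200, 1_000

# Create a genotype matrix on disk in Fortran (column-major) order,
# so all genotypes for one marker sit contiguously.
genotypes = np.memmap("genotypes.dat", dtype=np.int8, mode="w+",
                      shape=(n_individuals, n_markers), order="F")
genotypes[:] = np.random.randint(0, 3, size=genotypes.shape)  # 0/1/2 calls
genotypes.flush()

# Reopen read-only; a marker-by-marker scan is now sequential IO.
data = np.memmap("genotypes.dat", dtype=np.int8, mode="r",
                 shape=(n_individuals, n_markers), order="F")
allele_freqs = data.mean(axis=0) / 2.0  # mean genotype / 2 per marker
```

Storing individual-by-marker data row-wise instead would scatter each marker's genotypes across the file, which is exactly the access pattern most multi-locus scans want to avoid.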
For meta-data we have a serialisation framework so we can store any kind of C++ type.
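The actual framework is C++ and far more general, but the basic idea of tagged binary serialisation can be sketched in a few lines of Python (all names here are illustrative, not our API): each value is written as a type tag followed by its binary encoding, so a reader can reconstruct it without knowing the type up front.

```python
import struct

def serialise(value):
    # Hypothetical sketch: one-byte type tag, then a fixed binary encoding.
    if isinstance(value, int):
        return b"i" + struct.pack("<q", value)
    if isinstance(value, float):
        return b"d" + struct.pack("<d", value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return b"s" + struct.pack("<I", len(raw)) + raw
    raise TypeError(f"unsupported type: {type(value)}")

def deserialise(buf):
    tag, rest = buf[:1], buf[1:]
    if tag == b"i":
        return struct.unpack("<q", rest[:8])[0]
    if tag == b"d":
        return struct.unpack("<d", rest[:8])[0]
    if tag == b"s":
        (n,) = struct.unpack("<I", rest[:4])
        return rest[4:4 + n].decode("utf-8")
    raise ValueError("unknown type tag")
```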
We also wrote a Python API for the file format, which we have used whenever we need some quick conversion of the format or a quick-and-dirty scan of some sort.
We even use it to prototype our new methods now. At APBC'09 next month, Søren Besenbacher, Christian N.S. Pedersen and I have a paper on a method that was implemented in Python and uses this file format for a genome-wide scan.
Of course, the serialisation framework for meta-data is rather C++ specific, so it was a bit of a hassle to make it play well with Python. It doesn't fully do that yet, actually.
What we had to do was to pick the most common types we use and explicitly build Python interfaces for those. This means that when we are building the Python API we are generating code for all the various combinations of basic types and basic containers. It makes the compilation really slow, but after it is compiled it works beautifully.
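A generator along these lines (purely illustrative; `expose_type` is a made-up placeholder, not a function in our code) captures the combinatorics: enumerate the container templates and element types, and emit one explicit binding per pair.

```python
from itertools import product

# Hypothetical sketch of the binding generator: every (container, type)
# combination gets an explicit instantiation, which is why compiling
# the generated code is slow.
basic_types = ["int", "double", "std::string"]
containers = ["std::vector<{t}>", "std::list<{t}>", "std::map<std::string, {t}>"]

def generate_bindings():
    lines = []
    for tmpl, t in product(containers, basic_types):
        cpp_type = tmpl.format(t=t)
        lines.append(f'expose_type<{cpp_type}>("{cpp_type}");')
    return "\n".join(lines)
```

With 3 containers and 3 element types this already produces 9 instantiations, and the count grows multiplicatively as either list grows, which is where the compile time goes.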
We hide all the different type handles behind a polymorphic interface, of course.
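In Python terms (a hypothetical sketch, not our actual class names), the idea is that callers program against one abstract interface and never see which generated concrete handle they are holding:

```python
from abc import ABC, abstractmethod

class MetaData(ABC):
    # Single abstract interface the callers see.
    @abstractmethod
    def get(self, index):
        ...

class IntColumn(MetaData):
    # Stands in for one generated typed handle.
    def __init__(self, values):
        self._values = values
    def get(self, index):
        return self._values[index]

class StrColumn(MetaData):
    # Stands in for another generated typed handle.
    def __init__(self, values):
        self._values = values
    def get(self, index):
        return self._values[index]

def lookup(column: MetaData, index):
    # Works for any concrete column; the caller never names its type.
    return column.get(index)
```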
Oh, I could go on about this Python hack, and maybe I will later, but right now I have a cold to nurse so I am off to get a hot cup of tea...