SNPFile

Just breaking my silence again to tell you that this paper of mine has come out:

SNPFile – A software library and file format for large scale association mapping and population genetics studies

J. Nielsen and T. Mailund

BMC Bioinformatics 2008, 9:526 doi:10.1186/1471-2105-9-526

Abstract

Background
High-throughput genotyping technology has enabled cost effective typing of thousands of individuals in hundreds of thousands of markers for use in genome wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to efficiently store and manipulate the data. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions.
Results
We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods.
Conclusions
The new file format has been very useful for our own studies where it has significantly reduced the informatics burden in keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation with the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. This is alleviated by a scripting interface that makes it easy to write converters to and from the format.

In the genome wide association studies I’ve been involved with, mainly in the PolyGene project with DeCODE, we needed a file format to handle the massive data we had to analyse.

We needed something that was fast to load and fast to scan through, and we needed to be able to store various meta-data (various co-variates, different phenotypes, gender, …) associated with each individual or each marker (rs#, genomic position, …).

The file format described in the paper is what we came up with.

The primary data, the genotypes, is just a memory-mapped matrix stored column-wise to keep the analysis local in memory, which reduces seek time when the data is on disk and reduces cache misses when it is in main memory.
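
To make the idea of a column-wise, memory-mapped genotype matrix concrete, here is a minimal Python sketch. It is not the SNPFile format or its API: the one-byte genotype codes, the toy data and the helper names (`write_column_wise`, `marker_column`, `allele_counts`) are all assumptions made up for illustration. It only shows why storing each marker contiguously makes a marker-by-marker scan read consecutive bytes.

```python
import mmap
import os
import tempfile

N_INDIVIDUALS = 4
N_MARKERS = 3

# Hypothetical toy data: genotypes[i][m] is the genotype code (0, 1 or 2)
# of individual i at marker m.
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 0],
    [0, 2, 1],
]


def write_column_wise(path, rows):
    """Write one byte per genotype, with each marker's column contiguous."""
    with open(path, "wb") as f:
        for m in range(N_MARKERS):
            f.write(bytes(row[m] for row in rows))


def marker_column(mm, m):
    """Return the genotypes of all individuals at marker m as one slice."""
    start = m * N_INDIVIDUALS
    return mm[start:start + N_INDIVIDUALS]


def allele_counts(path):
    """Scan marker by marker and sum the genotype codes in each column."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [sum(marker_column(mm, m)) for m in range(N_MARKERS)]


# Write the toy matrix to a temporary file, scan it, then clean up.
fd, path = tempfile.mkstemp(suffix=".geno")
os.close(fd)
try:
    write_column_wise(path, genotypes)
    counts = allele_counts(path)
finally:
    os.remove(path)

print(counts)
```

Because the file is mapped rather than read, the operating system only pages in the columns a scan actually touches, and a scan over neighbouring markers touches neighbouring bytes.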

For meta-data we have a serialisation framework so we can store any kind of C++ type.

We also wrote a Python API for the file format, which we have used whenever we need some quick conversion of the format or a quick-and-dirty scan of some sort.

We even use it to prototype our new methods now. At APBC’09 next month, Søren Besenbacher, Christian N.S. Pedersen and I have a paper on a method that was implemented in Python and uses this file format for a genome wide scan.

Of course, the serialisation framework for meta-data is rather C++ specific, so it was a bit of a hassle to make it play well with Python. It doesn’t fully do that yet, actually.

What we had to do was to pick the most common types we use and explicitly build Python interfaces for those.  This means that when we are building the Python API we are generating code for all the various combinations of basic types and basic containers.  It makes the compilation really slow, but after it is compiled it works beautifully.

We hide all the different type handles away through a polymorphic interface, of course.
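
As a rough illustration of that pattern, here is a small pure-Python sketch: one handler per combination of basic type and container, all hidden behind a common interface. It is only an analogy under stated assumptions. The real bindings are generated C++ code, and every name here (`Handler`, `ScalarHandler`, `VectorHandler`, `HANDLERS`, `store`, `fetch`) is hypothetical and not part of the SNPFile API.

```python
import struct
from abc import ABC, abstractmethod


class Handler(ABC):
    """Common interface that hides the concrete type of a meta-data entry."""

    @abstractmethod
    def dumps(self, value): ...

    @abstractmethod
    def loads(self, data): ...


class ScalarHandler(Handler):
    """Serialises a single value of one basic type."""

    def __init__(self, code):
        self._struct = struct.Struct("<" + code)

    def dumps(self, value):
        return self._struct.pack(value)

    def loads(self, data):
        return self._struct.unpack(data)[0]


class VectorHandler(Handler):
    """Serialises a list whose elements all have one basic type."""

    def __init__(self, code):
        self._code = code
        self._size = struct.calcsize("<" + code)

    def dumps(self, values):
        return struct.pack("<%d%s" % (len(values), self._code), *values)

    def loads(self, data):
        n = len(data) // self._size
        return list(struct.unpack("<%d%s" % (n, self._code), data))


# Assumed basic types and containers; one handler per combination.
BASIC_TYPES = {"int": "i", "double": "d"}
CONTAINERS = {"scalar": ScalarHandler, "vector": VectorHandler}

HANDLERS = {
    (container, basic): cls(code)
    for container, cls in CONTAINERS.items()
    for basic, code in BASIC_TYPES.items()
}


def store(meta, key, container, basic, value):
    """Serialise value and remember which handler can read it back."""
    data = HANDLERS[(container, basic)].dumps(value)
    meta[key] = (container, basic, data)


def fetch(meta, key):
    """Look up the handler by its type tags and deserialise the entry."""
    container, basic, data = meta[key]
    return HANDLERS[(container, basic)].loads(data)


meta = {}
store(meta, "positions", "vector", "int", [1042, 5520, 9931])
store(meta, "age", "scalar", "double", 41.5)
print(fetch(meta, "positions"))
print(fetch(meta, "age"))
```

The number of handlers grows as the product of basic types and containers, which is the same combinatorial growth that makes the real compilation slow.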

Oh, I could go on about this Python hack, and maybe I will later, but right now I have a cold to nurse so I am off to get a hot cup of tea…


14 Responses to “SNPFile”

  1. Roald Forsberg Says:

    “off to get a hot cup of tea” – on a Friday!?

    Don’t you think you have spent too much time in England the last couple of years?

  2. Mailund Says:

    Hey, the post was written on a Thursday ;-)

  3. gioby Says:

    Looks interesting!!
    I will read your article.
    Looking forward to the other part.

  4. gioby Says:

    So, your article is interesting, and your library could be useful for me.

    However, I have some questions:
    - what is the difference between your binary format and other binary formats used to handle big quantities of data? For example, HDF5 or the one used by the ROOT framework?
    - Your article is very complete, however it doesn’t say which kind of tests you have applied to test your format. If I want to use it, how can I be sure that it doesn’t introduce errors or lose data?
    As often happens, it is the fault of the editor of the journal, because they never publish data on tests, while it would be useful.

  5. Thomas Mailund Says:

    Hi gioby,

    A main difference between SNPFile and those formats you mention is that SNPFile is explicitly designed to store genetic data whereas the others are more general frameworks. So we can optimize our data storage to the access patterns typical for our use. To which degree you can do that with the other frameworks, I don’t really know.

    As for the testing, the unit tests are provided with the code. Of course there are no guarantees that there are no bugs (I am willing to guarantee that there are) and no guarantees that you will not lose data. This is academic software provided for free, so that is the deal, as always.

    We have used it for the last two years with very few problems, that is all I can say :)

  6. gioby Says:

    I was proposing to adopt your binary format to the biopython mailing list.
    The population genetics module in biopython is in its early phases of development, so I thought it could be a good idea to propose its adoption or at least to include a handler for your format.

    Why don’t you have a look at the discussion on the mailing list?
    - http://lists.open-bio.org/pipermail/biopython/2008-December/004830.html

  7. Thomas Mailund Says:

    Thanks, I’ll have a look at the discussion :)

  8. gioby Says:

    You are welcome :)

    p.s. it seems that BMC Bioinformatics hasn’t published the full article yet. It says that ‘the fully formatted PDF and HTML versions are in production’, so I can’t access any supplementary data.
    Moreover, the link posted in the article (http://www.daimi.au.dk/∼mailund/SNPFile/) doesn’t seem to work (it says ‘Page not found’).
    So nothing, there is no way to see the code.

  9. Thomas Mailund Says:

    The link is correct, but I think there is something wrong with the server. I cannot access other parts of my home page right now :(

  10. Thomas Mailund Says:

    No, wait, I see the problem with the URL: the ∼ is not the same character as ~. If you’ve cut the link from the PDF, it uses the wrong tilde.

  11. gioby Says:

    thanks!
    It works: I can confirm that the deb package installs correctly on an Ubuntu Hardy 8.04.

  12. gioby Says:

    thanks!
    It works: I can confirm that the deb package installs correctly on an Ubuntu Hardy 8.04.
    It seems nice to have the parsers to convert it from other existing formats like fastPhase etc.

  13. Thomas Mailund Says:

    We wrote the converters before we made the Python interface. These days, I would write such things as scripts instead.

    Sorry to confess, we haven’t written a manual for the Python interface yet :-(

  14. gua sha Says:

    I really liked the way they came off
