Man, today has been a complete waste.
I've spent the entire day trying to extract relevant data from binary files.
Changing file formats
I started the day working on an association mapping project. We recently got some new data from deCODE as part of a replication study. The data is in a binary format used internally at deCODE, so to analyse it I needed to convert it into the format we use, SNPFile.
We've written a few programs for doing that over the last few years. The data we get is usually almost, but not quite, in the format we are used to. Only very rarely can we use our converter just as it is. This wasn't one of those rare occasions.
Actually, accessing the genotype data isn't that difficult; the problems are always with the secondary data, like which population each individual belongs to. This we get in a text file that we need to parse and merge with the binary data. How we do this varies from time to time.
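To give an idea of how small the merge step would be in a scripting language, here is a minimal sketch in Python. It assumes the text file is tab-separated with the individual ID in the first column and the population in the second, and that the genotypes are already in a plain dictionary. Both are assumptions made for illustration (the real files differ from delivery to delivery), and the function names are made up.

```python
import csv
import io


def read_populations(handle):
    """Map individual ID -> population label.

    Assumes a tab-separated file with the ID in column 1 and the
    population in column 2 (a hypothetical layout).
    """
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if row}  # skip blank lines


def attach_populations(genotypes, populations, unknown="NA"):
    """Pair each individual's genotype calls with its population label.

    Individuals absent from the text file get the `unknown` label.
    """
    return {
        ind: (populations.get(ind, unknown), calls)
        for ind, calls in genotypes.items()
    }


# Small in-memory example instead of a real file.
text = io.StringIO("ind1\tCEU\nind2\tYRI\n")
populations = read_populations(text)
merged = attach_populations({"ind1": [0, 1], "ind3": [2, 2]}, populations)
```

In this example `ind3` has no entry in the text file, so it ends up with the placeholder label rather than being dropped silently.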
Anyway, the file I had to process today contained two new markers to add to a data set we worked on a few months back. I thought I could just extract the information with the program from last time, since I needed all the same secondary data as before, and had it in the same text file, but nothing is ever that easy.
In the new file, five of the individuals from the previous data set were missing. So I had to figure out which ones were missing and merge the data so that those individuals are still included, with missing information on the two new markers.
A very simple thing to do if you script it, but this is binary data with only a C++ API.
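For comparison, this is roughly what I mean by "simple if you script it". The sketch assumes the old individual IDs and the new genotype calls are already available as ordinary Python objects, which is exactly what the binary format does not give me. The names and the use of `None` as the missing-data value are hypothetical.

```python
MISSING = None  # placeholder for a missing genotype call (an assumption)


def merge_new_markers(old_ids, new_calls, n_new_markers, missing=MISSING):
    """Line the new markers up with the individuals of the old data set.

    old_ids:       individual IDs from the previous data set, in order
    new_calls:     dict mapping ID -> list of calls for the new markers
    n_new_markers: number of new markers (two in my case)

    Returns the merged calls, in the old order, plus the list of
    individuals that were absent from the new file.
    """
    absent = [ind for ind in old_ids if ind not in new_calls]
    merged = {
        ind: list(new_calls.get(ind, [missing] * n_new_markers))
        for ind in old_ids
    }
    return merged, absent


old_ids = ["a", "b", "c"]
new_calls = {"a": [0, 1], "c": [2, 2]}
merged, absent = merge_new_markers(old_ids, new_calls, n_new_markers=2)
```

Here `absent` lists the individuals that are missing from the new file, and `merged` keeps them in place with missing calls for the new markers.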
I'm seriously considering writing a Python API for it just to make it easier for me in the future, 'cause I've been hacking with this C++ API for simple scripting tasks every time I've received new data...
Well, if my collaboration with deCODE continues after the current project finishes (and we are finishing it right now) I think I will.
Bit rot
I've spent the afternoon analysing some data I last looked at in December 2007. It is a mix of a population genetics study and an association mapping study, where we've combined SNP data from the HapMap populations with a Danish cancer study.
Managing the data was a hassle so I converted it all into our SNPFile format with all the population and phenotype data stored with our meta data framework.
That solved a lot of my problems for the analysis then, but now I need to do a little more analysis suggested by a couple of reviewers, so I'm looking at it again.
There is a major problem. The type information we store in our meta data framework, which is what enables us to access the data through Python, was added to the framework after I made these files.
So I have all the data I need stored in these files, but I do not know the type of the data and I cannot extract it from scripts.
So I've had to figure out all this info again, either by reading through my old code to work out the types, or by going back to the original data and getting it from there.
I really hate going back to old data. It is much worse than going back to old code. With the code, I always keep track of changes through version control software, but with data I am just too sloppy.
--
PS. "Post score" 7-2=5