Data management…

Man, today has been a complete waste.

I’ve spent the entire day trying to extract relevant data from binary files.

Changing file formats

I started the day working on an association mapping project. We recently got some new data from deCODE as part of a replication study. The data is in a binary format used internally at deCODE, so to analyse it I needed to convert it into the format we use, SNPFile.

We’ve written a few programs for doing that over the last few years. The data we get is usually almost, but not quite, in the same format as we are used to. Only very rarely can we use our converter exactly as it is. This wasn’t one of those rare occasions.

Actually, accessing the genotype data isn’t that difficult. The problems are always with the secondary data, like which population each individual belongs to. This we get in a text file that we need to parse and merge with the binary data. How we do this varies from time to time.

Anyway, the file I had to process today contained two new markers to add to a data set we worked on a few months back. I thought I could just extract the information with the program from last time, since I needed all the same secondary data as before, and had it in the same text file, but nothing is ever that easy.

In the new file, five of the individuals from the previous data set were missing. So I had to figure out which ones, and then merge the data so that those individuals were recorded with missing genotypes for the two new markers.

A very simple thing to do if you script it, but this is binary data with only a C++ API.
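To show just how simple it would be in a scripting language, here is a minimal sketch in Python. It assumes the genotypes have already been read into plain dictionaries keyed by individual ID. The function name, the dictionary layout and the `MISSING` marker are all made up for illustration; none of this is part of SNPFile or the deCODE API.

```python
MISSING = None  # hypothetical stand-in for a missing genotype


def merge_new_markers(old_genotypes, new_genotypes, n_new_markers):
    """Append the new markers to every individual in the old data set.

    old_genotypes: dict mapping individual ID -> list of genotypes
    new_genotypes: dict mapping individual ID -> genotypes for the new markers
    Individuals absent from new_genotypes get MISSING for each new marker.
    Returns the merged dict and the sorted list of absent individual IDs.
    """
    missing_ids = sorted(set(old_genotypes) - set(new_genotypes))
    merged = {}
    for ind, genotypes in old_genotypes.items():
        extra = new_genotypes.get(ind, [MISSING] * n_new_markers)
        merged[ind] = list(genotypes) + list(extra)
    return merged, missing_ids


# Toy data: ind2 is in the old data set but not in the new file.
old = {"ind1": ["AA", "AG"], "ind2": ["GG", "AG"], "ind3": ["AG", "AA"]}
new = {"ind1": ["CT", "TT"], "ind3": ["CC", "CT"]}

merged, missing_ids = merge_new_markers(old, new, n_new_markers=2)
print(missing_ids)     # the individuals missing from the new file
print(merged["ind2"])  # old genotypes followed by two missing values
```

A set difference finds the missing individuals, and a dictionary lookup with a default fills in the gaps. That is about all there is to it once the data is out of the binary file.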

I’m seriously considering writing a Python API for it just to make it easier for me in the future, ’cause I’ve been hacking with this C++ API for simple scripting tasks every time I’ve received new data…

Well, if my collaboration with deCODE continues after the current project finishes (and we are finishing it right now) I think I will.

Bit rot

I’ve spent the afternoon analysing some data I last looked at in December 2007. It is a mix of a population genetics and association mapping study where we’ve combined SNP data from the HapMap populations with a Danish cancer study.

Managing the data was a hassle, so I converted it all into our SNPFile format, with all the population and phenotype data stored using our meta data framework.

That solved a lot of my problems for the analysis then, but now I need to do a little more analysis suggested by a couple of reviewers, so I’m looking at it again.

There is a major problem. The type information we store in our meta data framework, which is what enables us to access the data through Python, was added to the framework after I made these files.

So I have all the data I need stored in these files, but I do not know the type of the data and I cannot extract it from scripts.

So I’ve had to figure out all this info again, either by reading through my old code to figure out the types, or by going to the original data and getting it there.
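Why the missing type information hurts so much is easy to show with a toy example. This is only a sketch using Python's standard `struct` module, not our actual meta data framework; the `TYPE_FORMATS` table and the `decode` function are hypothetical names for illustration. The same four bytes decode to completely different values depending on the type you assume, and nothing in the bytes themselves tells you which one is right.

```python
import struct

# Four raw bytes, as they might sit in a binary file without type information.
raw = struct.pack("<i", 1065353216)

# Without a stored type, both of these readings are equally "valid".
as_int = struct.unpack("<i", raw)[0]    # read as a 32-bit integer
as_float = struct.unpack("<f", raw)[0]  # read as a 32-bit float

print(as_int)
print(as_float)

# Hypothetical type tags stored next to the data remove the guesswork.
TYPE_FORMATS = {"int32": "<i", "float32": "<f"}


def decode(raw_bytes, type_tag):
    """Decode raw bytes using a type tag stored alongside the data."""
    return struct.unpack(TYPE_FORMATS[type_tag], raw_bytes)[0]


print(decode(raw, "float32"))
```

With a tag stored next to each value, a script can pick the right decoding by itself. Without it, I am back to reading old code to find out what I meant.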

I really hate going back to old data.  It is much worse than going back to old code.  With the code, I always keep track of changes through version control software, but with data I am just too sloppy.

PS. “Post score” 7-2=5

Happy New Year

Yeah yeah yeah, I know we are a whole week into the new year and it is a bit late, but that is what this post is about…

It’s been a quiet few months from me now, with weeks between posts. I’ve been working on a new project and have not kept up to date on my usual interests, so I haven’t had much to write about.

Not that it is really a problem that I’m not blogging, but I am getting a bit worried about staying too long out of the loop, so one of my new year resolutions was to blog some more, if for nothing else because it would force me to read some more…

With a week’s delay in wishing a happy new year, we can all see how successful I have been with this resolution.

Okay, since I work best under pressure — some would say I only work under pressure — I am now setting a goal for myself this year: I want to write on average at least one post per day.  That is a lot, so expect some pretty short posts from time to time, but I hope I can still find time for some longer ones.

Okay, so since today is the 7th and this is the first post this year, the “post score” is 7-1 = 6.

Expect more from me soon!