Posts Tagged ‘data management’

Old code + old data = death by a thousand cuts

Thursday, January 8th, 2009

Again today I am struggling with data files I worked on a year ago.

My current API and library don't like the old files, so I've tried checking out old code from my source code repository.

In theory that should work. However, at the time I was doing this analysis I had several versions of my libraries installed, 'cause we were in the middle of developing the new version of SNPFile, and I don't know which versions of the tools, linked to which versions of the libraries, I was using...

I really need more data discipline!
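One cheap form of that discipline is to write a small "provenance" file next to every data file at the moment it is created. The sketch below is only an illustration: the function name `write_provenance`, the sidecar naming scheme and the layout of the record are assumptions, not part of SNPFile or any tool mentioned here, and the caller has to supply the tool and library versions.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(data_path, tool_versions, note=""):
    """Write <data file>.provenance.json next to the data file.

    tool_versions is a plain dict supplied by the caller, for example
    tool or library name -> version string or repository revision.
    (Hypothetical helper; the record layout is an assumption.)
    """
    data_path = Path(data_path)
    record = {
        "data_file": data_path.name,
        # Checksum ties the record to this exact file content.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "tool_versions": dict(tool_versions),
        "note": note,
    }
    sidecar = data_path.with_name(data_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

A year later, the sidecar says which revisions to check out, and the checksum shows whether the data file has changed since. It only helps if the sidecar is stored (and backed up) together with the data.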

On a different project we are desperately looking for log files from a data filtering script.  We need some info that we probably should have stored with the data, but we didn't think about it at the time, so we didn't.  Reading the filtering script, we can see that the info could be reconstructed from the logs it wrote, but we cannot find these logs.

Chances are that they are not backed up either, since they were probably stored with the primary data. We keep that on separate drives (there are gigs and gigs of it), and those drives are not backed up, since we have the primary data backed up elsewhere.

We really need more data discipline!

--

Post score: 8-3 = 5

Data management...

Wednesday, January 7th, 2009

Man, today has been a complete waste.

I've spent the entire day trying to extract relevant data from binary files.

Changing file formats

I started the day working on an association mapping project, where we recently got some new data from deCODE as part of a replication study. The data is in a binary format used internally at deCODE, so to analyse it I needed to convert it into the format we use, SNPFile.

We've written a few programs for doing that over the last few years.  The data we get is usually almost, but not quite, in the same format as we are used to. Only very rarely can we use our converter just as it is.  This wasn't one of those rare occasions.

Actually, accessing the genotype data isn't that difficult; the problems are always with secondary data, like which population each individual belongs to. This we get in a text file that we need to parse and merge with the binary data. How we do this varies from time to time.

Anyway, the file I had to process today contained two new markers to add to a data set we worked on a few months back.  I thought I could just extract the information with the program from last time, since I needed all the same secondary data as before, and had it in the same text file, but nothing is ever that easy.

In the new file, five of the individuals from the previous data set were missing.  So I had to figure out which ones were missing, and then merge the data so that those individuals get missing values for the two new markers.

A very simple thing to do if you script it, but this is binary data with only a C++ API.
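To show how little there is to it once the data is reachable from a script, here is a minimal sketch in plain Python. It assumes the genotypes have already been read into lists, one row per individual; the function name `merge_new_markers`, the use of `None` as the missing-genotype code and the list-based layout are all made up for the illustration and are not the SNPFile or deCODE API.

```python
MISSING = None  # assumed missing-genotype code, for illustration only


def merge_new_markers(old_ids, old_genotypes, new_ids, new_genotypes,
                      n_new_markers):
    """Append the new markers to each individual's old genotype row.

    Individuals that are in the old data set but absent from the new
    file get MISSING for every new marker. Returns the merged rows
    (in the old order) and the ids that were absent.
    """
    new_by_id = dict(zip(new_ids, new_genotypes))
    absent = [ind for ind in old_ids if ind not in new_by_id]

    merged = []
    for ind, row in zip(old_ids, old_genotypes):
        extra = new_by_id.get(ind, [MISSING] * n_new_markers)
        merged.append(list(row) + list(extra))
    return merged, absent
```

The returned `absent` list answers the "which ones were missing" question directly, and the merged rows keep the order of the old data set, so the secondary data from last time still lines up.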

I'm seriously considering writing a Python API for it just to make it easier for me in the future, 'cause I've been hacking with this C++ API for simple scripting tasks every time I've received new data...

Well, if my collaboration with deCODE continues after the current project finishes (and we are finishing it right now) I think I will.

Bit rot

I've spent the afternoon analysing some data I last looked at in December 2007.  It is a mix of a population genetics study and an association mapping study, where we've combined SNP data from the HapMap populations with a Danish cancer study.

Managing the data was a hassle so I converted it all into our SNPFile format with all the population and phenotype data stored with our meta data framework.

That solved a lot of my problems for the analysis then, but now I need to do a little more analysis suggested by a couple of reviewers, so I'm looking at it again.

There is a major problem. The type information we store in our meta data framework, which is what enables us to access the data through Python, was added to the framework after I made these files.

So I have all the data I need stored in these files, but I do not know the type of the data, and so I cannot extract it from scripts.

So I've had to figure out all this info again, either by reading through my old code to figure out the types, or by going to the original data and getting it there.
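The general idea of storing type information with the values can be sketched in a few lines. This is not how our meta data framework does it; the function names, the JSON layout and the restriction to non-empty lists of `int`, `float` or `str` are assumptions made to keep the example small.

```python
import json

# Type tags that can be stored, and how to turn the stored text back
# into a value. (Assumed set of types, for illustration only.)
_DECODERS = {"int": int, "float": float, "str": str}


def encode_meta(meta):
    """Serialise {name: non-empty list of same-typed values} with type tags."""
    tagged = {}
    for key, values in meta.items():
        tag = type(values[0]).__name__
        if tag not in _DECODERS or any(type(v).__name__ != tag for v in values):
            raise TypeError(f"unsupported or mixed types for {key!r}")
        # Values are stored as text, as they would be in an untyped blob;
        # the tag is what makes them recoverable later.
        tagged[key] = {"type": tag, "values": [str(v) for v in values]}
    return json.dumps(tagged, sort_keys=True)


def decode_meta(text):
    """Rebuild the typed values using the stored tags."""
    tagged = json.loads(text)
    return {
        key: [_DECODERS[entry["type"]](v) for v in entry["values"]]
        for key, entry in tagged.items()
    }
```

Without the `"type"` field, a reader of the file sees only text and has to go back to old code or the original data to know whether a column holds integers, floats or labels, which is exactly the situation described above.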

I really hate going back to old data.  It is much worse than going back to old code.  With the code, I always keep track of changes through version control software, but with data I am just too sloppy.

--

PS. "Post score" 7-2=5