Now how exactly was it I did that?
RRResearch has some thoughts about keeping records of computer work:
When I do benchwork I consistently keep pretty good notes. I write down everything I do as I do it, on numbered and dated sheets of paper that go into looseleaf binders, organized by experiment.
…
But I don’t seem to be able to apply these good record-keeping habits when I’m working with computers. Instead everything I do feels ‘exploratory’, as if everything I do is just a preliminary check to see what effect a modification will have, before I do something worth writing down.
I recognise this all too well.
It is not so much a problem when I do some exploratory data analysis. I will have my R log to see what I actually did, and if I find an interesting pattern I know what I found and I don’t really need the history of how I got there so much.
When writing programs I don’t have the problem either. There I have source control and bug trackers to help me.
My problem is with scripts.
I write a small script to format my data into something I can analyse. Run a program or two on the data. Write another script to re-format the data. A small script to pull out relevant data. Look at that. Then I need to just check a few things, and that is easy with another little script.
Very soon I have ten to twenty small scripts of five to ten lines each. None of them are really worth putting in version control or cleaning up or anything, ’cause it was all just exploratory anyway, but if I come back to the data a few weeks later, I have no way of reproducing what I did.
It is really horrible.
Ideally, once I know what I want to do with the data, I should clean up the pipeline, put it under version control and document it, but by then I am already done with the data analysis so I rarely bother.
Until I have to do it all again a few weeks or months later on some new data.
At that point I should really clean up the pipeline, but most likely I need to do something slightly different. Not drastically different, but a few of the steps should be modified anyway, and depending on the results I need a few more scripts and it just spirals out of control.
I don’t really know how to solve this; I only know that what I am doing is quite sub-optimal.
–
May 14th, 2009 at 9:55 am
I use git for version control, because it is very easy to set up (it doesn’t require installing a server) and lets me keep a backup on GitHub.
To keep track of the scripts I run, I make great use of Makefiles.
If you keep the makefile, the scripts, and the result data under version control, you can easily look at the history and see which commands you ran on your data and what state every script was in when you ran it.
Some time ago I wrote a small slideshow on make. Have a look at it if you want:
- http://bioinfoblog.it/2009/03/seminar-on-makefiles-in-bioinformatics/
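The idea of committing the makefile, the scripts, and the result data together can be sketched in a few lines of Python. This is only a minimal illustration, assuming git is installed and the current directory is already a repository. The function names snapshot_commands and snapshot are made up for this example and are not part of git or of any library.

```python
import subprocess
from typing import List, Sequence


def snapshot_commands(paths: Sequence[str], message: str) -> List[List[str]]:
    """Build the two git commands that record the given files in one commit."""
    return [
        ["git", "add", "--", *paths],
        ["git", "commit", "-m", message],
    ]


def snapshot(paths: Sequence[str], message: str, run=subprocess.check_call) -> None:
    """Run the commands in order.

    With the default runner, check_call raises CalledProcessError if git
    exits with an error, which includes the case of nothing new to commit.
    """
    for command in snapshot_commands(paths, message):
        run(command)
```

A call such as snapshot(["Makefile", "get_data.py", "results.txt"], "Re-ran analyze_data") would then leave one commit per run in the history (results.txt is a hypothetical file name). Passing a different run function makes it possible to print the commands instead of executing them.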
May 14th, 2009 at 10:00 am
[quote]Very soon I have ten to twenty small scripts of five to ten lines each. None of them are really worth putting in version control or cleaning up or anything, ’cause it was all just exploratory anyway, but if I come back to the data a few weeks later, I have no way of reproducing what I did.[/quote]
Really, this is something you can easily solve with makefiles. I had the same problem before I learned to use them.
You should write simple makefiles, without worrying about prerequisites or about using file names as targets.
For example:
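# Targets here are task names, not files, so declare them phony.
.PHONY: debug_data analyze_data all

debug_data:
	ipython -pylab -i -c 'import mymodule; mymodule.calculate_stats()'

analyze_data:
	python get_data.py -option1 -option2

all: analyze_data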
May 14th, 2009 at 1:21 pm
I have used Makefiles extensively, but the problem here is that I typically do not have a pipeline to run my data through until I have finished the data analysis, so a Makefile wouldn’t really help me :)
I write the script, run it, and never look back… until I need to do something similar weeks down the line…
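One possible way to leave a trail even for throwaway scripts is to launch them through a small wrapper that appends each run to a log. The following is a rough sketch under my own assumptions, not a tested workflow: the log file name runlog.jsonl and the function run_and_log are hypothetical, and the wrapper assumes the scripts are Python files short enough that storing their full text per run is acceptable.

```python
import datetime
import json
import subprocess
import sys
from pathlib import Path

LOG_FILE = Path("runlog.jsonl")  # hypothetical log name, one JSON record per line


def run_and_log(argv, log_file=LOG_FILE):
    """Record a script's command line and source text, then run it.

    argv[0] is the path of the script; the remaining items are its arguments.
    Returns the exit code of the script.
    """
    script = Path(argv[0])
    entry = {
        "time": datetime.datetime.now().isoformat(timespec="seconds"),
        "command": [str(a) for a in argv],
        # The scripts are only a few lines long, so keep the full source:
        # the log then still shows what ran after the script is edited.
        "script_source": script.read_text(encoding="utf-8"),
    }
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    # Run the script with the same interpreter that runs this wrapper.
    return subprocess.call([sys.executable, *entry["command"]])
```

Calling run_and_log(["reformat.py", "input.txt"]) (both names hypothetical) would append one record and then run the script, so weeks later the log shows which scripts ran, in what order, and with which arguments.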
May 17th, 2009 at 5:16 pm
Aw, man, I have wasted more time than I care to think about trying to reconstruct something I did four weeks ago and didn’t keep a good record of.