Grid workflow system

For the past week, since last Saturday afternoon, I've been working on and off on a small utility for specifying workflows on our computer grid here at BiRC.

I'm used to using Makefiles to keep track of workflows and to make sure that the files I'm manipulating are up to date. But more and more I need to run my workflows on computer grids - with the size of data I work on these days it is just not feasible to run them on my desktop - and that doesn't quite work with Make.

You can of course use Make to keep track of time stamps and such, but when running jobs you want to submit them to be run in parallel and you need them to be scheduled so you know that the dependent tasks are done before you start running the next set of jobs.

I tried to google around for something to solve that problem but couldn't find anything - perhaps my google-fu just isn't strong enough - but I figured it couldn't be that hard to write it myself, so I did.

After all, it just boils down to specifying some tasks, figuring out which input files they need and which output files they produce, and then building a dependency graph between tasks. The tasks can then be submitted to the computer grid with their dependencies and the queue system takes care of the rest.
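To make the idea concrete, here is a minimal Python sketch of building such a dependency graph. It is not the utility's actual code: the function names and the task dictionary layout are made up for illustration, and it assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter


def build_dependencies(tasks):
    """Map each task name to the set of tasks that produce its inputs.

    `tasks` is a hypothetical layout: name -> {"inputs": [...], "outputs": [...]}.
    """
    # First record which task produces each output file.
    producer = {}
    for name, spec in tasks.items():
        for path in spec["outputs"]:
            if path in producer:
                raise ValueError(
                    f"{path} is produced by both {producer[path]} and {name}"
                )
            producer[path] = name

    # A task depends on whichever task produces one of its input files.
    # Inputs that no task produces are assumed to exist already on disk.
    return {
        name: {producer[p] for p in spec["inputs"] if p in producer}
        for name, spec in tasks.items()
    }


def submission_order(deps):
    """Return the task names so that every task comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())
```

With an order like this, each task could be handed to the queue system together with the job IDs of the tasks it depends on. How those dependencies are passed along depends on the queue system in use.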

I programmed up the dependency graph and the specification language last weekend and the beginning of this week, and then the latter half of the week and this weekend I rewrote the scripts for my current projects to use the workflow system.

Now I have some simple workflow files for my projects and I can submit jobs, with dependencies, with a single shell command. If I need to see which commands will be run, I can do the submission as a "dry run" that only prints them, or I can get a list of tasks with their status (showing what is up to date, or what needs to be run and why), and I can even get a graph showing all the tasks using Graphviz.
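The status check can be sketched along the lines of what Make does with time stamps. This is only an illustration under assumed rules, not the utility's actual logic: the function name `task_status` and the exact reasons it reports are hypothetical.

```python
import os


def task_status(inputs, outputs):
    """Return (up_to_date, reason) for a task, Make-style.

    Assumed rules: a task must run if it declares no outputs, if an output
    file is missing, or if any input is newer than the oldest output.
    """
    if not outputs:
        return False, "no outputs declared"

    missing = [p for p in outputs if not os.path.exists(p)]
    if missing:
        return False, f"missing output: {missing[0]}"

    # Compare modification times: the oldest output must not predate any input.
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    for p in inputs:
        if os.path.exists(p) and os.path.getmtime(p) > oldest_output:
            return False, f"input newer than output: {p}"

    return True, "up to date"
```

A dry run would then amount to printing the command of every task whose status is not up to date, instead of submitting it.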

After writing a bit of documentation I will find some testers for a beta release, and I hope others will find it as useful as I am finding it right now ...

You can get the code at GitHub.

3 Responses to “Grid workflow system”

  1. mw Says:

    There's tons of tools doing this, but not really a generic library. Lars and I did something like that for state spaces (though we didn't get around to distributing it, it was parallel; we called it JoSEL because Lars wouldn't let me name it).

    We're doing something similar for ProM right now (http://www.youtube.com/watch?v=j-niwu2mBXs) and have a PhD student who is supposed to work on it. It gets trickier when you add loops.

    Ronny Mans knows a bunch of tools for this (r.s.mans@tue add .nl).

    I'll forward this to the guys so we can see about making it work together.

  2. Thomas Mailund Says:

    my google-fu is weak :) I suspected that this is a problem that has been solved tons of times, but couldn't find it.

  3. Thomas Mailund Says:

    For bioinformatics, Galaxy also does it very well. What I needed was just a bit more low-level.