Running CoalHMM on Xgrid

Our Linux grid at work is busy with the final CoalHMM analyses of the orangutan genome, along with some analyses of fungal genomes, and starting next week we will be running the gorilla genome through our pipeline there as well, so it will be busy for a while.

Since we have a bunch of Macs around the offices, I figured it would be worthwhile to set those up to run CoalHMM so I can get started on the bonobo genome that we also want to analyse.  So I spent most of today setting up Xgrid for that.

That turned out to be slightly harder than I thought, but I finally worked it out, and this is what I did:

Getting the damn program to run

I decided to use GridStuffer for the job.  It is a nice GUI for Xgrid that takes care of most of the job management that I don’t really want to handle manually.

For that, you need a “meta job file” that just contains the command lines you want to run.  Naively, I thought I could just call the CoalHMM tool, see what problems I ran into, and then fix them, but no such luck.
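For illustration, a naive first version of the meta job file would just list one command line per job, something like this (the path and the chunk file names here are made up, not the real CoalHMM invocation):

```
/Users/me/bin/coalhmm chunk_001.fa
/Users/me/bin/coalhmm chunk_002.fa
```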

Doing that, the jobs just kept failing, without any output you could use for debugging.  I don’t think this is GridStuffer-specific; it seems to be how Xgrid handles executables it can’t deal with.

First I thought it was because Xgrid tried to run the tool with an absolute path – as it would with other executables if you just give it one – and since CoalHMM isn’t installed on all the machines, that of course won’t work.  So I tried various combinations of options to get it to copy the executable along (like -files and -dirs and combinations of those), but nothing worked.  And still no output I could use for debugging.

Frustrated with this, I wrote a wrapper script that Xgrid should at least be able to run and that could produce error messages.  That worked fine, and showed me that the problem was indeed the executable not being found.
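A minimal sketch of such a wrapper (the file name is mine, not anything Xgrid requires; the point is just that Xgrid will happily run a shell script and you get its stderr back):

```shell
# Create a small diagnostic wrapper (file name hypothetical).  Xgrid
# can run a plain shell script even when it chokes on the binary
# itself, so log where we are and what we are about to run, then
# hand over to the real command line.
cat > coalhmm-wrapper.sh <<'EOF'
#!/bin/sh
echo "host: $(hostname)" >&2
echo "cwd:  $(pwd)"      >&2
echo "cmd:  $*"          >&2
exec "$@"
EOF
chmod +x coalhmm-wrapper.sh

# Local sanity check: stdout of the wrapped command passes through.
./coalhmm-wrapper.sh echo ok
```

The stderr lines are what come back from the agents when the wrapped command cannot be run, which is exactly the debugging output the bare executable never produced.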

OK, that was what I expected, so I played some more with the -files option, with little success until I copied the executable to the working directory and gave -files a relative path.  Finally the program was copied to the agents that should run the job.
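With the executable sitting in the working directory, the meta job line then looked something like this (as I read the GridStuffer docs, options such as -files go at the front of the line, before the actual command; the chunk name is made up):

```
-files coalhmm ./coalhmm chunk_001.fa
```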

Not that it ran, though.  It was missing a number of dynamic libraries.  I tried copying those as well, with little success, until I put the executable and the dynamic libraries together in a local directory and added an -in option to the command line to run the job there.  That finally worked.  I’m not sure if it can be achieved in other ways, but I can live with this; having all the executable code in a directory that is copied to the agents isn’t much of a problem.
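So the setup that finally worked is one directory holding the binary plus the dynamic libraries it links against (on a Mac, `otool -L coalhmm` will list those), and a meta job line pointing -in at that directory, roughly like this (directory and chunk names hypothetical):

```
-in jobdir ./coalhmm chunk_001.fa
```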

Copying the data files

The next problem was copying the data to the agent.

I have a single big file with a whole genome alignment that I want to analyse, but for the analysis we typically break this down into “chunks” of around 1Mbp that we analyse independently.

Obviously I don’t want to copy the entire alignment just to analyse one megabase, and since I don’t have a shared file system for these Macs I had to split the data into the chunks I want analysed.

Since putting all I need into a single directory and using the -in option worked for running the program, I took the same approach here.  I split the genome alignment into the chunks I need, put each job into a separate directory together with the executable, and fire it all off with the -in option.
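The staging step can be sketched like this (all file and directory names are made up, and the stand-in files below only mimic the real chunks and executable bundle; the real script also has to run the actual alignment splitter first):

```shell
# Stage one job directory per alignment chunk, each with its own
# copy of the executable code, and emit one meta-job line per chunk.

# Stand-ins for the real executable bundle and alignment chunks:
mkdir -p bin
: > bin/coalhmm
touch chunk_001.fa chunk_002.fa

: > metajob.txt
for chunk in chunk_*.fa; do
    dir="job_${chunk%.fa}"
    mkdir -p "$dir"
    cp bin/* "$dir"/          # executable + dylibs travel with the job
    cp "$chunk" "$dir"/
    echo "-in $dir ./coalhmm $chunk" >> metajob.txt
done
```

Each agent then gets a self-contained directory shipped to it by -in, so no shared file system is needed.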

So far it seems to be working, but I have only tried a few chunks.  The script I use for splitting up the data (and generating the meta job file in the same go) takes quite a while to run (and I’m running it only on my desktop computer), so it was still running when I left the office.

I have an exam tomorrow, so I won’t get back to it until Wednesday, but by then I should have the data packaged nicely up into chunks and have a meta job file for GridStuffer.  Then I’ll fire it off and see how it goes.

