Posts Tagged ‘Xgrid’

Configuring Xgrid … again!

Wednesday, April 28th, 2010

The replacement for my broken office machine came this morning.  I got a nice Mac Pro this time, to get some more computing power to add to our Xgrid.

It’s a rather nice machine, but the screen, although 24″ like the iMac’s, seems a bit small.  Probably just because there is not much of a border around it, so to compensate I connected two screens… which reminds me that I need to go and get a new DisplayPort converter before I need one for connecting my MacBook to a projector…

It was also a rather nice surprise when iStat Menus showed 16 cores instead of the previous two.

There are actually only 8 cores (it has two quad-core CPUs), but with hyper-threading it looks like 16.

So far so good.

I configured it by extracting everything from my Time Machine backup from the crashed iMac.  That turned out to be a mistake, though.

When I tried to configure it for Xgrid – the reason why I got a Mac Pro rather than another iMac – I ran into trouble.

I need this machine to run a controller (because my iMac ran as the controller for our grid earlier, and the grid had been down since it was smashed), but I just couldn’t start the controller daemon!  It flatly refused to read the database file (/var/xgrid/controller/datastore.db).  I was under the impression that if I deleted this file it would just create a new one, but no such luck for me.  There was absolutely nothing I could do to get it to accept this file (or the absence of it) in the hours I worked with this…

I gave up late afternoon and decided to just reinstall everything from scratch, so I reformatted the disk and installed again.  This time I extracted Applications and Users from Time Machine only (which is all I need anyway), and finally I could start the Xgrid controller.

Now I was ready for the next problem.  Configuring the controller.

I don’t remember exactly how I managed to do this the last time, but I seem to recall that I could do it with Xgrid Admin, so I downloaded that.  I couldn’t set up the authentication that way this time around, though.

As a side note, configuring agents – the machines that can run jobs on the grid – is pretty easy.  It is all built in, and you just go to Sharing > Xgrid, pick a controller and set a password.

There is nothing similar for the controller.  There might be for the Server OS, but I couldn’t find anything on my machine.

For telling the controller which password to use, I found this blog post.  Basically, you need to copy the password file you created when you configured the agent over to the controller.

That just wasn’t enough.

I still needed to tell the controller to actually use password authentication rather than any other option.  Googling for an hour or more finally led me to the file /Library/Preferences/com.apple.xgrid.controller.plist for configuring the controller.  Now I just needed to figure out how to tell it to use password authentication.

In the corresponding file for agents, /Library/Preferences/com.apple.xgrid.agent.plist, there’s the field

<key>ControllerAuthentication</key>
<string>Password</string>

so I tried setting the same in the controller configuration.  That didn’t work, so I tried

<key>AgentAuthentication</key>
<string>Password</string>

and that did the trick.
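
A minimal sketch of the same edit done from a script rather than a text editor.  It assumes Python 3 with the standard plistlib module, that the file is an ordinary property list, and that the script runs with enough privileges to write under /Library/Preferences.  The function name is made up for illustration; only the file path and the AgentAuthentication key come from the steps above.

```python
import plistlib
from pathlib import Path

# Path taken from the post; writing to it normally requires root.
CONTROLLER_PLIST = Path("/Library/Preferences/com.apple.xgrid.controller.plist")


def set_agent_authentication(plist_path, method="Password"):
    """Set AgentAuthentication in a controller plist, keeping other keys.

    Hypothetical helper: it only rewrites the preference file and does
    nothing about restarting the controller.
    """
    path = Path(plist_path)
    settings = {}
    if path.exists():
        with path.open("rb") as handle:
            settings = plistlib.load(handle)  # reads XML or binary plists
    settings["AgentAuthentication"] = method
    with path.open("wb") as handle:
        plistlib.dump(settings, handle)  # writes XML by default
    return settings
```

Calling set_agent_authentication(CONTROLLER_PLIST) would then add the key shown above while leaving whatever else is in the file alone.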

Finally, the controller was up and running.

My machine, as an agent, only provided four cores to the grid, but I knew what to do about that, so I updated the agent configuration to provide 16 cores (there are really only 8, but with hyper-threading that should probably be considered 16).

As soon as I get the other agents configured with a new controller (the new machine has a different IP address than the old one), our grid should be back up and running.

All in all I wasted an entire day getting this up and running, but without the grid there really isn’t that much of my current data analysis I can get done, so it had to be done.

Running CoalHMM on Xgrid

Monday, October 26th, 2009

Our Linux grid at work is busy with the final CoalHMM analyses of the orangutan genome we are running there plus some analyses of some fungi genomes, and starting next week we are going to run the gorilla genome through our pipeline there, so it will be busy for a while.

Since we have a bunch of Macs around the offices, I figured it would be worthwhile to set those up to run CoalHMM so I can get started on the bonobo genome that we also want to analyse, so I spent most of today setting up Xgrid for that.

That turned out to be slightly harder than I thought, but I finally worked it out, and this is what I did:

Getting the damn program to run

I decided to use GridStuffer for the job.  It is a nice GUI for Xgrid that takes care of most of the job management that I don’t really want to handle manually.

For that, you need a “meta job file” that just contains the command lines you want to run.  Naively I thought I could just call the CoalHMM tool and see what problems I would run into and then try to fix them, but not so.

Doing that, the jobs just keep failing, but you won’t get any output you can use for debugging.  I don’t think this is GridStuffer specific but just how Xgrid handles executables it can’t deal with.

First I thought that it was because it tried to run the tool with the absolute path – like Xgrid would do with other executables if you just give it that – and since CoalHMM isn’t installed on all the machines that will of course not work.  So I tried various combinations of options to get it to copy the executable with it (like -files and -dirs and combinations of that) but nothing worked.  And no output I could use to debug it.

Frustrated with this, I wrote a wrapper script that Xgrid at least should be able to run and produce error messages, and that worked fine, showing me that indeed it was problems with the executable not being found.
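
The wrapper itself is not shown here, so what follows is only a minimal sketch of the idea, written in Python 3 on the assumption that the agents have an interpreter available (the original wrapper may well have been a shell script).  It prints where it is running and what files it can see, then tries to start the real command and reports why that failed instead of dying silently.  The name run_and_report is made up for illustration.

```python
import os
import subprocess
import sys


def run_and_report(argv, out=sys.stdout):
    """Run argv and write diagnostics plus the command's output to `out`."""
    out.write("working directory: %s\n" % os.getcwd())
    out.write("files here: %s\n" % ", ".join(sorted(os.listdir("."))))
    try:
        # Merge stderr into stdout so loader errors (for instance missing
        # dynamic libraries) end up in the job output as well.
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
    except OSError as err:
        # Raised when the executable is missing or not executable.
        out.write("could not start %r: %s\n" % (argv[0], err))
        return 127
    out.write(proc.stdout)
    out.write("exit status: %d\n" % proc.returncode)
    return proc.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_and_report(sys.argv[1:]))
```

Submitting the wrapper with the real command line as its arguments means that a failed job at least leaves a message behind saying what could not be started.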

Ok, that is what I expected, so I played some more with the -files option but with little success until I copied the executable to the working directory and used a relative path for the -files option.  Finally the program was copied to the agents that should run the job.

Not that it ran, though.  It was missing a number of dynamic libraries.  So I tried copying those as well, but with little success until I put the executable and all the dynamic libraries in a local directory and added an -in option to the command line to run it there.  That finally worked.  I’m not sure if it can be achieved in other ways, but I can live with that.  Having all the executable code in a directory to be copied to the agents isn’t that much of a problem.

Copying the data files

The next problem was copying the data to the agent.

I have a single big file with a whole genome alignment that I want to analyse, but for the analysis we typically break this down into “chunks” of around 1Mbp that we analyse independently.

Obviously I don’t want to copy the entire alignment just to analyse one megabase, and since I don’t have a shared file system for these Macs I had to split the data into the chunks I want analysed.

Since putting all I need into a single directory and using the -in option worked for running the program, I took the same approach here.  I split the genome alignment into the chunks I need, put each job into a separate directory together with the executable and fire it all off with the -in option.
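
The one-directory-per-job layout can be sketched roughly as follows.  This is not the actual splitting script: the alignment format is not described here, so cutting out a chunk is left to a caller-supplied write_chunk function, and the names build_jobs, chunk.aln, metajobs.txt and ./coalhmm are all placeholders.  The layout of each meta job line (the -in option first, then the command) is also an assumption that should be checked against the GridStuffer documentation.

```python
import os
import shutil


def chunk_ranges(total_length, chunk_size=1000000):
    """Yield half-open (start, end) ranges that cover total_length positions."""
    for start in range(0, total_length, chunk_size):
        yield start, min(start + chunk_size, total_length)


def build_jobs(total_length, write_chunk, bundle_files, out_dir,
               command="./coalhmm", chunk_size=1000000):
    """Create one directory per chunk and return the meta job file path.

    write_chunk(start, end, path) must write that slice of the alignment
    to path; bundle_files are copied into every job directory (the
    executable and its dynamic libraries, for example).
    """
    out_dir = os.path.abspath(out_dir)
    os.makedirs(out_dir, exist_ok=True)
    lines = []
    for index, (start, end) in enumerate(chunk_ranges(total_length, chunk_size)):
        job_dir = os.path.join(out_dir, "chunk_%05d" % index)
        os.makedirs(job_dir, exist_ok=True)
        for path in bundle_files:
            shutil.copy2(path, job_dir)
        write_chunk(start, end, os.path.join(job_dir, "chunk.aln"))
        # Assumed line layout: working directory via -in, then the command.
        lines.append("-in %s %s chunk.aln" % (job_dir, command))
    meta_path = os.path.join(out_dir, "metajobs.txt")
    with open(meta_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return meta_path
```

Copying the executable into every directory wastes some disk space, but it keeps each job self-contained, which is what made the -in approach work in the first place.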

So far it seems to be working, but I have only tried a few chunks.  The script I use for splitting up the data (and generating the meta job file in the same go) takes quite a while to run (and I’m running it only on my desktop computer) so it was still running when I left the office.

I have an exam tomorrow so I won’t get back to it until Wednesday, but by then I should have the data packaged nicely up into chunks and have a meta job file for GridStuffer.  Then I’ll fire it off and see how it goes.


This limits the usefulness of Xgrid a bit…

Saturday, September 19th, 2009

Ok, I noticed this yesterday but figured it was a configuration issue that I could deal with.  When I run jobs on Xgrid, it runs one job per CPU and not one per core, which for my current use means that I only have half the CPU power compared to manual distribution of jobs.

I read the documentation, and it is supposed to run a job per core, but something is wrong on Snow Leopard and this is apparently a known issue.

I hope this gets fixed before I have a real need for the grid.


Getting Xgrid up and running

Friday, September 18th, 2009

Ok, I don’t really have a lot of Macs but I’m planning on getting a Mac Pro or two at the office for some of my computations, so I wanted to figure out how Xgrid works so I can use that for those computations.  So I wanted to try it out on my MacBook and iMac, just as a proof of principle.

Just getting the grid up and running, I ran into a few problems, so I’m going to write down how I finally managed to get it going here, so I can reproduce it later.

Setting up the grid

Step one, I downloaded the Xgrid Admin tool from here. I had also installed it earlier but without getting around to playing with it, but that installation disappeared with my upgrade to Snow Leopard and I had to install it again.

Starting the tool up, it asks for a controller.  I told it to just use my iMac and gave it a password.  All well and good, but so far no Agents to actually run any jobs.

Step two, I enabled Xgrid in the Sharing pane of System Preferences.

Under Configure I picked the controller I had set up, and again I gave it a password.  Now comes the first problem I ran into.  I mistyped the password here – I wanted the same as for the controller just to make it easier on myself, but got it wrong.  The agent started up fine, didn’t complain about the password or anything, but it didn’t show up as an agent in the Admin tool.

I tried adding it there, but was told it was unavailable.  I mucked about with this for a while but just couldn’t get it to work at all.

It wasn’t until I tried connecting the MacBook instead that I figured it out.  There I got the password right, and it popped up in the Admin tool.  So I made a wild guess about the password being the problem, retyped it in the Sharing dialogue, and now the iMac finally connected as an agent, and the Admin tool told me I had 5.46 GHz to compute with.

I’m a bit miffed that there were no authentication steps that could have told me what was going wrong, but I guess the trick is to just pick the same password for the controller and all the agents or something like that, ’cause that at least seems to work for me now.

Running a job

To submit jobs, you have to use the xgrid command.

Just running it gives you this:

$ xgrid
xgrid
usage: xgrid <options> <action> <parameters>
Any number of the following <options> may be specified:
 -h[ostname] <hostname-or-IP-address>
 -auth {Password | Kerberos}
 -p[assword] <password>
 -f[ormat] xml
A single <action> and its <parameters> must be specified:
 -grid list
 -grid rename -gid <grid-identifier> <new-name>
 -grid add <grid-name>
 -grid {delete | attributes} -gid <grid-identifier>
 -job list [-gid <grid-identifier>]
 -job {attributes | specification | log | wait} -id <job-identifier>
 -job submit [-gid <grid-identifier>] [-si <stdin>] [-in <indir>] \
 [-dids jobid[,jobid]*] [-email address] \
 [-art <art-path> -artid <art-identifier] [-artequal <art-value>] \
 [-artmin <art-value>] [-artmax <art-value>] <cmd> <arg1> ...
 -job batch [-gid <grid-identifier>] <xml-batch-submission-file>
 -job results -id <job-identifier> [-tid <task-identifier>] \
 [-so <stdout>] [-se <stderr>] [-out <outdir>]
 -job {stop | suspend | resume | delete | restart} -id <job-identifier>
 -job run [-gid <grid-identifier>] [-si <stdin>] [-in <indir>] \
 [-so <stdout>] [-se <stderr>] [-out <outdir>] [-email address] \
 [-art <art-path> -artid <art-identifier] [-artequal <art-value>] \
 [-artmin <art-value>] [-artmax <art-value>] <cmd> <arg1> ...

xgrid -?, or xgrid with no arguments, will print this usage message.

I don’t really know what the options mean, so I tried firing off a few, and I just kept getting the same output.  A bit disappointing.

I guessed that the hostname option was needed.  “-hlocalhost” just didn’t work for me, but eventually I found out that “-h localhost” would.  Well, not exactly work, but at least it complained that I needed to authenticate the command:

$ xgrid -h localhost -grid list
{
 error = "could not connect to localhost (Authentication failed)";
}

Adding a password with “-p password” did the trick.  Again, you do need the space between -p and the password.

$ xgrid -h localhost -p password -grid list
{
 gridList =     (
 0
 );
}

I don’t know what the output means here, but at least I was making progress.

Asking for a job list (I’m guessing here) gave me an empty list:

$ xgrid -h localhost -p password -job list
{
 jobList =     (
 );
}

which I expected since I haven’t submitted any jobs, so I tried sending a simple “ls” command.

$ xgrid -h localhost -p password -job submit ls
{
 jobIdentifier = 0;
}

In the Xgrid Admin tool I saw that the job had failed, so I figured it could be a path thing and gave it the full path of the job

$ xgrid -h localhost -p password -job submit /bin/ls
{
 jobIdentifier = 1;
}

and that seemed to do the trick.

Asking for a job list with xgrid shows me two jobs:

$ xgrid -h localhost -p password -job list
{
 jobList =     (
 0,
 1
 );
}
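
The two lessons so far (a space between each option and its value, and a full path for the command) can be folded into a small helper that builds the argument list for a submission.  This is a minimal Python 3 sketch under my own assumptions: the function name is made up, and the path is resolved on the submitting machine, so it only helps when the agents have the program in the same place.

```python
import os
import shutil


def xgrid_submit_argv(command, args=(), host="localhost", password="password"):
    """Build the argv for `xgrid ... -job submit` with an absolute command path."""
    full_path = shutil.which(command)  # looks the command up on the local PATH
    if full_path is None:
        raise ValueError("cannot resolve %r to a full path" % command)
    # Every option and its value are separate list items, which is the
    # argv equivalent of writing "-h localhost" rather than "-hlocalhost".
    return ["xgrid", "-h", host, "-p", password,
            "-job", "submit", os.path.abspath(full_path)] + list(args)
```

The resulting list can be handed to subprocess.check_output on a machine where the xgrid tool is installed.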

I figured that -job results should give me the states of the jobs, like I could see them in the Admin tool, but I don’t get any output when I run that command, so I don’t know how that is supposed to work.

I can delete a job, though:

$ xgrid -h localhost -p password -job delete -id 0
{
}

but I still haven’t figured out how to get the status or output of the job from xgrid.

I guess it is time to stop experimenting and read the manual…

Update: Ok, I did just one more experiment.  If I run a program that is guaranteed to give some output I do get that output when I ask for the result.  I tried just running xgrid and I got the help text.  I guess the ls command I tried before was run in an empty directory and that is why it didn’t produce any output.

I still haven’t found a nice way to get the status, but the -job attributes command at least gives me a lot of info about the job including the jobStatus.
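
One way to pull just the status out of that output is to ask for XML with the -f xml option from the usage message and parse it as a property list.  The sketch below is Python 3 and rests on a few assumptions: that the XML output is a plist the standard plistlib module can read, and that jobStatus appears somewhere inside it (the helper searches for the key instead of assuming where it is nested).  The function names are made up for illustration.

```python
import plistlib
import subprocess


def find_key(node, key):
    """Return the first value stored under `key` anywhere in a nested plist."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def job_status_from_xml(xml_bytes):
    """Extract jobStatus from the XML output of `-job attributes`."""
    return find_key(plistlib.loads(xml_bytes), "jobStatus")


def job_status(job_id, host="localhost", password="password"):
    """Ask the controller for a job's attributes and return its status."""
    argv = ["xgrid", "-h", host, "-p", password, "-f", "xml",
            "-job", "attributes", "-id", str(job_id)]
    return job_status_from_xml(subprocess.check_output(argv))
```

With that in place, job_status(1) would return whatever the controller reports for job 1, or None if no jobStatus entry turns up in the output.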

I still have some experimenting and reading to do before I get the grid up and running on some of the computations I am actually interested in, but I am optimistic now at least.
