Archive for September 18th, 2009

Getting Xgrid up and running

Friday, September 18th, 2009

OK, I don’t really have a lot of Macs, but I’m planning on getting a Mac Pro or two at the office for some of my computations, so I wanted to figure out how Xgrid works.  As a proof of principle, I tried it out on my MacBook and iMac.

Just getting the grid up and running, I ran into a few problems, so I’m going to write down how I finally managed to get it going, so I can reproduce it later.

Setting up the grid

Step one, I downloaded the Xgrid Admin tool from here.  I had installed it earlier without getting around to playing with it, but that installation disappeared with my upgrade to Snow Leopard, so I had to install it again.

Starting the tool up, it asks for a controller.  I told it to just use my iMac and gave it a password.  All well and good, but so far no Agents to actually run any jobs.

Step two, I enabled Xgrid in the Sharing pane of System Preferences.

Under Configure I picked the controller I had set up, and again I gave it a password.  This is where I ran into the first problem.  I mistyped the password: I wanted the same one as for the controller, just to make it easier on myself, but got it wrong.  The agent started up fine and didn’t complain about the password or anything, but it didn’t show up as an agent in the Admin tool.

I tried adding it there, but was told it was unavailable.  I mucked about with this for a while but just couldn’t get it to work at all.

It wasn’t until I tried connecting the MacBook instead that I figured it out.  There I got the password right, and it popped up in the Admin tool.  So I made a wild guess that the password was the problem, retyped it in the Sharing dialogue, and now the iMac finally connected as an agent, and the Admin tool told me I had 5.46 GHz to compute with.

I’m a bit miffed that there were no authentication steps that could have told me what was going wrong, but I guess the trick is to just pick the same password for the controller and all the agents, or something like that, because that at least seems to work for me now.

Running a job

To submit jobs, you have to use the xgrid command.

Just running it gives you this:

$ xgrid
xgrid
usage: xgrid <options> <action> <parameters>
Any number of the following <options> may be specified:
 -h[ostname] <hostname-or-IP-address>
 -auth {Password | Kerberos}
 -p[assword] <password>
 -f[ormat] xml
A single <action> and its <parameters> must be specified:
 -grid list
 -grid rename -gid <grid-identifier> <new-name>
 -grid add <grid-name>
 -grid {delete | attributes} -gid <grid-identifier>
 -job list [-gid <grid-identifier>]
 -job {attributes | specification | log | wait} -id <job-identifier>
 -job submit [-gid <grid-identifier>] [-si <stdin>] [-in <indir>] \
 [-dids jobid[,jobid]*] [-email address] \
 [-art <art-path> -artid <art-identifier] [-artequal <art-value>] \
 [-artmin <art-value>] [-artmax <art-value>] <cmd> <arg1> ...
 -job batch [-gid <grid-identifier>] <xml-batch-submission-file>
 -job results -id <job-identifier> [-tid <task-identifier>] \
 [-so <stdout>] [-se <stderr>] [-out <outdir>]
 -job {stop | suspend | resume | delete | restart} -id <job-identifier>
 -job run [-gid <grid-identifier>] [-si <stdin>] [-in <indir>] \
 [-so <stdout>] [-se <stderr>] [-out <outdir>] [-email address] \
 [-art <art-path> -artid <art-identifier] [-artequal <art-value>] \
 [-artmin <art-value>] [-artmax <art-value>] <cmd> <arg1> ...

xgrid -?, or xgrid with no arguments, will print this usage message.

I don’t really know what the options mean, so I tried firing off a few, and I just kept getting the same output.  A bit disappointing.

I guessed that the hostname option was needed.  -hlocalhost just didn’t work for me, but eventually I found out that “-h localhost” would.  Well, not exactly work, but at least it complained that I needed to authenticate the command:

$ xgrid -h localhost -grid list
{
 error = "could not connect to localhost (Authentication failed)";
}

Adding a password with “-p password” did the trick.  Again, you do need the space between -p and the password.

$ xgrid -h localhost -p password -grid list
{
 gridList =     (
 0
 );
}

I don’t know what the output means here, but at least I was making progress.

Asking for a job list (I’m guessing here) gave me an empty list:

$ xgrid -h localhost -p password -job list
{
 jobList =     (
 );
}

which I expected since I haven’t submitted any jobs, so I tried sending a simple “ls” command.

$ xgrid -h localhost -p password -job submit ls
{
 jobIdentifier = 0;
}

In the Xgrid Admin tool I saw that the job had failed, so I figured it could be a path thing and gave it the full path of the command:

$ xgrid -h localhost -p password -job submit /bin/ls
{
 jobIdentifier = 1;
}

and that seemed to do the trick.

Asking for a job list with xgrid shows me two jobs:

$ xgrid -h localhost -p password -job list
{
 jobList =     (
 0,
 1
 );
}
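
If you want to use that list from a script rather than read it by eye, here is a minimal Python sketch that pulls the identifiers out of the plist-style text shown above.  The function name parse_id_list is my own invention, and the regular expression only assumes the output looks like the samples in this post; it is not a general parser for this format.

```python
import re

def parse_id_list(output, key="jobList"):
    """Extract the integers listed under `key` from plist-style xgrid output.

    Hypothetical helper: it only handles the simple 'key = ( 0, 1 );' shape
    seen in the transcripts in this post.
    """
    # Find 'key = ( ... );' and capture everything between the parentheses.
    pattern = re.escape(key) + r"\s*=\s*\((.*?)\)\s*;"
    match = re.search(pattern, output, re.DOTALL)
    if match is None:
        raise ValueError(f"no {key!r} entry found in output")
    # The captured body is a comma-separated list of integers, possibly empty.
    return [int(token) for token in re.findall(r"-?\d+", match.group(1))]

sample = """{
 jobList =     (
 0,
 1
 );
}"""
print(parse_id_list(sample))  # [0, 1]
```

The same function works for the grid list if you pass key="gridList".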

I figured that -job results should give me the states of the jobs, like I could see them in the Admin tool, but I don’t get any output when I run that command, so I don’t know how that is supposed to work.

I can delete a job, though:

$ xgrid -h localhost -p password -job delete -id 0
{
}

But I still haven’t figured out how to get the status or output of the job from xgrid.

I guess it is time to stop experimenting and read the manual…

Update: Ok, I did just one more experiment.  If I run a program that is guaranteed to give some output I do get that output when I ask for the result.  I tried just running xgrid and I got the help text.  I guess the ls command I tried before was run in an empty directory and that is why it didn’t produce any output.

I still haven’t found a nice way to get the status, but the -job attributes command at least gives me a lot of info about the job including the jobStatus.
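
To keep the submit, attributes and results steps straight, here is a small Python sketch that builds the argument lists for them.  It uses only flags that appear in the usage message above; the helper names xgrid_argv and run_xgrid are hypothetical, and "localhost" and "password" are placeholders.  Keeping each option and its value as separate list elements matches what I saw earlier, where “-h localhost” worked and -hlocalhost did not.

```python
import shutil
import subprocess

def xgrid_argv(host, password, *action):
    """Build an argument list for the xgrid command (hypothetical helper).

    Every option and every value is its own list element, so the option
    and its value are never glued together into one token.
    """
    return ["xgrid", "-h", host, "-p", password, *[str(a) for a in action]]

def run_xgrid(host, password, *action):
    """Run xgrid and return its standard output; needs xgrid on the PATH."""
    if shutil.which("xgrid") is None:
        raise RuntimeError("xgrid command not found on this machine")
    result = subprocess.run(xgrid_argv(host, password, *action),
                            capture_output=True, text=True, check=True)
    return result.stdout

# The three steps from this post as argument lists (placeholder host/password).
submit = xgrid_argv("localhost", "password", "-job", "submit", "/bin/ls")
status = xgrid_argv("localhost", "password", "-job", "attributes", "-id", 1)
output = xgrid_argv("localhost", "password", "-job", "results", "-id", 1)
for argv in (submit, status, output):
    print(" ".join(argv))
```

Only run_xgrid actually talks to a controller; the rest just prints the command lines, so it is safe to try on a machine without Xgrid.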

I still have some experimenting and reading to do before I get the grid up and running on some of the computations I am actually interested in, but I am optimistic now at least.

HMMoC and HMMConverter

Friday, September 18th, 2009

I just want to say a few words about a short paper I read last week, and a paper that is a few years old now but related to it.

The first is out in advance access in Nucleic Acids Research:

HMMConverter 1.0: a toolbox for hidden Markov models

Lam and Meyer

Hidden Markov models (HMMs) and their variants are widely used in Bioinformatics applications that analyze and compare biological sequences. Designing a novel application requires the insight of a human expert to define the model’s architecture. The implementation of prediction algorithms and algorithms to train the model’s parameters, however, can be a time-consuming and error-prone task. We here present HMMCONVERTER, a software package for setting up probabilistic HMMs, pair-HMMs as well as generalized HMMs and pair-HMMs. The user defines the model itself and the algorithms to be used via an XML file which is then directly translated into efficient C++ code. The software package provides linear-memory prediction algorithms, such as the Hirschberg algorithm, banding and the integration of prior probabilities and is the first to present computationally efficient linear-memory algorithms for automatic parameter training. Users of HMMCONVERTER can thus set up complex applications with a minimum of effort and also perform parameter training and data analyses for large data sets.

The other was published in Bioinformatics in 2007:

HMMoC – a compiler for hidden Markov models

Lunter

Hidden Markov models are widely applied within computational biology. The large data sets and complex models involved demand optimized implementations, while efficient exploration of model space requires rapid prototyping. These requirements are not met by existing solutions, and hand-coding is time-consuming and error-prone. Here, I present a compiler that takes over the mechanical process of implementing HMM algorithms, by translating high-level XML descriptions into efficient C++ implementations. The compiler is highly customizable, produces efficient and bug-free code, and includes several optimizations.

Both papers describe compilers that generate C++ implementations of hidden Markov model algorithms from XML specifications, and really they are very similar.

The basic HMM algorithms are quite straightforward to implement, but more complex models such as pair-HMMs or generalized HMMs bring a few more complications.  If you also need to optimize the algorithms for running time or memory usage, there are more involved algorithms to choose from.  One is “banding”, implemented in both HMMoC and HMMConverter, which risks giving sub-optimal results but runs much faster and uses much less memory.  Another is the Hirschberg algorithm, implemented only in HMMConverter as far as I can see, which trades a doubling in running time for much reduced memory consumption.
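
To show what banding does, here is a rough Python sketch.  It is not a pair-HMM and it is not how either tool implements it: I am using plain edit distance as a stand-in, because it has the same two-dimensional dynamic programming table.  With a band, only cells close to the diagonal are filled in, which saves time and memory but can miss the best alignment if it wanders outside the band.

```python
def edit_distance(a, b, band=None):
    """Edit distance between a and b.

    If band is given, only cells with |i - j| <= band are filled, so the
    result can be larger than the true distance (or infinite if the end
    cell lies outside the band).
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j if band is None or j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        lo = 0 if band is None else max(0, i - band)
        hi = m if band is None else min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of a
                continue
            cur[j] = min(prev[j - 1] + (a[i - 1] != b[j - 1]),  # (mis)match
                         prev[j] + 1,                           # deletion
                         cur[j - 1] + 1)                        # insertion
        prev = cur
    return prev[m]

print(edit_distance("kitten", "sitting"))          # 3
print(edit_distance("kitten", "sitting", band=1))  # band wide enough here
print(edit_distance("abcdefgh", "cdefghab"), edit_distance("abcdefgh", "cdefghab", band=1))
```

The last line compares the full and the banded result for a pair where the best alignment needs a shift of two positions, which a band of one cannot represent.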

Implementing such extra algorithms is not conceptually hard, but can be quite tedious and error-prone, so it makes good sense to have code generators building the algorithms for you.  That is exactly what these tools do.

From a bird’s-eye view, the tools are very similar.  You specify the HMM in an XML file (a specification language that I personally don’t like that much, but that is of course very subjective) and the tools then generate the algorithms you ask them to, output as C++ code.

HMMoC provides a number of handles for you to add your own C++ code to the generated code; I am not sure if HMMConverter does the same, but on the other hand HMMConverter provides handles for various constraints on the parameters so it might be easier to re-parameterize models made with that.

Another cool feature unique to HMMConverter is priors on sequence annotation.  You can provide an annotation to the input sequence(s) that is then incorporated in the emission probabilities.  The prior is really on hidden states, but incorporating it into the emission probabilities has exactly the effect you want: it weights the posterior probabilities of the hidden states along the input.
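
Here is a minimal Python sketch of that idea, under my own assumptions: the prior is a table of per-position weights for each hidden state (a hypothetical format, not HMMConverter’s), and it is simply multiplied into the emission probabilities before running an ordinary forward-backward pass.  The toy two-state model is made up for illustration.

```python
def posteriors(init, trans, emit, obs, prior=None):
    """Posterior state probabilities per position (normalised forward-backward).

    prior[t][k] is a hypothetical weight for state k at position t; it is
    multiplied into the emission probability of state k at that position.
    """
    n, T = len(init), len(obs)
    if prior is None:
        prior = [[1.0] * n for _ in range(T)]
    # Effective emissions: emission probability times annotation weight.
    e = [[emit[k][obs[t]] * prior[t][k] for k in range(n)] for t in range(T)]

    # Forward pass, normalising each column to avoid underflow.
    fwd = []
    for t in range(T):
        if t == 0:
            col = [init[k] * e[0][k] for k in range(n)]
        else:
            col = [sum(fwd[t - 1][j] * trans[j][k] for j in range(n)) * e[t][k]
                   for k in range(n)]
        s = sum(col)
        fwd.append([c / s for c in col])

    # Backward pass, normalised the same way.
    bwd = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        col = [sum(trans[k][j] * e[t + 1][j] * bwd[t + 1][j] for j in range(n))
               for k in range(n)]
        s = sum(col)
        bwd[t] = [c / s for c in col]

    # Combine and renormalise so each position sums to one.
    post = []
    for t in range(T):
        col = [fwd[t][k] * bwd[t][k] for k in range(n)]
        s = sum(col)
        post.append([c / s for c in col])
    return post

# Toy model: two hidden states, three output symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2]

# Annotation: at position 1 we rule out state 1 entirely.
prior = [[1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
print(posteriors(init, trans, emit, obs)[1])
print(posteriors(init, trans, emit, obs, prior)[1])
```

With the annotation in place, the posterior for state 1 at position 1 drops to zero and the neighbouring positions shift accordingly, even though nothing in the algorithm itself has changed.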

To deal with numerical issues, HMMConverter works in log-space while HMMoC uses something called “extended-exponent real numbers”.  Working in log-space can be really slow for the Forward and Backward algorithms, since you have to switch in and out of log-space to deal with sums of probabilities (the Viterbi algorithm doesn’t have this problem, so there the log-space solution is pretty fast).

Unfortunately, there isn’t any comparison between the execution times of algorithms generated with the two tools in the new paper, so I don’t know how much this matters.  In the HMM library I am developing with Andreas we found that the log-solution was very slow, though, and therefore we use a re-scaling approach instead.
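
For the curious, here is a small Python sketch of the two approaches side by side on a made-up two-state model.  It is not code from HMMoC, HMMConverter or our library, just the textbook Forward algorithm written twice: once in log-space, where every sum of probabilities needs exp and log calls, and once with per-column re-scaling, where the inner loop is plain multiplication and addition and the log is taken once per column.  Both assume the sequence has non-zero probability under the model.

```python
import math

def safe_log(p):
    """log(p), with log(0) mapped to minus infinity."""
    return math.log(p) if p > 0 else -math.inf

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over log-probabilities."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log(init, trans, emit, obs):
    """Forward algorithm in log-space; returns log P(obs)."""
    n = len(init)
    alpha = [safe_log(init[k]) + safe_log(emit[k][obs[0]]) for k in range(n)]
    for x in obs[1:]:
        # Each cell needs a log-sum-exp over all predecessor states.
        alpha = [logsumexp([alpha[j] + safe_log(trans[j][k]) for j in range(n)])
                 + safe_log(emit[k][x]) for k in range(n)]
    return logsumexp(alpha)

def forward_scaled(init, trans, emit, obs):
    """Forward algorithm with per-column re-scaling; returns log P(obs)."""
    n = len(init)
    loglik = 0.0
    alpha = None
    for t, x in enumerate(obs):
        if t == 0:
            alpha = [init[k] * emit[k][x] for k in range(n)]
        else:
            alpha = [sum(alpha[j] * trans[j][k] for j in range(n)) * emit[k][x]
                     for k in range(n)]
        scale = sum(alpha)            # probability mass of this column
        alpha = [a / scale for a in alpha]
        loglik += math.log(scale)     # one log per column, not per cell
    return loglik

# Toy model: two hidden states, three output symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 1, 0] * 50

print(forward_log(init, trans, emit, obs))
print(forward_scaled(init, trans, emit, obs))
```

The two functions should print the same log-likelihood up to rounding; the difference is only in how much work is done per cell.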

I would love to see a comparison of the runtime efficiency between the approaches, but just not quite enough to go and do it myself right now…

  • Lam, T., & Meyer, I. (2009). HMMCONVERTER 1.0: a toolbox for hidden Markov models. Nucleic Acids Research. DOI: 10.1093/nar/gkp662
  • Lunter, G. (2007). HMMoC – a compiler for hidden Markov models. Bioinformatics, 23(18), 2485-2487. DOI: 10.1093/bioinformatics/btm350
