Posts Tagged ‘distributed computing’

Some thoughts on grid computing…

Wednesday, October 8th, 2008

Earlier this week, the LHC Computing Grid went online.  A description of the system can be found here, and blog posts about it here, here and here.

This got me thinking about grid computing for small scale scientists like myself.

I’ve had some experience with grid computing (see an old post about it here) but mostly I have found it too much trouble to be worth the effort.

Our typical computer use

For large projects that require years of CPU time, it is well worth the effort to set up the infrastructure to run computations on grids.  You really need it to get your computations done, and the overhead is very small in comparison with the actual computation time.

Most of my projects — and most of the projects we do at BiRC — are a bit different.

We do need the computation power, but we are usually tinkering with our programs for most of a project — since we rarely know exactly how to analyse our data until we are mostly done with it — so we cannot just distribute a fixed version of our software and then start distributing the computations.

The typical work flow is that we write a program for our analysis, then we run the analysis and when we look at the results we find some strange results here and there. Then we extend the software to either extract more information from the data, or to fix a bug that caused the weird results.

We then need to run the analysis again, and repeat the process.

The analysis might take a few CPU days to a few CPU months — so it is small scale for grid applications — but between each analysis we spend a week or so modifying and testing our software.

We have a small cluster of Linux computers for this, and it is always in one of two states: completely overloaded or burning idle cycles.

This is the situation grid computing could fix.  Theoretically we should be able to get CPU cycles off the grid when we need them, and sell them to the grid when we are not running computations ourselves.

In practice, our work pattern makes this difficult.

The problems with small scale grid computing

If you are changing your software all the time, you need to distribute it together with the data you analyse.

This means you either send compiled binaries with the job submissions, or you compile the software as part of the job.

The former is fine if you have a program you can compile, and you’d better link it statically, because there are no guarantees about the libraries you will find on the resources that run it.

If you have a bunch of scripts, you are not so lucky.

There are no guarantees that the computer that will run the computations has the script interpreter — or if it does that it is a version that can run your script — and even if it does, what about the modules you need?

You don’t want to have to compile BioPython or SciPy on a grid machine just to run your scripts.  The overhead in CPU time is going to be several percent of your actual run (at least if you parallelise your computations to a high enough degree to be worth the grid in the first place), and how can you even know that there is a compiler at the other end?  You can’t, and there probably isn’t unless you are very lucky.
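To put rough numbers on that overhead, here is a back-of-the-envelope sketch. The figures are made up for illustration (a 20-minute build and a 10 CPU-day analysis), and the function name is hypothetical; the only point is how the ratio grows as the work is split into more jobs that each rebuild the environment.

```python
def build_overhead(build_minutes, total_cpu_minutes, n_jobs):
    """Fraction of all CPU time spent building, if every job rebuilds the environment.

    All inputs are illustrative assumptions, not measurements from a real grid.
    """
    build_total = build_minutes * n_jobs
    return build_total / (build_total + total_cpu_minutes)


# Assumed numbers: a 20-minute build, and an analysis of 10 CPU days (14400 minutes).
for n_jobs in (1, 10, 100):
    print(n_jobs, "jobs:", round(100 * build_overhead(20, 14400, n_jobs), 1), "% spent building")
```

With these assumed numbers, a single job barely notices the build, while splitting the same analysis into 100 jobs puts roughly 12% of the total CPU time into compiling.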

It is a major pain to see your jobs aborted after slowly making their way through the job queue, just because the host computer cannot even set up the environment you need for your computations.
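One way to at least fail fast is to start every job with a cheap check of the interpreter and modules. The sketch below assumes the scripts are Python and that the modules are importable under the names `scipy` and `Bio` (BioPython's import name); the function name and the version threshold are illustrative choices.

```python
import importlib.util
import sys


def missing_requirements(min_python, modules):
    """Return a list of problems; an empty list means the host looks usable."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        problems.append("Python %d.%d or newer required" % tuple(min_python))
    for name in modules:
        # For a top-level module, find_spec returns None when it cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append("missing module: " + name)
    return problems


# Illustrative requirements; adjust to whatever the analysis actually imports.
problems = missing_requirements((3, 0), ["json", "scipy", "Bio"])
print(problems or "host looks usable")
```

This does not fix anything on the host, but it turns a confusing crash halfway through a run into an immediate and readable rejection.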

What can we do about it?

If we want to use the grid for even smaller scale computations, at the very least we need an easier way to distribute new versions of our programs.

I have an idea for this.

Some grids, at least, are already dealing with “runtime environments” where you can specify that your job needs to run in a certain runtime environment, and the scheduler will only send your jobs to resources that can provide that environment.

This sounds like just the thing, but the catch is that it is up to the resource administrators to set up these environments and to tell the grid system that they provide them.

For something like LHC, it is probably not a problem to convince administrators to provide the right environment, but for Thomas Mailund it is.

What we need is a way for the grid users to be able to install environments on the resources!

So how about this: we introduce the concept of “runtime environment packages” that we can upload to the grid system.  They consist of a setup script (configure ; make) and a test suite, for example.

When a resource is idle, it checks whether there are new environments available in the queue, downloads these, and tries to build and test them.  If it succeeds, it informs the grid system that it can run the new type of environment.  The scheduler only sends jobs to resources that have the right environments, so if your environment tests are working properly, you never end up on a resource that cannot run your jobs.
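The idea can be sketched as a toy model. Everything here is hypothetical: the class names, the `try_install` and `eligible` functions, and the use of plain callables to stand in for the setup script and the test suite. It is not modelled on any existing grid system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class EnvPackage:
    name: str
    build: Callable[[], bool]  # stands in for the setup script (configure ; make)
    test: Callable[[], bool]   # stands in for the package's test suite


@dataclass
class Resource:
    name: str
    environments: Set[str] = field(default_factory=set)

    def try_install(self, package: EnvPackage) -> bool:
        """Run when idle: advertise the environment only if build and tests succeed."""
        if package.build() and package.test():
            self.environments.add(package.name)
            return True
        return False


def eligible(resources: List[Resource], required_env: str) -> List[Resource]:
    """Scheduler side: only resources advertising the environment get the job."""
    return [r for r in resources if required_env in r.environments]


pkg = EnvPackage("my-analysis-v2", build=lambda: True, test=lambda: True)
broken = EnvPackage("my-analysis-v2", build=lambda: False, test=lambda: True)
a, b = Resource("node-a"), Resource("node-b")
a.try_install(pkg)     # builds and passes its tests
b.try_install(broken)  # the build fails here, so node-b never advertises it
print([r.name for r in eligible([a, b], "my-analysis-v2")])  # ['node-a']
```

The design choice that matters is that a failed build costs idle cycles on the resource, not a slot in the job queue.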

We could even add environment requirements on the environment packages, so they don’t have to be self-contained.  E.g. to install SciPy, you don’t want to have to install Python itself, and there is no reason for resources without Python to try to install it only to give up.
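A minimal sketch of that check, assuming each package simply lists the names of the environments it requires. The package names, node names and the function are made up for illustration.

```python
def missing_prerequisites(installed, requires):
    """Names the resource would need before it is worth attempting the build."""
    return sorted(set(requires) - set(installed))


# Hypothetical packages and resources.
packages = {"scipy-env": ["python"], "my-analysis": ["python", "scipy-env"]}
resource_envs = {"node-a": {"python"}, "node-b": set()}

for node, envs in sorted(resource_envs.items()):
    for pkg, requires in sorted(packages.items()):
        missing = missing_prerequisites(envs, requires)
        if missing:
            print(node, pkg, "skip, missing", missing)
        else:
            print(node, pkg, "try to build")
```

Here node-b, which has no Python, skips everything without wasting a build attempt, while node-a tries scipy-env now and my-analysis once scipy-env has been installed.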

To prevent resources from filling up with old environments, we can add a timeout period to environments, so they are deleted when they haven’t been used for a couple of days/weeks/months.
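The timeout could be as simple as the following sketch, which assumes each resource records when an environment was last used. The function name, the two-week limit and the environment names are illustrative assumptions.

```python
import time


def expired_environments(last_used, max_idle_seconds, now=None):
    """Return the names of environments not used within max_idle_seconds.

    `last_used` maps environment name to the time of last use (seconds since epoch).
    """
    if now is None:
        now = time.time()
    return sorted(name for name, stamp in last_used.items()
                  if now - stamp > max_idle_seconds)


DAY = 24 * 60 * 60
# Hypothetical bookkeeping: two environments untouched for 30 days, one used yesterday.
last_used = {"scipy-env": 0, "my-analysis-v1": 0, "my-analysis-v2": 29 * DAY}
print(expired_environments(last_used, max_idle_seconds=14 * DAY, now=30 * DAY))
# ['my-analysis-v1', 'scipy-env']
```

A real implementation would also have to avoid deleting an environment that a still-active one depends on, which this sketch ignores.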

It shouldn’t be that hard to implement.  I am sure I could do it, but I don’t have my own grid infrastructure to work with, so I guess I’ll have to intimidate (I mean persuade) someone else to do it…

Workflows

Saturday, March 8th, 2008

Neil Saunders asks: Can every workflow be automated?

Workflows are something I’ve been thinking about myself, especially in the context of grid computing.

My “grid computing” collaborators are working on ways of running workflows on grid resources. This is a good idea, but I am worried about figuring out the workflows in the first place.

Quoting Saunders:

To me a workflow is rather like a scientific paper: an artificial summary of your work that you put together at the end, describing an imaginary path from starting point to destination that you couldn’t know you were going to follow when you set out. Useful for others who want to follow the same path, less so for the person blazing the trail.

I agree completely on this.

I spend much more time on figuring out how to analyse my data than I ever spend on the actual data analysis.

Of course, I am still working with workflows when I am doing this, but I am fiddling with it all the time, and I have to go in and look at intermediate results in each step in the workflow to make sure everything is running the way it is supposed to.

Giving me tools to efficiently run finished workflows is not going to help me much. Better tools for experimenting with workflows, on the other hand, would win you a beer from me.

Google cluster computing

Tuesday, February 26th, 2008

Google, together with the National Science Foundation (NSF; “National” here means the US), and possibly IBM as well (it isn’t quite clear from the press release), will provide cluster computing to researchers.

This YouTube video describes a Google + IBM project that now looks like it’s only a pilot for a larger one:

 

Yeah! Let’s have more of that, but remember to make it easy to use for scientists. Integrate cluster computers with the desktop!

The video mentions integration with Eclipse — does anyone know more about this?

Let’s kill desktop computing

Tuesday, January 22nd, 2008

I don’t want to get rid of the desktop computer. I like it. It is a nice interface for communicating with my computation tasks, not to mention messaging, emailing, blogging, image editing etc. It is just that I don’t want my desktop computer to run most of my computation tasks. The desktop computer should be for interacting with computations, but there is really no reason why it should also carry out those computations.

A typical situation for me is that either I have very light computational requirements — what is needed for text processing or maybe compiling a TeX document — or I need a lot of computer power — when I am running my data analysis in my research. I don’t think that this is atypical for scientists, at least it is a situation I share with most of the people at BiRC.

My desktop computers are not powerful enough to deal with my scientific computing — well, they are, but it takes ages to run on them — but they have plenty of power to spare when I am just doing “office work”.

Grid computing

The solution has been around for years and is called grid computing. Even way back when I was teaching networks and distributed systems at the computer science department, we would cover the ideas behind grid computing (back then it was really just client/server architectures, RPC and later RMI, distributed file systems etc., but the ideas that are now called grid computing were around).

What we really want is to connect all the computers on the net into one big honking system where we can get the computer power we need, when we need it, from all those machines that are idle anyway. On the rare occasions where we need a super computer, we want to be connected to that as well. Of course, we do not want to pay for having a whole network of computers just standing around waiting for us to need them — much less having a super computer sitting idle waiting for us — but when we need the computer power, we want to be connected to it.

Sun tried to sell the idea with the slogan “the network is the computer”. I don’t really know how well that went, but I haven’t heard the slogan for years, so it can’t have been that successful.

The grid is such a great idea, so why isn’t it widespread already? Why am I still using my personal desktop computer to run my computations?

Personal experiences

I’ve had a bit of experience with grid computing myself. While developing GeneRecon, I needed a lot of computers to test the software (testing was pretty time consuming in itself, and I needed to explore a large parameter space), so I got access to NorduGrid. It was a horrible experience. Setting up the grid to run my own software was such a hassle, and it was never really worth the (limited) CPU cycles I got out of it.

Then I got access to the new Minimal intrusion Grid (MiG) developed by Brian Vinter’s group. That was an improvement over NorduGrid, and good enough to finish the GeneRecon experiments. See

Experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
Future Generation Computer Systems 2007 23 580–586. doi:10.1016/j.future.2006.09.003.

for details.

It was an improvement, but it wasn’t a great experience.

Running programs on MiG requires a lot of extra work. First, input files must be uploaded to the grid. So must the executable for the program, if I haven’t uploaded it already (and I’m ignoring problems with figuring out where the executable can actually be executed and such). Then the job must be specified through a configuration language and submitted to the grid. While the job is executing I have to poll it from time to time to get its status. When it is done, I have to download the output files and clean up after the job.
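To make the amount of ceremony concrete, here are the steps above written out against a toy, in-memory stand-in for a grid. `FakeGrid` and all its methods are hypothetical and are not MiG's actual interface; the file and program names are placeholders.

```python
class FakeGrid:
    """In-memory stand-in for a grid. Hypothetical; this is not MiG's interface."""

    def __init__(self):
        self.files = {}
        self.jobs = {}

    def upload(self, name, data):
        self.files[name] = data

    def submit(self, job_id, executable, inputs, output):
        self.jobs[job_id] = {"inputs": inputs, "output": output, "polls_left": 2}

    def status(self, job_id):
        job = self.jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return "RUNNING"
        # Pretend the program ran and wrote its output file.
        self.files[job["output"]] = "results for " + ",".join(job["inputs"])
        return "FINISHED"

    def download(self, name):
        return self.files[name]

    def delete(self, name):
        del self.files[name]


def run_on_grid(grid, executable, inputs, output):
    for name, data in inputs.items():                # 1. upload the input files
        grid.upload(name, data)
    grid.upload(executable, "<binary>")              # 2. upload the executable
    grid.submit("job-1", executable, sorted(inputs), output)  # 3. describe and submit
    polls = 0
    while grid.status("job-1") != "FINISHED":        # 4. poll (a real client would sleep)
        polls += 1
    result = grid.download(output)                   # 5. download the output
    for name in list(inputs) + [executable, output]:  # 6. clean up
        grid.delete(name)
    return result, polls


grid = FakeGrid()
result, polls = run_on_grid(grid, "my_program", {"data.txt": "..."}, "out.txt")
print(result, polls, grid.files)
```

Six steps and a polling loop, for what is a single command line on the desktop.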

Compare that to just running the program on my own computer.

I have used MiG for a couple of projects now, but for day to day work, it is just too much of a hassle.

Does it have to be so hard?

Why shouldn’t it be just as easy to run programs on the grid as on the desktop computer?

I know, if I want a distributed system with all the bells and whistles, then it is a more complicated problem than writing single machine applications, but for the cases where I just want sufficient computer power to fire off a few independent computations in parallel, there shouldn’t be any problem.

There is, but there shouldn’t be!
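On a single machine, firing off independent computations really is this short in Python. The sketch uses the standard `concurrent.futures` module; `analyse` is a placeholder for one independent computation. A thread pool is used only to keep the example small (for CPU-bound work you would swap in `ProcessPoolExecutor`), and a grid-backed executor offering the same `map` interface is purely an assumption about what a friendlier grid could look like.

```python
from concurrent.futures import ThreadPoolExecutor


def analyse(parameter):
    """Placeholder for one independent computation."""
    return parameter * parameter


def run_all(executor, parameters):
    # executor.map returns results in the same order as the inputs.
    return list(executor.map(analyse, parameters))


with ThreadPoolExecutor(max_workers=4) as pool:
    results = run_all(pool, range(5))
print(results)  # [0, 1, 4, 9, 16]
```

Because `run_all` only relies on the executor interface, the calling code would not need to change if the pool were backed by remote machines instead of local threads.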

To access files, why should I need to up- and download? I should just mount a file system in some appropriate way, right?

To run a program on the grid, couldn’t I just distribute it to another node when loading the program?

It probably isn’t quite that easy, but by wrapping my programs in proxy executables, I should be able to achieve something very similar, at least. I’ve actually played with such a system for MiG — called MyMiG — so I know that at least something in that direction can be achieved. It just needs a bit more work (which is reasonable, since my solution took a weekend to cook up).
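A rough sketch of what such a proxy might do: take the command line it was invoked with and turn it into a job description. This is not how MyMiG is implemented; the function name, the dictionary keys and the rule "any argument that names an existing file is an input" are all assumptions made for illustration.

```python
import os


def job_from_command(argv, file_exists=os.path.exists):
    """Turn a command line into a job description (a plain dict)."""
    program, args = argv[0], list(argv[1:])
    return {
        # The command to run on the remote side, without the local path.
        "execute": " ".join([os.path.basename(program)] + args),
        # The local executable that has to be shipped with the job.
        "executable": program,
        # Assumed heuristic: arguments naming existing files are inputs to upload.
        "inputs": [a for a in args if file_exists(a)],
    }


# file_exists is overridden here so the example does not depend on the local disk.
job = job_from_command(["./analyse", "--iterations", "1000", "data.txt"],
                       file_exists=lambda path: path == "data.txt")
print(job)
```

The proxy would then hand this description to the grid and wait for the result, so that from the command line it looks like the program simply ran.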

I realize that more complex distributed applications will need more work, but with XML-RPC and SOAP and whatnot, it shouldn’t be that much of a problem to get there.

With a proper grid setup, I could get the computer resources I need, and my desktop computer would only be needed to interact with my programs, not run them. Actually, with a proper setup, I should be able to access my computations from any computer (desktop, laptop or even smart phone), everywhere.

Can we get there?

What will it take to get to that point? Do systems that work this way already exist? I’ve heard Xgrid mentioned, but do not really know anything about it. Does anyone know how it works?

Last week, Google announced that they would offer free storage of scientific data. Would it be too optimistic to think that within a year, some company would offer free grid computation resources? It doesn’t even have to be completely free; you could imagine a setup where you provide your “screensaver” CPU cycles, like SETI@home etc., in exchange for access to the grid for your own tasks. With an open platform for this, shouldn’t the open source community then be able to build a great interface to it?