Posts Tagged ‘scientific computing’

SPRINT: A new parallel framework for R

Friday, January 9th, 2009

There is a new framework for parallelising R code recently published in BMC Bioinformatics:

SPRINT: A new parelle framework for R

J. Hill et al. BMC Bioinformatics 2008, 9:558; doi:10.1186/1471-2105-9-558

Abstract

Background

Microarray analysis allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples. The analysis of the resulting data tests the limits of existing bioinformatics computing infrastructure. A solution to this issue is to use High Performance Computing (HPC) systems, which contain many processors and more memory than desktop computer systems. Many biostatisticians use R to process the data gleaned from microarray analysis and there is even a dedicated group of packages, Bioconductor, for this purpose. However, to exploit HPC systems, R must be able to utilise the multiple processors available on these systems. There are existing modules that enable R to use multiple processors, but these are either difficult to use for the HPC novice or cannot be used to solve certain classes of problems. A method of exploiting HPC systems, using R, but without recourse to mastering parallel programming paradigms is therefore necessary to analyse genomic data to its fullest.

Results

We have designed and built a prototype framework that allows the addition of parallelised functions to R to enable the easy exploitation of HPC systems. The Simple Parallel R INTerface (SPRINT) is a wrapper around such parallelised functions. Their use requires very little modification to existing sequential R scripts and no expertise in parallel computing. As an example we created a function that carries out the computation of a pairwise calculated correlation matrix. This performs well with SPRINT. When executed using SPRINT on an HPC resource of eight processors this computation reduces by more than three times the time R takes to complete it on one processor.

Conclusions

SPRINT allows the biostatistician to concentrate on the research problems rather than the computation, while still allowing exploitation of HPC systems. It is easy to use and with further development will become more useful as more functions are added to the framework.

It is a different approach than the R/parallel framework that I wrote about here and here, but aiming at solving the same problem: speeding up R by using parallel execution.

As I see it, there are two main differences between R/parallel and SPRINT.

First, R/parallel is thread-based — so it aims at speeding up the code on a single processor — while SPRINT is MPI based — so it aims at distributing computation on a cluster of machines.

The latter is more useful for very computationally intensive tasks, where a single processor, even with several cores, is unlikely to be fast enough.  The former is probably more useful for many day-to-day data analysis tasks where setting up a cluster is over-kill.  It is probably also going to be more important in the future where we expect a serious increase in the number of cores on each CPU.

The other difference between R/parallel and SPRINT is the way they provide parallelism to R.

In R/parallel you have a way of taking hot-spots of your R code and wrapping them in code that runs “loops” in parallel.

SPRINT, on the other hand, does not give you any means of parallelising your R code, as such, but provides an interface for calling parallel code, provided to SPRINT in some way.

The idea is that important CPU intensive functions can be portet to a parallel implementation and provided to the SPRINT framework, and the framework then enables R scripts to call these functions.

In many ways, it is similar to how you would take CPU intensive functions and port them to, say, C, and then call the C code from R.

So SPRINT doesn’t really give you an easy way to speed up your code through parallel execution — you will need to port some of it to get that — but if you are already comfortable with porting parts of your analysis to get a speed-up, it looks like a nice way of getting the extra “bang for the buck” you can get out of a cluster.


Jon Hill, Matthew Hambley, Thorsten Forster, Muriel Mewissen, Terence M Sloan, Florian Scharinger, Arthur Trew, Peter Ghazal (2008). SPRINT: A new parallel framework for R BMC Bioinformatics, 9 (1) DOI: 10.1186/1471-2105-9-558

9-8 = 1

R/parallel

Tuesday, September 30th, 2008

ResearchBlogging.org

There’s a paper that just got out in BMC Bioinformatics that I found an interesting read.

R/parallel – speeding up bioinformatics analysis with R
Gonzalo Vera, Ritsert C Jansen and Remo L Suppi

BMC Bioinformatics 2008, 9:390 doi:10.1186/1471-2105-9-390

Background

R is the preferred tool for statistical analysis of many bioinformaticians due in part to the increasing number of freely available analytical methods. Such methods can be quickly reused and adapted to each particular experiment. However, in experiments where large amounts of data are generated, for example using high-throughput screening devices, the processing time required to analyze data is often quite long. A solution to reduce the processing time is the use of parallel computing technologies. Because R does not support parallel computations, several tools have been developed to enable such technologies. However, these tools require multiple modications to the way R programs are usually written or run. Although these tools can finally speed up the calculations, the time, skills and additional resources required to use them are an obstacle for most bioinformaticians.

Results

We have designed and implemented an R add-on package, R/parallel, that extends R by adding user-friendly parallel computing capabilities. With R/parallel any bioinformatician can now easily automate the parallel execution of loops and benefit from the multicore processor power of today’s desktop computers. Using a single and simple function, R/parallel can be integrated directly with other existing R packages. With no need to change the implemented algorithms, the processing time can be approximately reduced N-fold, N being the number of available processor cores.

Conclusions

R/parallel saves bioinformaticians time in their daily tasks of analyzing experimental data. It achieves this objective on two fronts: first, by reducing development time of parallel programs by avoiding reimplementation of existing methods and second, by reducing processing time by speeding up computations on current desktop computers. Future work is focused on extending the envelope of R/parallel by interconnecting and aggregating the power of several computers, both existing office computers and computing clusters.

It concerns an extension module for R that helps parallelising code on multi-core machines.

This is an important issue.  Data analysis is getting relatively slower and slower, as the data size increases faster than the improvements in CPU speed, so exploiting the parallelism in modern CPUs will get increasingly important.

It is not quite straightforward to do this, however.  Writing concurrent programs is much harder than writing sequential programs, and that is hard enough as it is.

Problems with concurrent programs

The two major difficulties in programming is getting it right and making it fast and both are much more difficult with parallel programs.

When several threads are running concurrently, you need all kinds of synchronisation to ensure that the input of one calculation is actually available before you start computing and to prevent threads from corrupting data structures by updating them at the same time.

This synchronisation is not only hard to get right, it also carries an overhead that can be very hard to reason about.  Just like it is difficult to know which parts of a program is using most of the CPU time without profiling, it is difficult to know which parts of a concurrent program are the synchronisation bottlenecks.

Hide the parallelism as a low-level detail

Since concurrency is so hard to get right, we would like to hide it away as much as we can. Just like we hide assembler instructions away and program in high-level languages.

This is by no means easy to do, and the threading libraries in e.g. Python and Java are not even close to doing that.

In a language such as R or Matlab, you potentially have an easier way of achieving it. A lot of operations are “vectorized”, i.e. you have a high-level instruction for performing multiple operations on vectors or matrices.  Rather than multiplying all elements in a vector using a loop

> v <- c(1,2,3)
> w <- v
> for (i in 1:3) w[i] <- 2*w[i]
> w
[1] 2 4 6

you do the multiplication in a single operation

> 2*v
[1] 2 4 6

and rather than multiplying matrices explicitly in a triple loop,

> A <- matrix(c(1,2,3,4),nc=2) ; B <- matrix(c(4,3,2,1),nc=2)
> C <- matrix(0,nc=2,nr=2)
> for (i in 1:2) for (j in 1:2) for (k in 1:2) C[i,j] <- C[i,j]+A[i,k]*B[k,j]
> C
     [,1] [,2]
[1,]   13    5
[2,]   20    8

you have a single operation for it.

> A %*% B
     [,1] [,2]
[1,]   13    5
[2,]   20    8

It is almost trivial to parallelise such operations, and if your program consists of a lot of such high-level operations, much program parallelisation can be automated.

R/parallel

This is the reasoning behind the BMC Bioinformatics paper.

The are not exactly doing what I describe above — that would be the “right” thing to do, but would require changes to the R system that you cannot directly do through an add-on package — but they provide an operation for replacing sequential loops with a parallel version.

Just replacing a sequential loop with parallel execution, assuming that the operations are independent, is always safe. The behaviour of the sequential and the parallel program is exactly the same, except for the execution time.

As such there is no extra mental overhead in introducing it to your programs.

Using it won’t necessarily speed up the program, of course. Even if the synchronisation is hidden from the programmer, the overhead is still there.

The authors leave it to the programmer to know when the parallel execution pays off (and profiling should tell him so).

It is quite likely that a sequence of small fast tasks is parallelized and, despite parallel execution, as a result of the transformation process and additional synchronization, the overall processing time can be increased. To avoid this situation, the design decision made is to let the users indicate which code regions (i.e. loops) they need to speed up.

Knowing what to run in parallel and what not to is a hard problem.  It will often depend on the data as well, if nothing else the data size.

A modern virtual machine would be able to profile the execution of the program and make a decision based on that.  Assuming it is operations that are executed more than once.

I don’t know if any virtual machines are actually doing this, though. I really have no idea how much automatic parallelisation is part of the back end of virtual machines.

I would love to know, though.  This is the way to go, to get the speedups of parallelisation without too high a mental burden for the scientist/programmer.


Gonzalo Vera, Ritsert C Jansen, Remo L Suppi (2008). R/parallel – speeding up bioinformatics analysis with R BMC Bioinformatics, 9 (390)

Google cluster computing

Tuesday, February 26th, 2008

Google, together with the National Science Foundation (NFS; National here is the US) — possibly IBM as well, it isn’t quite clear from the press release — will provide cluster computing to researchers.

This YouTube video describes a Google + IBM project that now looks like it’s only a pilot for a larger one:

 

Yeah! Let’s have more of that, but remember to make it easy to use for scientists. Integrate cluster computers with the desktop!

The video mentions integration with Eclipse — does anyone know more about this?

Let’s kill desktop computing

Tuesday, January 22nd, 2008

I don’t want to get rid of the desktop computer. I like it. It is a nice interface for communicating with my computation tasks, not to mention messaging, emailing, blogging, image editing etc. It is just that I don’t want my desktop computer to run most of my computation tasks. The desktop computer should be for interacting with computations, but there is really no reason why it should also carry out those computations.

A typical situation for me is that either I have very light computational requirements — what is needed for text processing or maybe compiling a TeX document — or I need a lot of computer power — when I am running my data analysis in my research. I don’t think that this is atypical for scientists, at least it is a situation I share with most of the people at BiRC.

My desktop computers are not powerful enough to deal with my scientific computing — well, they are, but it takes ages to run on them — but they have plenty of power to spare when I am just doing “office work”.

Grid computing

The solution has been around for years and is called grid computing. Even way back when I was teaching networks and distributed systems at the computer science department, we would cover the ideas behind grid computing (back then it was really just client/server architectures, RPC and later RMI, distributed file systems etc., but the ideas that are now called grid computing were around).

What we really want is to connect all the computers on the net into one big honking system where we can get the computer power we need, when we need it, from all those machines that are idle anyway. On the rare occasions where we need a super computer, we want to be connected to that as well. Of course, we do not want to pay for having a whole network of computers just standing around waiting for us to need them — much less having a super computer sitting idle waiting for us — but when we need the computer power, we want to be connected to it.

SUN tried to sell the idea with the slogan the network is the computer. I don’t really know how well that went, but I haven’t heard the slogan for years, so that successful it can’t have been.

The grid is such a great idea, so why isn’t it widespread already? Why am I still using my personal desktop computer to run my computations?

Personal experiences

I’ve had a bit experience with grid computing myself. While developing GeneRecon, I needed a lot of computers to test the software — pretty time consuming in itself and I needed to explore a large parameter space — so I got access to NorduGrid. It was a horrible experience. Setting up the grid to run my own software was such a hassle and never really worth the (limited) CPU cycles I got out of it.

Then I got access to the new Minimal intrusion Grid (MiG) developed by Brian Vinter’s group. That was an improvement over NorduGrid, and good enough to finish the GeneRecon experiments. See

Experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
Future Generation Computer Systems 2007 23 580–586. doi:10.1016/j.future.2006.09.003.

for details.

It was an improvement, but it wasn’t a great experience.

Running programs on MiG requires a lot of extra work. First input files must be uploaded to the grid. Also the executable for the program, if I haven’t uploaded it already (and I’m ignoring problems with figuring out where the executable can actually be executed and such). The the job must be specified through a configuration language and submitted to the grid. When the job is executing I have to poll it from time to time to get its status. When done, I have to download the output files and clean up after the job.

Compare that to just running the program on my own computer.

I have used MiG for a couple of projects now, but for day to day work, it is just too much of a hassle.

Does it have to be so hard?

Why shouldn’t it be just as easy to run programs on the grid as on the desktop computer?

I know, if I want a distributed system with all the bells and whistles, then it is a more complicated problem than writing single machine applications, but for the cases where I just want sufficient computer power to fire off a few independent computations in parallel, there shouldn’t be any problem.

There is, but there shouldn’t be!

To access files, why should I need to up- and download? I should just mount a file system in some appropriate way, right?

To run a program on the grid, couldn’t I just distribute it to another node when loading the program?

It probably isn’t quite that easy, but by wrapping my programs in proxy executables, I should be able to achieve something very similar, at least. I’ve actually played with such a system for MiG — called MyMiG — so I know that at least something in that direction can be achieved. It just needs a bit more work (which is reasonable, since my solution took a weekend to cook up).

I realize that more complex distributed applications will need more work, but with XML-RPC and SOAP and whatnot, it shouldn’t be that much of a problem to get there.

With a proper grid setup, I could get the computer resources I need, and my desktop computer would only be needed to interact with my programs, not run them. Actually, with a proper setup, I should be able to access my computations from any computer — desktop, laptop or even smart phone — everwhere.

Can we get there?

What will it take to get to that point? Does there already exist systems out there that works this way? I’ve heard Xgrid mentioned, but do not really know anything about it, does anyone know how it works?

Last week, google annouced that they would offer free storage of scientific data. Would it be too optimistic to think that within a year, some company would offer free grid computation resources? It doesn’t even have to be completely free, you could imagine a setup where you provide your “screensaver” CPU cycles — like seti@home etc. — for access to the grid for your own tasks. With an open platform for this, shouldn’t the open source community then be able to build a great interface to it?