SPRINT: A new parallel framework for R

There is a new framework for parallelising R code recently published in BMC Bioinformatics:

SPRINT: A new parallel framework for R

J. Hill et al. BMC Bioinformatics 2008, 9:558; doi:10.1186/1471-2105-9-558

Abstract

Background

Microarray analysis allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples. The analysis of the resulting data tests the limits of existing bioinformatics computing infrastructure. A solution to this issue is to use High Performance Computing (HPC) systems, which contain many processors and more memory than desktop computer systems. Many biostatisticians use R to process the data gleaned from microarray analysis and there is even a dedicated group of packages, Bioconductor, for this purpose. However, to exploit HPC systems, R must be able to utilise the multiple processors available on these systems. There are existing modules that enable R to use multiple processors, but these are either difficult to use for the HPC novice or cannot be used to solve certain classes of problems. A method of exploiting HPC systems, using R, but without recourse to mastering parallel programming paradigms is therefore necessary to analyse genomic data to its fullest.

Results

We have designed and built a prototype framework that allows the addition of parallelised functions to R to enable the easy exploitation of HPC systems. The Simple Parallel R INTerface (SPRINT) is a wrapper around such parallelised functions. Their use requires very little modification to existing sequential R scripts and no expertise in parallel computing. As an example we created a function that carries out the computation of a pairwise calculated correlation matrix. This performs well with SPRINT. When executed using SPRINT on an HPC resource of eight processors this computation reduces by more than three times the time R takes to complete it on one processor.

Conclusions

SPRINT allows the biostatistician to concentrate on the research problems rather than the computation, while still allowing exploitation of HPC systems. It is easy to use and with further development will become more useful as more functions are added to the framework.

It takes a different approach from the R/parallel framework that I wrote about here and here, but it aims to solve the same problem: speeding up R through parallel execution.

As I see it, there are two main differences between R/parallel and SPRINT.

First, R/parallel is thread-based, so it aims at speeding up the code on a single processor, while SPRINT is MPI-based, so it aims at distributing computation across a cluster of machines.

The latter is more useful for very computationally intensive tasks, where a single processor, even with several cores, is unlikely to be fast enough. The former is probably more useful for many day-to-day data analysis tasks where setting up a cluster is overkill. It is probably also going to become more important in the future, when we expect a serious increase in the number of cores on each CPU.

The other difference between R/parallel and SPRINT is the way they provide parallelism to R.

In R/parallel you have a way of taking hot-spots of your R code and wrapping them in code that runs “loops” in parallel.
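To make the idea concrete, here is a rough sketch of the loop-wrapping pattern, written in Python rather than R. It is not R/parallel's actual interface; the names `slow_step`, `sequential_loop` and `parallel_loop` are made up for illustration, and the thread pool is just a stand-in for whatever R/parallel does internally.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_step(x):
    """Hypothetical stand-in for the expensive body of a loop."""
    return x * x


def sequential_loop(values):
    # The original hot-spot: one iteration after another.
    return [slow_step(v) for v in values]


def parallel_loop(values, workers=4):
    # The same loop, with the iterations handed to a pool of threads.
    # pool.map() returns results in input order, so the output matches
    # the sequential version as long as the iterations are independent.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_step, values))
```

The point is only the shape of the change: the loop body stays the same, and the wrapper decides how the iterations are scheduled. (In CPython, threads will not actually speed up pure-Python arithmetic like this; the sketch shows the pattern, not a benchmark.)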

SPRINT, on the other hand, does not give you any means of parallelising your R code as such, but provides an interface for calling parallel code that has been supplied to SPRINT separately.

The idea is that important CPU-intensive functions can be ported to a parallel implementation and provided to the SPRINT framework, and the framework then enables R scripts to call these functions.

In many ways, it is similar to how you would take CPU intensive functions and port them to, say, C, and then call the C code from R.
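As a rough illustration of that division of labour, here is a Python sketch (not SPRINT's real API) of a pairwise correlation function whose parallelism is hidden from the caller. The function name `parallel_correlation` and its helpers are hypothetical, and a thread pool stands in for the MPI processes SPRINT would use on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def _pearson(x, y):
    # Pearson correlation of two equal-length sequences.
    # Assumes neither sequence is constant (non-zero variance).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def _correlation_row(args):
    # One unit of work: correlate row i against every row.
    i, rows = args
    return [_pearson(rows[i], other) for other in rows]


def parallel_correlation(rows, workers=4):
    """Hypothetical 'ported' function: the caller sees an ordinary call,
    while the rows of the result are computed by a pool of workers."""
    tasks = [(i, rows) for i in range(len(rows))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_correlation_row, tasks))


# The calling script only changes which function it calls.
data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
matrix = parallel_correlation(data)
```

Someone has to write and maintain `parallel_correlation`, but the analysis script that uses it needs no knowledge of how the work is distributed.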

So SPRINT doesn’t really give you an easy way to speed up your code through parallel execution — you will need to port some of it to get that — but if you are already comfortable with porting parts of your analysis to get a speed-up, it looks like a nice way of getting the extra “bang for the buck” you can get out of a cluster.


Jon Hill, Matthew Hambley, Thorsten Forster, Muriel Mewissen, Terence M Sloan, Florian Scharinger, Arthur Trew, Peter Ghazal (2008). SPRINT: A new parallel framework for R. BMC Bioinformatics, 9 (1). DOI: 10.1186/1471-2105-9-558

9-8 = 1

More PSB

If you are not familiar with PSB (the conference I mentioned in my previous post), check out this post by Russ Altman, one of the PSB organisers.

It is a conference I strongly recommend that you attend, if you have the chance.

Even though my last attendance was a bit of a catastrophe… I’ll tell you about it later…

9-7 = 2