A paper just came out in BMC Bioinformatics that I found an interesting read:
R/parallel - speeding up bioinformatics analysis with R
Gonzalo Vera, Ritsert C Jansen and Remo L Suppi
BMC Bioinformatics 2008, 9:390 doi:10.1186/1471-2105-9-390
R is the preferred tool for statistical analysis of many bioinformaticians due in part to the increasing number of freely available analytical methods. Such methods can be quickly reused and adapted to each particular experiment. However, in experiments where large amounts of data are generated, for example using high-throughput screening devices, the processing time required to analyze data is often quite long. A solution to reduce the processing time is the use of parallel computing technologies. Because R does not support parallel computations, several tools have been developed to enable such technologies. However, these tools require multiple modifications to the way R programs are usually written or run. Although these tools can finally speed up the calculations, the time, skills and additional resources required to use them are an obstacle for most bioinformaticians.
We have designed and implemented an R add-on package, R/parallel, that extends R by adding user-friendly parallel computing capabilities. With R/parallel any bioinformatician can now easily automate the parallel execution of loops and benefit from the multicore processor power of today's desktop computers. Using a single and simple function, R/parallel can be integrated directly with other existing R packages. With no need to change the implemented algorithms, the processing time can be approximately reduced N-fold, N being the number of available processor cores.
R/parallel saves bioinformaticians time in their daily tasks of analyzing experimental data. It achieves this objective on two fronts: first, by reducing development time of parallel programs by avoiding reimplementation of existing methods and second, by reducing processing time by speeding up computations on current desktop computers. Future work is focused on extending the envelope of R/parallel by interconnecting and aggregating the power of several computers, both existing office computers and computing clusters.
The paper describes an add-on package for R that helps parallelise code on multi-core machines.
This is an important issue. Data analysis is getting relatively slower and slower, as data sizes increase faster than CPU speeds improve, so exploiting the parallelism in modern CPUs will become increasingly important.
It is not quite straightforward to do this, however. Writing concurrent programs is much harder than writing sequential programs, and that is hard enough as it is.
Problems with concurrent programs
The two major difficulties in programming are getting it right and making it fast, and both are much more difficult with parallel programs.
When several threads are running concurrently, you need all kinds of synchronisation to ensure that the input of one calculation is actually available before you start computing and to prevent threads from corrupting data structures by updating them at the same time.
This synchronisation is not only hard to get right, it also carries an overhead that can be very hard to reason about. Just as it is difficult to know which parts of a program are using most of the CPU time without profiling, it is difficult to know which parts of a concurrent program are the synchronisation bottlenecks.
Hide the parallelism as a low-level detail
Since concurrency is so hard to get right, we would like to hide it away as much as we can. Just like we hide assembler instructions away and program in high-level languages.
This is by no means easy to do, and the threading libraries in e.g. Python and Java are not even close to doing that.
In a language such as R or Matlab, you potentially have an easier way of achieving it. A lot of operations are "vectorized", i.e. you have a high-level instruction for performing multiple operations on vectors or matrices. Rather than multiplying all elements in a vector using a loop
> v <- c(1,2,3)
> w <- v
> for (i in 1:3) w[i] <- 2*w[i]
> w
[1] 2 4 6
you do the multiplication in a single operation
> 2*v
[1] 2 4 6
and rather than multiplying matrices explicitly in a triple loop,
> A <- matrix(c(1,2,3,4),nc=2) ; B <- matrix(c(4,3,2,1),nc=2)
> C <- matrix(0,nc=2,nr=2)
> for (i in 1:2) for (j in 1:2) for (k in 1:2) C[i,j] <- C[i,j]+A[i,k]*B[k,j]
> C
     [,1] [,2]
[1,]   13    5
[2,]   20    8
you have a single operation for it.
> A %*% B
     [,1] [,2]
[1,]   13    5
[2,]   20    8
It is almost trivial to parallelise such operations, and if your program consists of a lot of such high-level operations, much program parallelisation can be automated.
This is the reasoning behind the BMC Bioinformatics paper.
They are not exactly doing what I describe above -- that would be the "right" thing to do, but it would require changes to the R system that an add-on package cannot make directly -- instead they provide an operation for replacing sequential loops with a parallel version.
Just replacing a sequential loop with parallel execution, assuming that the operations are independent, is always safe. The behaviour of the sequential and the parallel program is exactly the same, except for the execution time.
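To illustrate the idea of loop replacement -- not the R/parallel API itself, which I won't reproduce from memory -- here is a sketch using base R's parallel package (which ships with modern R), where a sequential lapply over independent iterations is swapped for its parallel counterpart:

```r
# Sketch: replacing an independent sequential loop with a parallel one.
# Uses base R's 'parallel' package for illustration; R/parallel wraps
# the same idea behind its own function.
library(parallel)

slow_task <- function(i) {     # stand-in for an expensive, independent step
  Sys.sleep(0.1)
  i^2
}

# Sequential version:
res_seq <- lapply(1:8, slow_task)

# Parallel version: same results, iterations spread across cores.
# (mc.cores > 1 is ignored on Windows, where mclapply runs sequentially.)
res_par <- mclapply(1:8, slow_task, mc.cores = 2)

identical(res_seq, res_par)    # the behaviour is unchanged
```

Because each iteration is independent, the two versions compute exactly the same result; only the execution time differs.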
As such there is no extra mental overhead in introducing it to your programs.
Using it won't necessarily speed up the program, of course. Even if the synchronisation is hidden from the programmer, the overhead is still there.
The authors leave it to the programmer to know when parallel execution pays off (and profiling should tell them).
It is quite likely that a sequence of small fast tasks is parallelized and, despite parallel execution, as a result of the transformation process and additional synchronization, the overall processing time can be increased. To avoid this situation, the design decision made is to let the users indicate which code regions (i.e. loops) they need to speed up.
Knowing what to run in parallel and what not to is a hard problem. It will often depend on the data as well, if nothing else the data size.
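A quick way to see this overhead for yourself -- again using base R's parallel package as a stand-in -- is to time a loop of very cheap tasks both ways; for tiny per-iteration work, the cost of forking workers and shipping results back often eats any gain from the extra cores:

```r
# Sketch: when the per-iteration work is trivial, parallel overhead
# (forking, serialising results) can dominate the total running time.
library(parallel)

tiny <- function(i) i + 1      # far too cheap to be worth distributing

system.time(lapply(1:100000, tiny))                 # sequential
system.time(mclapply(1:100000, tiny, mc.cores = 2)) # parallel, often no faster
```

Whether the parallel version wins depends on the machine, the task size and the data size, which is exactly why the decision is hard to automate.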
A modern virtual machine could profile the execution of the program and make that decision based on the measurements, at least for operations that are executed more than once.
I don't know whether any virtual machines actually do this, though; I really have no idea how much automatic parallelisation goes on in the back ends of virtual machines.
I would love to know, though. This is the way to go, to get the speedups of parallelisation without too high a mental burden for the scientist/programmer.
Gonzalo Vera, Ritsert C Jansen, Remo L Suppi (2008). R/parallel - speeding up bioinformatics analysis with R. BMC Bioinformatics, 9:390. doi:10.1186/1471-2105-9-390