Posts Tagged ‘parallel computing’

Multi-core HMMs

Thursday, June 11th, 2009

I’m working on a project where it looks like I could end up with hidden Markov models with hundreds if not thousands of states, and with transition matrices where all entries are non-zero.  Since, for such models, the running time for just calculating the likelihood (not even optimising it) is O(m\cdot n^2) where m is the length of the input and n is the number of states, this could be a problem.  Especially since I’m working on whole genome data, so m can get pretty large as well.

It is part of our CoalHMM work, but I’ll tell you more about that later.  For this post, the actual application isn’t all that important, only the size of the model.

So, anyway, needless to say I need a fast implementation if this is ever going to work.

HMM algorithms

The basic hidden Markov model algorithms are essentially just filling in dynamic programming tables.

The “Forward” algorithm, for example, looks something like

for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        }
        F[i,j] = sum;
    }
}

There’s a few more details in a real implementation, ’cause this version will have underflows in a few steps, but this is the essential idea.

SIMD speedups

We have developed a small library with these algorithms here at BiRC (mainly implemented by one of my student programmers, Andreas) that speeds up the algorithms using single instruction, multiple data (SIMD) instructions.

The idea here is that we can calculate the sums as above by adding several entries in parallel.  For doubles we can handle two numbers in parallel, for floats we can handle four, so ideally we can speed up the algorithms by a factor of two or four.

We don’t exactly achieve that – there is some overhead involved – but it is not too bad.  The figure below compares the running time with and without SIMD for varying number of states, n, with a fixed input length m=10,000.

The number type – float or double – doesn’t matter for the plain algorithm since the same number of operations is used, so I’ve only plotted the time for doubles.

This is all good and well, but with really large HMMs I think we will need even more of a boost.

Multi-core speedups

It is not hard to get your hands on a multi-core machine these days – if anything it is hard to get a single core machine.  Not massively multi-core machines, although I expect to see those in a few years, but at least dual or quad core machines. So I was hoping to get another factor of two or four from running two or four threads.

In general, parallelising the instructions in the program is harder than parallelising the data operations as we do with the SIMD.  For one thing, there is a lot of synchronisation issues when running multiple threads, and this leads to a lot of overhead.

When filling in the dynamic programming table as above, each column in the table depends on the previous column, but each cell in the column is independent from the other cells in the column.  Thus, we should be able to process each cell in a column independently, but synchronise the threads after each column so we ensure that the previous column is always fully processed.

Since the synchronisation isn’t free, we expect that we need a large number of rows in the table to make it worthwhile, but in my project I expect hundreds to thousands of states, so perhaps there is some gain to it.

I decided to try it out, anyway.

Setting up threads and synchronising them and all that used to require some programming effort, but now that we have OpenMP it is very easy to add parallelism to a program.  To parallelise the algorithm above, you simply add a pragma:

for (int i = 0; i < m; ++i) {
#pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        }
        F[i,j] = sum;
    }
}

and by magic the middle loop is parallelised so the body of it is run in different threads.  Setting up the threads, assigning ranges of the sum to the different threads, synchronising them after the loop, etc. is all taken care of by the compiler.

Neat.

The results are plotted below (on my dual core iMac):

As is evident, there is a large overhead in the threaded version, so for anything below a few hundred states it is slowing the algorithm down instead of speeding it up, but eventually the parallelism does kick in.

Well, for the version without SIMD at least, which is the one I’ve plotted above.

For double SIMD, it is just worth it for the large number of states

while for floats it isn’t really worthwhile for this range of number of states

Of course, this is just the very first attempt at parallelising the algorithms, so there might still be some tricks to try.  For one thing, I have a feeling I can get rid of some of the thread overhead by adding a pragma outside the outer loop. I recall seeing something like that before but I don’t remember quite how it worked.

Anyway, I’ll play with it some more later.

162-165=-3

Making the most of multicore

Saturday, February 21st, 2009

I have blogged a bit about tools for parallelising R code in the past, and back in October I was interviewed for Genomeweb about my thoughts on parallel computing and R/parallel.

I completely missed the piece when it came out, though, but I stumpled upon it today: Making the most of multicore. (login required)

Go read it!

52-73=-21

SPRINT: A new parallel framework for R

Friday, January 9th, 2009

There is a new framework for parallelising R code recently published in BMC Bioinformatics:

SPRINT: A new parelle framework for R

J. Hill et al. BMC Bioinformatics 2008, 9:558; doi:10.1186/1471-2105-9-558

Abstract

Background

Microarray analysis allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples. The analysis of the resulting data tests the limits of existing bioinformatics computing infrastructure. A solution to this issue is to use High Performance Computing (HPC) systems, which contain many processors and more memory than desktop computer systems. Many biostatisticians use R to process the data gleaned from microarray analysis and there is even a dedicated group of packages, Bioconductor, for this purpose. However, to exploit HPC systems, R must be able to utilise the multiple processors available on these systems. There are existing modules that enable R to use multiple processors, but these are either difficult to use for the HPC novice or cannot be used to solve certain classes of problems. A method of exploiting HPC systems, using R, but without recourse to mastering parallel programming paradigms is therefore necessary to analyse genomic data to its fullest.

Results

We have designed and built a prototype framework that allows the addition of parallelised functions to R to enable the easy exploitation of HPC systems. The Simple Parallel R INTerface (SPRINT) is a wrapper around such parallelised functions. Their use requires very little modification to existing sequential R scripts and no expertise in parallel computing. As an example we created a function that carries out the computation of a pairwise calculated correlation matrix. This performs well with SPRINT. When executed using SPRINT on an HPC resource of eight processors this computation reduces by more than three times the time R takes to complete it on one processor.

Conclusions

SPRINT allows the biostatistician to concentrate on the research problems rather than the computation, while still allowing exploitation of HPC systems. It is easy to use and with further development will become more useful as more functions are added to the framework.

It is a different approach than the R/parallel framework that I wrote about here and here, but aiming at solving the same problem: speeding up R by using parallel execution.

As I see it, there are two main differences between R/parallel and SPRINT.

First, R/parallel is thread-based — so it aims at speeding up the code on a single processor — while SPRINT is MPI based — so it aims at distributing computation on a cluster of machines.

The latter is more useful for very computationally intensive tasks, where a single processor, even with several cores, is unlikely to be fast enough.  The former is probably more useful for many day-to-day data analysis tasks where setting up a cluster is over-kill.  It is probably also going to be more important in the future where we expect a serious increase in the number of cores on each CPU.

The other difference between R/parallel and SPRINT is the way they provide parallelism to R.

In R/parallel you have a way of taking hot-spots of your R code and wrapping them in code that runs “loops” in parallel.

SPRINT, on the other hand, does not give you any means of parallelising your R code, as such, but provides an interface for calling parallel code, provided to SPRINT in some way.

The idea is that important CPU intensive functions can be portet to a parallel implementation and provided to the SPRINT framework, and the framework then enables R scripts to call these functions.

In many ways, it is similar to how you would take CPU intensive functions and port them to, say, C, and then call the C code from R.

So SPRINT doesn’t really give you an easy way to speed up your code through parallel execution — you will need to port some of it to get that — but if you are already comfortable with porting parts of your analysis to get a speed-up, it looks like a nice way of getting the extra “bang for the buck” you can get out of a cluster.


Jon Hill, Matthew Hambley, Thorsten Forster, Muriel Mewissen, Terence M Sloan, Florian Scharinger, Arthur Trew, Peter Ghazal (2008). SPRINT: A new parallel framework for R BMC Bioinformatics, 9 (1) DOI: 10.1186/1471-2105-9-558

9-8 = 1

Genome Technology interview

Wednesday, October 15th, 2008

Last week I gave an interview to Genome Technology about R/parallel (that I’ve blogged about a few days earlier).

I don’t know if the article about R/parallel is out yet — I haven’t seen it — but below you can see the questions I got and the answers I gave…

The interview

When it comes to parallelizing software such as R, what are the inherent challenges beyond the average bench biologists?

There is a lot of parallelism going on in modern hardware, most of which you never worry about.  The compilers and CPUs take care of it for you. This has to do with how data and program instructions float around on the silicon, and usually not something you want to know about unless you develop hardware or compilers. When you do notice it, what you usually notice is how data from main RAM is vastly slower to access than data in the cache.  If you are careful with your data structures you can avoid this problem in most cases by just following a few rules of thumb (and all that means is that you should try to keep data that is accessed together, close to each other in memory address).

In languages like C++ you will also notice how virtual functions are much slower to call than non-virtual functions.  This is also a consequence of parallelism.  When you are going to call a function — which means that instructions now needs to be read from a different place in memory — the CPU can see where you are going and start fetching the instructions.  With virtual functions, you don’t know where you are going until you’ve computed exactly where you want to execute (this is called virtual table lookup, but it just means that you are computing a point in the instructions to jump to, rather than have that point explicit in the instructions).  In that case, the CPU cannot start fetching the instructions, so there will be a delay in the function call.

But so far I am not really answering the question; I’m just warming up to that.

I just wanted to make it clear that parallel processes is not new to computing and doesn’t need to be something we have to worry about. Except that we do have to worry about it right now, until systems developers can hide it away under layers of abstraction — just like R/parallel is trying to do.

Why do we have to worry about it now?

To make CPUs faster it is no longer sufficient to just add more and more transistors, closer and closer together, on a chip.  There is a physical limit to how much longer we can do that.

If we cannot increase the speed of the individual processors, then at least we can do more in parallel, but now the software needs to help the hardware.  The parallelisation needs to be dealt with in the
software layer.

On your desktop computer, you are seeing this in several places.

  1. The CPU has instructions, so called SIMD, for performing the same operation on multiple data in parallel.
  2. The CPU has multiple cores (it is hard to find one now with less than two and soon that will be four)
  3. Your graphics card is a very powerful computer in itself (the processor there is called a GPU, and is a bit harder to program than a CPU, but that will change over time).

I’m not going to comment on 3. any further.  Using GPUs for scientific computing is very hot these days, but is probably best left to the computer scientists for now.

The parallelisation in 1. is relatively easy to work with as a programmer.  There are some hardware considerations to deal with, like how data should be formatted in memory, but it is not conceptually hard to deal with.

Even better, it can be automatically applied to a large degree in languages like R.  In C/C++ for example, you don’t have high-level constructs to perform an operation on all elements in a vector (which is the kind of operations ideal for this kind of parallelisation).  In R you do, so whereas it can be hard for a C/C++ compiler to automatically use this kind of parallelisation, in R it would be
almost trivial to apply it.  This is what I wrote about in my blog post.

The parallelisation in 2. is more difficult to deal with.

Even if computations have been running on highly parallelised hardware for a while, conceptually the programs have been “single threaded”. At every point of time in the computation, conceptually a single instruction is being executed, and the state of the program is deterministically determined by the input and the instructions executed so far.

With multiple cores, the program can run several threads in parallel, and the program is now conceptually parallelised.

Multiple threads are not new to computer science.  We have had them as a concept for decades.  Even when the actual processor could only execute a single instruction at a time, it would sometimes be beneficial to think of the program as running in parallel.  And for various reasons that i wont go into, you could also get significant speed improvements, but for different reasons than we typically think about in scientific computations.

Now, with multiple cores, you get a significant runtime benefit for your computations if you use multiple threads.  If you have four cores on your CPU, then you can, under ideal conditions, run your programs four times faster than you could on a single core.

Dealing with this kind of parallelism is very hard, though (and now I am finally getting to the actual question).

The problem, as I see it, is that our brains just find it hard to think about parallelism.  I know, it sounds weird, ’cause in the real world everything happens concurrently and we deal with it.

On the other hand, people who wouldn’t recognise a differential equation if it sat on their lap, can still catch a ball, so clearly there are some things we find easy to deal with but hard to reason about, and concurrency is one of these things.

The problem is, that when the program solves several tasks at the same time, we need a way of coordinate these tasks.  We need to have the input available before we start processing, and deliver the output when it is needed.

This is something we know from our own life, but on the computer there is an added complexity:  the program is not smart enough to know that it doesn’t have the data it needs, and will happily process garbage input if the real input isn’t available.  Likewise, it will not think twice about overwriting important current data with bogus outdated data, if it gets behind the main program.

We have to explicitly program how the communication and synchronisation rules should be, and our brains are pretty bad at reasoning about this.

We forget about important situations where synchronisation is necessary and we sometimes add rules that lets each thread wait for some other thread to complete, so no thread can actually continue its work and the whole system is deadlocked.

There are rules you can follow to avoid these problems, but even very experienced people find it hard and make mistakes.

On a side note, a large field in theoretical computer science — concurrency theory — works on ways to deal with this problem: rules for constructing programs so the errors are avoiding, or methods for
analysing models of systems to prove that they have the intended behaviour.  (My PhD work was on the later).

Unfortunately, most of theoretical work is far from being usable in real programming situations.  So in practise we are relying on rules of thumbs and experience, and neither works that well.

Of course, all the problems with multi-threaded programs just get much, much harder on multi-computer systems (clusters or networks of services).  There the synchronisation is even worse, and on top of that individual computers can crash and the system must be able to
deal with that.

Leave that to the computer scientists for now…

If you are not experienced in parallel programming, and your interests are in, say, biology and not computer science, you will probably find dealing with concurrency issues a pain in the neck.  You just want the computer to crunch your numbers with the equations you’ve given it, and really you shouldn’t have to worry about how it does it.

Programming languages do not yet hide this kind of parallelism.  We have high-level constructions to describe our mathematics (to varying degrees), but when it comes to parallel execution, we are just not there yet.

R/parallel is a small step in that direction.  It gives you a language construction for executing what would otherwise be a sequential loop, in parallel.  It then deals with distributing the tasks to threads and synchronising them, so you, as the programmer, won’t have to worry about it.

This idea is not new.  There has been programming languages with such constructions for ages.  It just hasn’t made it into the main-stream languages.

How should one go about deciding what parts of code to make parallel and what parts to leave alone? As you say, the BMC paper leaves it up the researcher to decide, but do you think this might be beyond the average user and is better left to parallel programming experts?

To be absolutely honest, even the experts probably wouldn’t know where to parallelise just by looking at a program.

Just like it is notoriously hard to reason about where the bottlenecks are in a sequential programs, and how to get around them, it is hard to reason about where parallelisation will give you a boost.

The good news is that it isn’t that hard to figure it out.  All you have to do is profile your program, and then you know the hotspots you need to focus on.

Trial and error will show you where you get a performance improvement by parallelising (and with R/parallel there isn’t much work involved in doing this testing, compared to programming it up manually).

The experts have done this many times before and might have a better intuition about where to try parallelisation first, but really I would say that knowing the problem the program is solving — and thereby knowing where the real number crunching is going on — is just as helpful.

This is not where we want to invoke computer scientists.  If they just give us to tools to experiment without much of an overhead, then we’re fine.

You mention there are some limitations with parallelizing R through an add-on package such as the one described in the paper vs. what you describe on your post, can you explain the difference in layman’s terms?

An add-on package can give you a new abstraction, like in R/parallel, that lets you tell R to “do this stuff here in parallel rather than in sequence”.

It is a lot easier to use such an abstraction than to program the parallelisation yourself, but it still leaves it up to you to worry about parallelisation.

Ideally, you just want to tell the computer which equations to solve, and not have to worry about how it does it.

Although you might not think this when you are programming, you are a lot closer to this ideal than you could be.  You might think that you have to tell the computer exactly how to solve a problem, because you have to worry about loops and exceptions and whatnot, but you are actually very far removed from the actual computations.

With high-level languages, you tell the program to perform some operation (say exponentiating) on all elements in a list.  The operations that the computer actually have to do are at a much lower
level.

To take the same example in the context of parallelising a program, you should be able to tell the computer to exponentiate all elements in a list and not have to worry about whether it needs to do
it in parallel or in sequence.

You want to be able to just write

> result <- exp(v)

and not worry about whether you want

> result <- exp(v)

or

> result <- run.parallel(exp,v)

Of course, if we need all the trial and error to figure out when parallelisation is worthwhile, can we trust the computer to make the decision for us?

If you just compile the program and have to make the decision without knowing about performance hotspots, then no.

But here’s the thing: when we are running the program, we can profile it, and then we know the hotspots, so we can at that point replace the sequential execution with parallel execution where it will actually improve performance.

Modern virtual machines already do something similar for code generation.  You might interpret byte-code in your virtual machine, but the performance-critical parts will quickly be recognised and compiled into machine code, so those parts are executed much faster than they would otherwise be.

The virtual machines does much more than that, though: they recognise the common cases in your methods and optimise for them.  Remember the virtual functions I mentioned above? Since there is a high penalty with those, a virtual machine can remove them by generating code where the function is not looked up in a table but directly in the code. When the virtual machine recognises that it is running the common-case scenario, it runs that generated code (with an efficient function call), and otherwise the generic code (which is slower, but won’t be called that often).

To put it simply: when we are running our programs we know where the bottlenecks are, and can optimise them.  This requires that the virtual machine does the profiling and optimisation, though, and is not something you can just add onto it with a library.

Do you think that some of the currently available commercial parallel tool kits, such as he ones offered by Interactive Supercomputing and Revolution Computing, which both offer  ways to parallelize R, offer something more powerful or robust than what’s being described in R/parallel paper?

I am not familiar with those tool kits, so I don’t know.  What is offered in R/parallel is pretty basic.  It is the first small step in the right direction.  If you have money and time to throw at it, it shouldn’t be much of a problem to improve it a lot, so I wouldn’t be surprised if commercial packages are better.

They won’t necessarily stay better, though.  If an Open Source project for parallelising R really takes off (like R itself has done), then there is a lot of motivated programmers working on it, and that is
hard to keep up with as a company.

R/parallel

Tuesday, September 30th, 2008

ResearchBlogging.org

There’s a paper that just got out in BMC Bioinformatics that I found an interesting read.

R/parallel – speeding up bioinformatics analysis with R
Gonzalo Vera, Ritsert C Jansen and Remo L Suppi

BMC Bioinformatics 2008, 9:390 doi:10.1186/1471-2105-9-390

Background

R is the preferred tool for statistical analysis of many bioinformaticians due in part to the increasing number of freely available analytical methods. Such methods can be quickly reused and adapted to each particular experiment. However, in experiments where large amounts of data are generated, for example using high-throughput screening devices, the processing time required to analyze data is often quite long. A solution to reduce the processing time is the use of parallel computing technologies. Because R does not support parallel computations, several tools have been developed to enable such technologies. However, these tools require multiple modications to the way R programs are usually written or run. Although these tools can finally speed up the calculations, the time, skills and additional resources required to use them are an obstacle for most bioinformaticians.

Results

We have designed and implemented an R add-on package, R/parallel, that extends R by adding user-friendly parallel computing capabilities. With R/parallel any bioinformatician can now easily automate the parallel execution of loops and benefit from the multicore processor power of today’s desktop computers. Using a single and simple function, R/parallel can be integrated directly with other existing R packages. With no need to change the implemented algorithms, the processing time can be approximately reduced N-fold, N being the number of available processor cores.

Conclusions

R/parallel saves bioinformaticians time in their daily tasks of analyzing experimental data. It achieves this objective on two fronts: first, by reducing development time of parallel programs by avoiding reimplementation of existing methods and second, by reducing processing time by speeding up computations on current desktop computers. Future work is focused on extending the envelope of R/parallel by interconnecting and aggregating the power of several computers, both existing office computers and computing clusters.

It concerns an extension module for R that helps parallelising code on multi-core machines.

This is an important issue.  Data analysis is getting relatively slower and slower, as the data size increases faster than the improvements in CPU speed, so exploiting the parallelism in modern CPUs will get increasingly important.

It is not quite straightforward to do this, however.  Writing concurrent programs is much harder than writing sequential programs, and that is hard enough as it is.

Problems with concurrent programs

The two major difficulties in programming is getting it right and making it fast and both are much more difficult with parallel programs.

When several threads are running concurrently, you need all kinds of synchronisation to ensure that the input of one calculation is actually available before you start computing and to prevent threads from corrupting data structures by updating them at the same time.

This synchronisation is not only hard to get right, it also carries an overhead that can be very hard to reason about.  Just like it is difficult to know which parts of a program is using most of the CPU time without profiling, it is difficult to know which parts of a concurrent program are the synchronisation bottlenecks.

Hide the parallelism as a low-level detail

Since concurrency is so hard to get right, we would like to hide it away as much as we can. Just like we hide assembler instructions away and program in high-level languages.

This is by no means easy to do, and the threading libraries in e.g. Python and Java are not even close to doing that.

In a language such as R or Matlab, you potentially have an easier way of achieving it. A lot of operations are “vectorized”, i.e. you have a high-level instruction for performing multiple operations on vectors or matrices.  Rather than multiplying all elements in a vector using a loop

> v <- c(1,2,3)
> w <- v
> for (i in 1:3) w[i] <- 2*w[i]
> w
[1] 2 4 6

you do the multiplication in a single operation

> 2*v
[1] 2 4 6

and rather than multiplying matrices explicitly in a triple loop,

> A <- matrix(c(1,2,3,4),nc=2) ; B <- matrix(c(4,3,2,1),nc=2)
> C <- matrix(0,nc=2,nr=2)
> for (i in 1:2) for (j in 1:2) for (k in 1:2) C[i,j] <- C[i,j]+A[i,k]*B[k,j]
> C
     [,1] [,2]
[1,]   13    5
[2,]   20    8

you have a single operation for it.

> A %*% B
     [,1] [,2]
[1,]   13    5
[2,]   20    8

It is almost trivial to parallelise such operations, and if your program consists of a lot of such high-level operations, much program parallelisation can be automated.

R/parallel

This is the reasoning behind the BMC Bioinformatics paper.

The are not exactly doing what I describe above — that would be the “right” thing to do, but would require changes to the R system that you cannot directly do through an add-on package — but they provide an operation for replacing sequential loops with a parallel version.

Just replacing a sequential loop with parallel execution, assuming that the operations are independent, is always safe. The behaviour of the sequential and the parallel program is exactly the same, except for the execution time.

As such there is no extra mental overhead in introducing it to your programs.

Using it won’t necessarily speed up the program, of course. Even if the synchronisation is hidden from the programmer, the overhead is still there.

The authors leave it to the programmer to know when the parallel execution pays off (and profiling should tell him so).

It is quite likely that a sequence of small fast tasks is parallelized and, despite parallel execution, as a result of the transformation process and additional synchronization, the overall processing time can be increased. To avoid this situation, the design decision made is to let the users indicate which code regions (i.e. loops) they need to speed up.

Knowing what to run in parallel and what not to is a hard problem.  It will often depend on the data as well, if nothing else the data size.

A modern virtual machine would be able to profile the execution of the program and make a decision based on that.  Assuming it is operations that are executed more than once.

I don’t know if any virtual machines are actually doing this, though. I really have no idea how much automatic parallelisation is part of the back end of virtual machines.

I would love to know, though.  This is the way to go, to get the speedups of parallelisation without too high a mental burden for the scientist/programmer.


Gonzalo Vera, Ritsert C Jansen, Remo L Suppi (2008). R/parallel – speeding up bioinformatics analysis with R BMC Bioinformatics, 9 (390)