Genome Technology interview
Wednesday, October 15th, 2008Last week I gave an interview to Genome Technology about R/parallel (that I've blogged about a few days earlier).
I don't know if the article about R/parallel is out yet -- I haven't seen it -- but below you can see the questions I got and the answers I gave...
The interview
When it comes to parallelizing software such as R, what are the inherent challenges beyond the average bench biologists?
There is a lot of parallelism going on in modern hardware, most of which you never worry about. The compilers and CPUs take care of it for you. This has to do with how data and program instructions float around on the silicon, and usually not something you want to know about unless you develop hardware or compilers. When you do notice it, what you usually notice is how data from main RAM is vastly slower to access than data in the cache. If you are careful with your data structures you can avoid this problem in most cases by just following a few rules of thumb (and all that means is that you should try to keep data that is accessed together, close to each other in memory address).
In languages like C++ you will also notice how virtual functions are much slower to call than non-virtual functions. This is also a consequence of parallelism. When you are going to call a function -- which means that instructions now needs to be read from a different place in memory -- the CPU can see where you are going and start fetching the instructions. With virtual functions, you don't know where you are going until you've computed exactly where you want to execute (this is called virtual table lookup, but it just means that you are computing a point in the instructions to jump to, rather than have that point explicit in the instructions). In that case, the CPU cannot start fetching the instructions, so there will be a delay in the function call.
But so far I am not really answering the question; I'm just warming up to that.
I just wanted to make it clear that parallel processes is not new to computing and doesn't need to be something we have to worry about. Except that we do have to worry about it right now, until systems developers can hide it away under layers of abstraction -- just like R/parallel is trying to do.
Why do we have to worry about it now?
To make CPUs faster it is no longer sufficient to just add more and more transistors, closer and closer together, on a chip. There is a physical limit to how much longer we can do that.
If we cannot increase the speed of the individual processors, then at least we can do more in parallel, but now the software needs to help the hardware. The parallelisation needs to be dealt with in the
software layer.
On your desktop computer, you are seeing this in several places.
- The CPU has instructions, so called SIMD, for performing the same operation on multiple data in parallel.
- The CPU has multiple cores (it is hard to find one now with less than two and soon that will be four)
- Your graphics card is a very powerful computer in itself (the processor there is called a GPU, and is a bit harder to program than a CPU, but that will change over time).
I'm not going to comment on 3. any further. Using GPUs for scientific computing is very hot these days, but is probably best left to the computer scientists for now.
The parallelisation in 1. is relatively easy to work with as a programmer. There are some hardware considerations to deal with, like how data should be formatted in memory, but it is not conceptually hard to deal with.
Even better, it can be automatically applied to a large degree in languages like R. In C/C++ for example, you don't have high-level constructs to perform an operation on all elements in a vector (which is the kind of operations ideal for this kind of parallelisation). In R you do, so whereas it can be hard for a C/C++ compiler to automatically use this kind of parallelisation, in R it would be
almost trivial to apply it. This is what I wrote about in my blog post.
The parallelisation in 2. is more difficult to deal with.
Even if computations have been running on highly parallelised hardware for a while, conceptually the programs have been "single threaded". At every point of time in the computation, conceptually a single instruction is being executed, and the state of the program is deterministically determined by the input and the instructions executed so far.
With multiple cores, the program can run several threads in parallel, and the program is now conceptually parallelised.
Multiple threads are not new to computer science. We have had them as a concept for decades. Even when the actual processor could only execute a single instruction at a time, it would sometimes be beneficial to think of the program as running in parallel. And for various reasons that i wont go into, you could also get significant speed improvements, but for different reasons than we typically think about in scientific computations.
Now, with multiple cores, you get a significant runtime benefit for your computations if you use multiple threads. If you have four cores on your CPU, then you can, under ideal conditions, run your programs four times faster than you could on a single core.
Dealing with this kind of parallelism is very hard, though (and now I am finally getting to the actual question).
The problem, as I see it, is that our brains just find it hard to think about parallelism. I know, it sounds weird, 'cause in the real world everything happens concurrently and we deal with it.
On the other hand, people who wouldn't recognise a differential equation if it sat on their lap, can still catch a ball, so clearly there are some things we find easy to deal with but hard to reason about, and concurrency is one of these things.
The problem is, that when the program solves several tasks at the same time, we need a way of coordinate these tasks. We need to have the input available before we start processing, and deliver the output when it is needed.
This is something we know from our own life, but on the computer there is an added complexity: the program is not smart enough to know that it doesn't have the data it needs, and will happily process garbage input if the real input isn't available. Likewise, it will not think twice about overwriting important current data with bogus outdated data, if it gets behind the main program.
We have to explicitly program how the communication and synchronisation rules should be, and our brains are pretty bad at reasoning about this.
We forget about important situations where synchronisation is necessary and we sometimes add rules that lets each thread wait for some other thread to complete, so no thread can actually continue its work and the whole system is deadlocked.
There are rules you can follow to avoid these problems, but even very experienced people find it hard and make mistakes.
On a side note, a large field in theoretical computer science -- concurrency theory -- works on ways to deal with this problem: rules for constructing programs so the errors are avoiding, or methods for
analysing models of systems to prove that they have the intended behaviour. (My PhD work was on the later).
Unfortunately, most of theoretical work is far from being usable in real programming situations. So in practise we are relying on rules of thumbs and experience, and neither works that well.
Of course, all the problems with multi-threaded programs just get much, much harder on multi-computer systems (clusters or networks of services). There the synchronisation is even worse, and on top of that individual computers can crash and the system must be able to
deal with that.
Leave that to the computer scientists for now...
If you are not experienced in parallel programming, and your interests are in, say, biology and not computer science, you will probably find dealing with concurrency issues a pain in the neck. You just want the computer to crunch your numbers with the equations you've given it, and really you shouldn't have to worry about how it does it.
Programming languages do not yet hide this kind of parallelism. We have high-level constructions to describe our mathematics (to varying degrees), but when it comes to parallel execution, we are just not there yet.
R/parallel is a small step in that direction. It gives you a language construction for executing what would otherwise be a sequential loop, in parallel. It then deals with distributing the tasks to threads and synchronising them, so you, as the programmer, won't have to worry about it.
This idea is not new. There has been programming languages with such constructions for ages. It just hasn't made it into the main-stream languages.
How should one go about deciding what parts of code to make parallel and what parts to leave alone? As you say, the BMC paper leaves it up the researcher to decide, but do you think this might be beyond the average user and is better left to parallel programming experts?
To be absolutely honest, even the experts probably wouldn't know where to parallelise just by looking at a program.
Just like it is notoriously hard to reason about where the bottlenecks are in a sequential programs, and how to get around them, it is hard to reason about where parallelisation will give you a boost.
The good news is that it isn't that hard to figure it out. All you have to do is profile your program, and then you know the hotspots you need to focus on.
Trial and error will show you where you get a performance improvement by parallelising (and with R/parallel there isn't much work involved in doing this testing, compared to programming it up manually).
The experts have done this many times before and might have a better intuition about where to try parallelisation first, but really I would say that knowing the problem the program is solving -- and thereby knowing where the real number crunching is going on -- is just as helpful.
This is not where we want to invoke computer scientists. If they just give us to tools to experiment without much of an overhead, then we're fine.
You mention there are some limitations with parallelizing R through an add-on package such as the one described in the paper vs. what you describe on your post, can you explain the difference in layman's terms?
An add-on package can give you a new abstraction, like in R/parallel, that lets you tell R to "do this stuff here in parallel rather than in sequence".
It is a lot easier to use such an abstraction than to program the parallelisation yourself, but it still leaves it up to you to worry about parallelisation.
Ideally, you just want to tell the computer which equations to solve, and not have to worry about how it does it.
Although you might not think this when you are programming, you are a lot closer to this ideal than you could be. You might think that you have to tell the computer exactly how to solve a problem, because you have to worry about loops and exceptions and whatnot, but you are actually very far removed from the actual computations.
With high-level languages, you tell the program to perform some operation (say exponentiating) on all elements in a list. The operations that the computer actually have to do are at a much lower
level.
To take the same example in the context of parallelising a program, you should be able to tell the computer to exponentiate all elements in a list and not have to worry about whether it needs to do
it in parallel or in sequence.
You want to be able to just write
> result <- exp(v)
and not worry about whether you want
> result <- exp(v)
or
> result <- run.parallel(exp,v)
Of course, if we need all the trial and error to figure out when parallelisation is worthwhile, can we trust the computer to make the decision for us?
If you just compile the program and have to make the decision without knowing about performance hotspots, then no.
But here's the thing: when we are running the program, we can profile it, and then we know the hotspots, so we can at that point replace the sequential execution with parallel execution where it will actually improve performance.
Modern virtual machines already do something similar for code generation. You might interpret byte-code in your virtual machine, but the performance-critical parts will quickly be recognised and compiled into machine code, so those parts are executed much faster than they would otherwise be.
The virtual machines does much more than that, though: they recognise the common cases in your methods and optimise for them. Remember the virtual functions I mentioned above? Since there is a high penalty with those, a virtual machine can remove them by generating code where the function is not looked up in a table but directly in the code. When the virtual machine recognises that it is running the common-case scenario, it runs that generated code (with an efficient function call), and otherwise the generic code (which is slower, but won't be called that often).
To put it simply: when we are running our programs we know where the bottlenecks are, and can optimise them. This requires that the virtual machine does the profiling and optimisation, though, and is not something you can just add onto it with a library.
Do you think that some of the currently available commercial parallel tool kits, such as he ones offered by Interactive Supercomputing and Revolution Computing, which both offer ways to parallelize R, offer something more powerful or robust than what's being described in R/parallel paper?
I am not familiar with those tool kits, so I don't know. What is offered in R/parallel is pretty basic. It is the first small step in the right direction. If you have money and time to throw at it, it shouldn't be much of a problem to improve it a lot, so I wouldn't be surprised if commercial packages are better.
They won't necessarily stay better, though. If an Open Source project for parallelising R really takes off (like R itself has done), then there is a lot of motivated programmers working on it, and that is
hard to keep up with as a company.