Archive for October, 2008

How did I ever consider Linux stable?

Thursday, October 16th, 2008

I’ve been working on my Ubuntu Linux laptop the last couple of days, and I f*cking hate it.  It seems to me that whenever I start a memory intensive job, the very next thing I do is to reboot it using the power button.  It takes it a millisecond to reach the point where it is absolutely hopeless to try to interact with it.  It is a swapping hell.

Now, I have always had problems with Linux when I fill up the RAM.  In my PhD work I did explicit state space exploration, so I used massive amounts of memory and rebooting wasn’t that unusual.  Then again, I think I could crash most other operating systems as well ;-)

These days I don’t even have to try to crash the box.  Compiling a complex enough program, or leaving firefox on for a few hours, should do the trick.  I swear, I am rebooting Linux three or four times a day!  Usually loosing a bit of work each time, of course.

Have I fucked something up on my box?  It just didn’t use to be this way…

It is not really a problem I have on the other Linux boxes I use, to be fair.  On my desktop Fedora, I won’t have time to fill up the RAM before it crashes.  It doesn’t like my graphics card, you see, so it will typically crash within 5 minutes of me logging in.  That is part of the reason I am working on my laptop.

As long as I don’t log into the box through X, but ssh in instead, it is pretty stable though.

Still, I am so fed up with Linux.  I don’t have to reboot Windows every two-three ours, and I’ve only had to reboot my Mac a few times in all the time I’ve had it.

Linux is supposed to be stable.  It is praised to be.  Still, I am finding that it is growing increasingly unstable on me.

It might just be me, of course.  I’m no wiz systems administrator. I’m sure some of my geeky friends can set up a system that wont crash all the time, but I don’t have time for this if I also want to do my science. I just install Linox out of the box and it works like shit on me.

Oh well, with the new MacBook out I will soon not have this problem any longer… I will have the UNIX tools I love but on a box I can actually manage to set up.

Genome Technology interview

Wednesday, October 15th, 2008

Last week I gave an interview to Genome Technology about R/parallel (that I’ve blogged about a few days earlier).

I don’t know if the article about R/parallel is out yet — I haven’t seen it — but below you can see the questions I got and the answers I gave…

The interview

When it comes to parallelizing software such as R, what are the inherent challenges beyond the average bench biologists?

There is a lot of parallelism going on in modern hardware, most of which you never worry about.  The compilers and CPUs take care of it for you. This has to do with how data and program instructions float around on the silicon, and usually not something you want to know about unless you develop hardware or compilers. When you do notice it, what you usually notice is how data from main RAM is vastly slower to access than data in the cache.  If you are careful with your data structures you can avoid this problem in most cases by just following a few rules of thumb (and all that means is that you should try to keep data that is accessed together, close to each other in memory address).

In languages like C++ you will also notice how virtual functions are much slower to call than non-virtual functions.  This is also a consequence of parallelism.  When you are going to call a function — which means that instructions now needs to be read from a different place in memory — the CPU can see where you are going and start fetching the instructions.  With virtual functions, you don’t know where you are going until you’ve computed exactly where you want to execute (this is called virtual table lookup, but it just means that you are computing a point in the instructions to jump to, rather than have that point explicit in the instructions).  In that case, the CPU cannot start fetching the instructions, so there will be a delay in the function call.

But so far I am not really answering the question; I’m just warming up to that.

I just wanted to make it clear that parallel processes is not new to computing and doesn’t need to be something we have to worry about. Except that we do have to worry about it right now, until systems developers can hide it away under layers of abstraction — just like R/parallel is trying to do.

Why do we have to worry about it now?

To make CPUs faster it is no longer sufficient to just add more and more transistors, closer and closer together, on a chip.  There is a physical limit to how much longer we can do that.

If we cannot increase the speed of the individual processors, then at least we can do more in parallel, but now the software needs to help the hardware.  The parallelisation needs to be dealt with in the
software layer.

On your desktop computer, you are seeing this in several places.

  1. The CPU has instructions, so called SIMD, for performing the same operation on multiple data in parallel.
  2. The CPU has multiple cores (it is hard to find one now with less than two and soon that will be four)
  3. Your graphics card is a very powerful computer in itself (the processor there is called a GPU, and is a bit harder to program than a CPU, but that will change over time).

I’m not going to comment on 3. any further.  Using GPUs for scientific computing is very hot these days, but is probably best left to the computer scientists for now.

The parallelisation in 1. is relatively easy to work with as a programmer.  There are some hardware considerations to deal with, like how data should be formatted in memory, but it is not conceptually hard to deal with.

Even better, it can be automatically applied to a large degree in languages like R.  In C/C++ for example, you don’t have high-level constructs to perform an operation on all elements in a vector (which is the kind of operations ideal for this kind of parallelisation).  In R you do, so whereas it can be hard for a C/C++ compiler to automatically use this kind of parallelisation, in R it would be
almost trivial to apply it.  This is what I wrote about in my blog post.

The parallelisation in 2. is more difficult to deal with.

Even if computations have been running on highly parallelised hardware for a while, conceptually the programs have been “single threaded”. At every point of time in the computation, conceptually a single instruction is being executed, and the state of the program is deterministically determined by the input and the instructions executed so far.

With multiple cores, the program can run several threads in parallel, and the program is now conceptually parallelised.

Multiple threads are not new to computer science.  We have had them as a concept for decades.  Even when the actual processor could only execute a single instruction at a time, it would sometimes be beneficial to think of the program as running in parallel.  And for various reasons that i wont go into, you could also get significant speed improvements, but for different reasons than we typically think about in scientific computations.

Now, with multiple cores, you get a significant runtime benefit for your computations if you use multiple threads.  If you have four cores on your CPU, then you can, under ideal conditions, run your programs four times faster than you could on a single core.

Dealing with this kind of parallelism is very hard, though (and now I am finally getting to the actual question).

The problem, as I see it, is that our brains just find it hard to think about parallelism.  I know, it sounds weird, ’cause in the real world everything happens concurrently and we deal with it.

On the other hand, people who wouldn’t recognise a differential equation if it sat on their lap, can still catch a ball, so clearly there are some things we find easy to deal with but hard to reason about, and concurrency is one of these things.

The problem is, that when the program solves several tasks at the same time, we need a way of coordinate these tasks.  We need to have the input available before we start processing, and deliver the output when it is needed.

This is something we know from our own life, but on the computer there is an added complexity:  the program is not smart enough to know that it doesn’t have the data it needs, and will happily process garbage input if the real input isn’t available.  Likewise, it will not think twice about overwriting important current data with bogus outdated data, if it gets behind the main program.

We have to explicitly program how the communication and synchronisation rules should be, and our brains are pretty bad at reasoning about this.

We forget about important situations where synchronisation is necessary and we sometimes add rules that lets each thread wait for some other thread to complete, so no thread can actually continue its work and the whole system is deadlocked.

There are rules you can follow to avoid these problems, but even very experienced people find it hard and make mistakes.

On a side note, a large field in theoretical computer science — concurrency theory — works on ways to deal with this problem: rules for constructing programs so the errors are avoiding, or methods for
analysing models of systems to prove that they have the intended behaviour.  (My PhD work was on the later).

Unfortunately, most of theoretical work is far from being usable in real programming situations.  So in practise we are relying on rules of thumbs and experience, and neither works that well.

Of course, all the problems with multi-threaded programs just get much, much harder on multi-computer systems (clusters or networks of services).  There the synchronisation is even worse, and on top of that individual computers can crash and the system must be able to
deal with that.

Leave that to the computer scientists for now…

If you are not experienced in parallel programming, and your interests are in, say, biology and not computer science, you will probably find dealing with concurrency issues a pain in the neck.  You just want the computer to crunch your numbers with the equations you’ve given it, and really you shouldn’t have to worry about how it does it.

Programming languages do not yet hide this kind of parallelism.  We have high-level constructions to describe our mathematics (to varying degrees), but when it comes to parallel execution, we are just not there yet.

R/parallel is a small step in that direction.  It gives you a language construction for executing what would otherwise be a sequential loop, in parallel.  It then deals with distributing the tasks to threads and synchronising them, so you, as the programmer, won’t have to worry about it.

This idea is not new.  There has been programming languages with such constructions for ages.  It just hasn’t made it into the main-stream languages.

How should one go about deciding what parts of code to make parallel and what parts to leave alone? As you say, the BMC paper leaves it up the researcher to decide, but do you think this might be beyond the average user and is better left to parallel programming experts?

To be absolutely honest, even the experts probably wouldn’t know where to parallelise just by looking at a program.

Just like it is notoriously hard to reason about where the bottlenecks are in a sequential programs, and how to get around them, it is hard to reason about where parallelisation will give you a boost.

The good news is that it isn’t that hard to figure it out.  All you have to do is profile your program, and then you know the hotspots you need to focus on.

Trial and error will show you where you get a performance improvement by parallelising (and with R/parallel there isn’t much work involved in doing this testing, compared to programming it up manually).

The experts have done this many times before and might have a better intuition about where to try parallelisation first, but really I would say that knowing the problem the program is solving — and thereby knowing where the real number crunching is going on — is just as helpful.

This is not where we want to invoke computer scientists.  If they just give us to tools to experiment without much of an overhead, then we’re fine.

You mention there are some limitations with parallelizing R through an add-on package such as the one described in the paper vs. what you describe on your post, can you explain the difference in layman’s terms?

An add-on package can give you a new abstraction, like in R/parallel, that lets you tell R to “do this stuff here in parallel rather than in sequence”.

It is a lot easier to use such an abstraction than to program the parallelisation yourself, but it still leaves it up to you to worry about parallelisation.

Ideally, you just want to tell the computer which equations to solve, and not have to worry about how it does it.

Although you might not think this when you are programming, you are a lot closer to this ideal than you could be.  You might think that you have to tell the computer exactly how to solve a problem, because you have to worry about loops and exceptions and whatnot, but you are actually very far removed from the actual computations.

With high-level languages, you tell the program to perform some operation (say exponentiating) on all elements in a list.  The operations that the computer actually have to do are at a much lower
level.

To take the same example in the context of parallelising a program, you should be able to tell the computer to exponentiate all elements in a list and not have to worry about whether it needs to do
it in parallel or in sequence.

You want to be able to just write

> result <- exp(v)

and not worry about whether you want

> result <- exp(v)

or

> result <- run.parallel(exp,v)

Of course, if we need all the trial and error to figure out when parallelisation is worthwhile, can we trust the computer to make the decision for us?

If you just compile the program and have to make the decision without knowing about performance hotspots, then no.

But here’s the thing: when we are running the program, we can profile it, and then we know the hotspots, so we can at that point replace the sequential execution with parallel execution where it will actually improve performance.

Modern virtual machines already do something similar for code generation.  You might interpret byte-code in your virtual machine, but the performance-critical parts will quickly be recognised and compiled into machine code, so those parts are executed much faster than they would otherwise be.

The virtual machines does much more than that, though: they recognise the common cases in your methods and optimise for them.  Remember the virtual functions I mentioned above? Since there is a high penalty with those, a virtual machine can remove them by generating code where the function is not looked up in a table but directly in the code. When the virtual machine recognises that it is running the common-case scenario, it runs that generated code (with an efficient function call), and otherwise the generic code (which is slower, but won’t be called that often).

To put it simply: when we are running our programs we know where the bottlenecks are, and can optimise them.  This requires that the virtual machine does the profiling and optimisation, though, and is not something you can just add onto it with a library.

Do you think that some of the currently available commercial parallel tool kits, such as he ones offered by Interactive Supercomputing and Revolution Computing, which both offer  ways to parallelize R, offer something more powerful or robust than what’s being described in R/parallel paper?

I am not familiar with those tool kits, so I don’t know.  What is offered in R/parallel is pretty basic.  It is the first small step in the right direction.  If you have money and time to throw at it, it shouldn’t be much of a problem to improve it a lot, so I wouldn’t be surprised if commercial packages are better.

They won’t necessarily stay better, though.  If an Open Source project for parallelising R really takes off (like R itself has done), then there is a lot of motivated programmers working on it, and that is
hard to keep up with as a company.

Framework for whole genome sequence analysis

Friday, October 10th, 2008

I’m working on a project right now where I am analysing a whole genome alignment of human, chimp, orangutan and macaque.

I’ve downloaded the alignment from the UCSC genome browser, it is in the MAF format, and I use the bx-python module from Penn State for the analysis. That is a very nice module, by the way.

For scanning through the alignment and collecting various statistics, MAF+bx-python is a fine combination, but now I have some interesting regions I want to analyse in more depth, and then the combination is not quite good enough.

The MAF files are essentially sequential chunks of alignments, and to find a particular region in the genome, I need to scan through a large file until I find the region.

I have solved this problem by extracting the regions and put them in a number of different files, that I can then easier access.  But wouldn’t it be nice if I could just seek() to the relevant position?  In any of the species coordinates just to make it even nicer?

The bx-python module lets me extract sub-alignments, so

s = a.slice(begin,end)

extracts the sub-alignment of a beginning at column begin and ending at end-1.

(I can also get a sub-set of species, if you are wondering, but the method is slightly different for that).

I would love to be able to get a sub-alignment based on species coordinates instead of column numbers, so

s = a.slice(begin,end,coordinates='hg18')

would give me the alignment that maches the region [begin,end[ in humans (hg18 assembly).  With a proper data representation, it should also give me this without scanning linearly through the file…

Does anyone know about a multiple alignment representation that lets me look up slices in files without scanning through all the alignment leading up to that location?

Surely genome browsers do not scan linearly through genomes to display their information — it would never be fast enough…

What do people use?

Some thoughs on grid computing…

Wednesday, October 8th, 2008

Earlier this week, the LHC Computing Grid went online.  A description of the system can be found here, and blog posts about it here, here and here.

This got me thinking about grid computing for small scale scientists like myself.

I’ve had some experience with grid computing (see an old post about it here) but mostly I have found it too much trouble to be worth the effort.

Our typical computer use

For large projects that require years of CPU time, it is well worth the effort to set up the infrastructure to run computations on grids.  You really need it to get your the computations done, and the overhead is very small in comparison with the actual computation time.

Most of my projects — and most of the projects we do at BiRC — are a bit different.

We do need the computation power, but we are usually tinkering with our programs for most of a project — since we rarely know exactly how to analyse our data until we are mostly done with it — so we cannot just distribute a fixed version of our software and then start distributing the computations.

The typical work flow is that we write a program for our analysis, then we run the analysis and when we look at the results we find some strange results here and there. Then we extend the software to either extract more information from the data, or to fix a bug that caused the weird results.

We then need to run the analysis again, and repeat the process.

The analysis might take a few CPU days to a few CPU months — so it is small scale for grid applications — but between each analysis we spend a week or so modifying and testing our software.

We have a small cluster of Linux computers for this, and it is always in one of two states: completely overloaded or burning idle cycles.

This is the situation grid computing could fix.  Theoretically we should be able to get CPU cycles off the grid when we need it, and sell it to the grid when we are not running computations ourselves.

In practice, our work pattern makes this difficult.

The problems with small scale grid computing

If you are changing your software all the time, you need to distribute it together with the data you analyse.

This means you either send compiled binaries with the job submissions, or you compile the software as part of the job.

The former is fine if you have a program you can compile — and you’d better link it statically ’cause there is no guarantees about the libraries you can find on the resources that will run it.

If you have a bunch of scripts, you are not so lucky.

There are no guarantees that the computer that will run the computations has the script interpreter — or if it does that it is a version that can run your script — and even if it does, what about the modules you need?

You don’t want to have to compile BioPython or SciPy on a grid machine just to run your scripts.  The overhead in CPU time is going to be several percentage of your actual run (at least if you parallelise your computations to high enough a degree to be worth the grid in the first place), and how can you even know that there is a compiler to compile it at the other end?  You can’t, and there probably isn’t unless you are very lucky.

It is a major pain to see your jobs aborted after slowly making their way through the job queue, just because the host computer cannot even setup the environment you need for your computations.

What can we do about it?

If we want to use the grid for even smaller scale computations, at the very least we need an easier way to distribute new versions of our programs.

I have an idea for this.

Some grids, at least, are already dealing with “runtime environments” where you can specify that your job needs to run in a certain runtime environment, and the scheduler will only send your jobs to resources that can provide that environment.

This sounds like just the thing, but the catch is that it is up to the resource administrators to set up these environments and to tell the grid system that they provide them.

For something like LHC, it is probably not a problem to convince administrators to provide the right environment, but for Thomas Mailund it is.

What we need is a way for the grid users to be able to install environments on the resources!

So how about this: we introduce the concept of “runtime environment packages” that we can upload to the grid system.  They consists of a setup script (configure ; make) and a test suite, for example.

When a resource is idle, it tests if there are new environments available in queue, downloads these, and tries to build and test them.  If it succeeds, it informs the grid system that it can run the new type of environment.  The scheduler only sends jobs to resources that have the right environments, so if your environment tests are working properly, you never end up on a resource that cannot run your jobs.

We could even add environment requirements on the environment packages, so they don’t have to be self-contained.  E.g. to install SciPy, you don’t want to have to install Python itself, and there is no reason for resources without Python to try to install it only to give up.

To prevent resources to be filled up with old environment, we can add a time out period to environements, so they are deleted when they haven’t been used for a couple of days/weeks/months.

It shouldn’t be that hard to implement.  I am sure I could do it, but I don’t have my own grid infrastructure to work with, so I guess I’ll have to intimidate persuade someone else to do it…

StatAlign: a new statistical alignment tool

Wednesday, October 8th, 2008

ResearchBlogging.orgThere’s an application note in the current issue of Bioinformatics that describes a new tool for statistical alignment, StatAlign, developed in my old group in Oxford.

StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees

Ádám Novák , István Miklós, Rune Lyngsø and Jotun Hein

Bioinformatics 2008 24(20):2403-2404

Motivation: Bayesian analysis is one of the most popular methods in phylogenetic inference. The most commonly used methods fix a single multiple alignment and consider only substitutions as phylogenetically informative mutations, though alignments and phylogenies should be inferred jointly as insertions and deletions also carry informative signals. Methods addressing these issues have been developed only recently and there has not been so far a user-friendly program with a graphical interface that implements these methods.

Results: We have developed an extendable software package in the Java programming language that samples from the joint posterior distribution of phylogenies, alignments and evolutionary parameters by applying the Markov chain Monte Carlo method. The package also offers tools for efficient on-the-fly summarization of the results. It has a graphical interface to configure, start and supervise the analysis, to track the status of the Markov chain and to save the results. The background model for insertions and deletions can be combined with any substitution model. It is easy to add new substitution models to the software package as plugins. The samples from the Markov chain can be summarized in several ways, and new postprocessing plugins may also be installed.

I am personally a firm believer in statistical alignment.  I think it is the way to go, to deal with the uncertainty in inferred alignments and to avoid the artefacts they can create.

For a good introduction to the problems (and how statistical approaches to alignment can help), you should read Lunter et al. Uncertainty in homology inferences: Assessing and improving genomic sequence alignment Genome Res. 18:298-309, 2008 (or my summary of it here).

StatAlign, the tool in the application note, looks like a nice way to attack alignments. Unlike previous approaches I’ve blogged about — and unlike my own small work in statistical alignment — it deals with multiple sequences (where MCMC is needed besides just HMMs).

It samples over both alignments and phylogenies, which is nice if there is any uncertainty in the phylogeny inference (which is typically based on alignments in the first place).

I can imagine that integrating over the phylogenies in the MCMC is the main time-killer, though, so it could be nice if you can turn that part of the state space exploration off in case you have a reasonable idea about the phylogeny but you are uncertain about some parts of the alignment…


A. Novak, I. Miklos, R. Lyngso, J. Hein (2008). StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees Bioinformatics, 24 (20), 2403-2404 DOI: 10.1093/bioinformatics/btn457