Archive for June, 2009

Virtualisation on the cloud

Saturday, June 13th, 2009

Yesterday I was at an interesting talk at the computer science department: Virtualization will be the Operating System of the Cloud, Steffen Grarup, VMWare

Virtualization is a neat approach to cloud computing.  You bundle up your application with the OS and whatever else it needs and pack it in a VMWare virtual machine, and then it can run on any machine on the cloud.  Well, any Intel or AMD machine, I guess, I don’t know what else, but of course there are some hardware limits.  The cool thing is that the dependencies on the software on the computers “out there” – that you have very little control over – is virtually gone.

I expect there to be a lot of devils in the details – like what happens if your code depends on e.g. SSE instruction sets that are not available on all processors, or what happens if your code thinks it has two or four cores and optimises its thread usage for that, but the real hardware has fewer – but a neat idea it is.

164-169=-5

Dinosaurs may not be the ancestor of birds

Saturday, June 13th, 2009

Story here: Discovery Raises New Doubts About Dinosaur-bird Links

Researchers at Oregon State University have made a fundamental new discovery about how birds breathe and have a lung capacity that allows for flight – and the finding means it’s unlikely that birds descended from any known theropod dinosaurs.

The conclusions add to other evolving evidence that may finally force many paleontologists to reconsider their long-held belief that modern birds are the direct descendants of ancient, meat-eating dinosaurs, OSU researchers say.

Sadly I know too little to really follow the arguments, so I am not sure how strong they are, but it is interesting nonetheless.

164-168=-4

SciPy at last!

Thursday, June 11th, 2009

Scipy is probably the software I have missed the most since I moved to OS X.  I have used it a lot for data analysis whenever the data wasn’t in a simple table form idea for R.  It really is a wonderful module if you do any kind of scientific computing in Python.

I have tried several times to install it on my Mac(s) but with limited success.  On my Mac at the office I got most of it installed, but never managed to get matplotlib working.  On my iMac at home and on my MacBook I could get numpy up and working but not scipy proper.

Until now, that is.  I found this wonderful package: Scipy Superpack.  Well, it is really just a shell script, but it just installs scipy and a bunch of other modules, so now I finally have my old Python toolbox up and running again.

162-167=-5

Pandemic

Thursday, June 11th, 2009

It’s official, the swine flu has been upgraded to pandemic.

WHO chief Dr. Margaret Chan made the announcement Thursday after the U.N. agency held an emergency meeting with flu experts. Chan said she was moving the world to phase 6 — the agency’s highest alert level — which means a pandemic, or global epidemic, is under way.
“The world is moving into the early days of its first influenza pandemic in the 21st century,” Chan told reporters. “The (swine flu) virus is now unstoppable.”

The last pandemic — the Hong Kong flu of 1968 — killed about 1 million people. Ordinary flu kills about 250,000 to 500,000 people each year.

162-166=-4

Multi-core HMMs

Thursday, June 11th, 2009

I’m working on a project where it looks like I could end up with hidden Markov models with hundreds if not thousands of states, and with transition matrices where all entries are non-zero.  Since, for such models, the running time for just calculating the likelihood (not even optimising it) is O(m\cdot n^2) where m is the length of the input and n is the number of states, this could be a problem.  Especially since I’m working on whole genome data, so m can get pretty large as well.

It is part of our CoalHMM work, but I’ll tell you more about that later.  For this post, the actual application isn’t all that important, only the size of the model.

So, anyway, needless to say I need a fast implementation if this is ever going to work.

HMM algorithms

The basic hidden Markov model algorithms are essentially just filling in dynamic programming tables.

The “Forward” algorithm, for example, looks something like

for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        }
        F[i,j] = sum;
    }
}

There’s a few more details in a real implementation, ’cause this version will have underflows in a few steps, but this is the essential idea.

SIMD speedups

We have developed a small library with these algorithms here at BiRC (mainly implemented by one of my student programmers, Andreas) that speeds up the algorithms using single instruction, multiple data (SIMD) instructions.

The idea here is that we can calculate the sums as above by adding several entries in parallel.  For doubles we can handle two numbers in parallel, for floats we can handle four, so ideally we can speed up the algorithms by a factor of two or four.

We don’t exactly achieve that – there is some overhead involved – but it is not too bad.  The figure below compares the running time with and without SIMD for varying number of states, n, with a fixed input length m=10,000.

The number type – float or double – doesn’t matter for the plain algorithm since the same number of operations is used, so I’ve only plotted the time for doubles.

This is all good and well, but with really large HMMs I think we will need even more of a boost.

Multi-core speedups

It is not hard to get your hands on a multi-core machine these days – if anything it is hard to get a single core machine.  Not massively multi-core machines, although I expect to see those in a few years, but at least dual or quad core machines. So I was hoping to get another factor of two or four from running two or four threads.

In general, parallelising the instructions in the program is harder than parallelising the data operations as we do with the SIMD.  For one thing, there is a lot of synchronisation issues when running multiple threads, and this leads to a lot of overhead.

When filling in the dynamic programming table as above, each column in the table depends on the previous column, but each cell in the column is independent from the other cells in the column.  Thus, we should be able to process each cell in a column independently, but synchronise the threads after each column so we ensure that the previous column is always fully processed.

Since the synchronisation isn’t free, we expect that we need a large number of rows in the table to make it worthwhile, but in my project I expect hundreds to thousands of states, so perhaps there is some gain to it.

I decided to try it out, anyway.

Setting up threads and synchronising them and all that used to require some programming effort, but now that we have OpenMP it is very easy to add parallelism to a program.  To parallelise the algorithm above, you simply add a pragma:

for (int i = 0; i < m; ++i) {
#pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) {
            sum += T[k,j] * E[j,x[i]];
        }
        F[i,j] = sum;
    }
}

and by magic the middle loop is parallelised so the body of it is run in different threads.  Setting up the threads, assigning ranges of the sum to the different threads, synchronising them after the loop, etc. is all taken care of by the compiler.

Neat.

The results are plotted below (on my dual core iMac):

As is evident, there is a large overhead in the threaded version, so for anything below a few hundred states it is slowing the algorithm down instead of speeding it up, but eventually the parallelism does kick in.

Well, for the version without SIMD at least, which is the one I’ve plotted above.

For double SIMD, it is just worth it for the large number of states

while for floats it isn’t really worthwhile for this range of number of states

Of course, this is just the very first attempt at parallelising the algorithms, so there might still be some tricks to try.  For one thing, I have a feeling I can get rid of some of the thread overhead by adding a pragma outside the outer loop. I recall seeing something like that before but I don’t remember quite how it worked.

Anyway, I’ll play with it some more later.

162-165=-3