Code rot?

I’ve just released a new version of QuickJoin. I only needed to add a tiny little feature, so it wasn’t that big a deal, but I was horrified by the code. QuickJoin is from 2003 and one of the first applications I wrote in C++. The first two were QDist and SplitDist, and I dare not look at the code in those.

Did someone go in and change my code, or was I really that bad at C++ back then? Had I even heard of std::auto_ptr<>?

Social networks, Web 2.0 and stuff…

Honestly, I do not spend much time on social networks like Facebook. I have an account there because Saskia invited me, but I don’t really go there unless someone sends me a message.

Still, today I joined yet another such network, PLURK. In my defense, it was Amir who invited me, and I was curious to see the app he’s been working on for the last couple of months. It is essentially just a timeline where you write what you are doing at the moment, so friends can keep track of each other that way.

I don’t see the point of this, but then I don’t see the point of Facebook either, and that has turned out to be a successful business, so who knows? Maybe Amir will make a fortune on this and never have to take any of my classes after all…

RReportGenerator: Automatic reports from routine statistical analysis using R

Is something like this really useful?

RReportGenerator: Automatic reports from routine statistical analysis using R

W. Raffelsberger et al.

Bioinformatics Advance Access published online on November 24, 2007

With the establishment of high-throughput screening methods there is an increasing need for automatic analysis methods. Here we present RReportGenerator, a user-friendly portal for automatic routine analysis using the statistical platform R and Bioconductor. RReportGenerator is designed to analyze data using predefined analysis scenarios via a graphical user interface (GUI). A report in pdf-format combining text, figures and tables is automatically generated and results may be exported. To demonstrate suitable analysis tasks we provide direct web-access to a collection of analysis scenarios for summarizing data from transfected cell arrays (TCA), segmentation of CGH data, and microarray quality control and normalization.

I haven’t tried the package they describe, but it sounds like it wraps R so that analyses can be run from a GUI, producing a PDF report from the results.

When I use R, I usually do not know exactly how to analyse my data, so it is always very exploratory, and there is no way I could automate that. But then I am probably not the kind of user this package is aimed at, and I can certainly recognize the kind of R users who would be better off sheltered from the gory details of R behind a GUI…

I don’t know, maybe I’ll try it out some time.


The citation for Research Blogger:
Raffelsberger, W., Krause, Y., Moulinier, L., Kieffer, D., Morand, A., Brino, L., Poch, O. (2007). RReportGenerator: automatic reports from routine statistical analysis using R. Bioinformatics, 24(2), 276-278. DOI: 10.1093/bioinformatics/btm556

I wonder when they’ll tell me when to teach

The new term starts next week. I will teach a course on systems biology. I have no idea when! I don’t even know if it has been scheduled yet.

Classes at the computer science department are scheduled by the department itself. I teach two courses there, though neither of them this coming term. When the classes are scheduled, usually a week or two before the term starts, they are put on a web-page: http://www.daimi.au.dk/courses/schedules/, so the schedule is always easy to find.

My remaining classes are scheduled somewhere else, but I don’t know where. I thought it was at the Faculty of Science, at least that is what I’ve been told, but when I mailed the student office I was told that the classes were scheduled at the individual departments. I know that we do not schedule the courses at BiRC, so now I wonder where my courses are scheduled, if at all.

On the faculty web-pages you have to be a bit inventive when searching for classes in bioinformatics. They are put under different (and, I suspect, random) departments. There is a heading called Bioinformatics, but that only contains the course descriptions. The class schedules are scattered all over the place: my last class was under biology, the one before that under statistics, and so on.

I guess I should be grateful that the course descriptions are under bioinformatics; previously they (but only they) were labelled “interdisciplinary”. That was only on the Danish pages, though. Bioinformatics wasn’t even mentioned on the English pages.

Anyway, it usually takes a bit of web-searching to find out when to teach (and forget about using the search feature on the faculty web-page; it has never managed to find what I’ve been searching for).

This time around, my search ended up on an empty page. Does that mean that the schedule hasn’t been made yet, or that I’ve found the wrong page? Who knows?

Until I find out when I’ll be teaching I cannot plan my time for the coming weeks, and I have to schedule a few meetings.

This blows!

Let’s kill desktop computing

I don’t want to get rid of the desktop computer. I like it. It is a nice interface for communicating with my computation tasks, not to mention messaging, emailing, blogging, image editing etc. It is just that I don’t want my desktop computer to run most of my computation tasks. The desktop computer should be for interacting with computations, but there is really no reason why it should also carry out those computations.

A typical situation for me is that I either have very light computational requirements (what is needed for text processing, or maybe compiling a TeX document) or I need a lot of computing power (when I am running the data analysis in my research). I don’t think this is atypical for scientists; at least, it is a situation I share with most of the people at BiRC.

My desktop computers are not powerful enough to deal with my scientific computing — well, they are, but it takes ages to run on them — but they have plenty of power to spare when I am just doing “office work”.

Grid computing

The solution has been around for years and is called grid computing. Even way back when I was teaching networks and distributed systems at the computer science department, we would cover the ideas behind grid computing (back then it was really just client/server architectures, RPC and later RMI, distributed file systems, and so on, but the ideas that are now called grid computing were around).

What we really want is to connect all the computers on the net into one big honking system where we can get the computer power we need, when we need it, from all those machines that are idle anyway. On the rare occasions where we need a super computer, we want to be connected to that as well. Of course, we do not want to pay for having a whole network of computers just standing around waiting for us to need them — much less having a super computer sitting idle waiting for us — but when we need the computer power, we want to be connected to it.

Sun tried to sell the idea with the slogan “the network is the computer”. I don’t really know how well that went, but I haven’t heard the slogan for years, so it can’t have been that successful.

The grid is such a great idea, so why isn’t it widespread already? Why am I still using my personal desktop computer to run my computations?

Personal experiences

I’ve had a bit of experience with grid computing myself. While developing GeneRecon, I needed a lot of computers to test the software (pretty time-consuming in itself, and I needed to explore a large parameter space), so I got access to NorduGrid. It was a horrible experience. Setting up the grid to run my own software was such a hassle, and it was never really worth the (limited) CPU cycles I got out of it.

Then I got access to the new Minimum intrusion Grid (MiG) developed by Brian Vinter’s group. That was an improvement over NorduGrid, and good enough to finish the GeneRecon experiments. See

Experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
Future Generation Computer Systems, 23:580–586, 2007. doi:10.1016/j.future.2006.09.003.

for details.

It was an improvement, but it wasn’t a great experience.

Running programs on MiG requires a lot of extra work. First, the input files must be uploaded to the grid, along with the executable for the program if I haven’t uploaded it already (and I’m ignoring the problems of figuring out where the executable can actually run and such). Then the job must be specified in a configuration language and submitted to the grid. While the job is executing, I have to poll it from time to time to get its status. When it is done, I have to download the output files and clean up after the job.
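To make the amount of ceremony concrete, here is roughly what that cycle looks like as a Python sketch. The client object and its methods are hypothetical stand-ins, not the actual MiG interface:

    import time

    # A sketch of the grid job cycle; "client" is a hypothetical
    # grid client object, not the real MiG API.
    def run_on_grid(client, executable, input_files, output_files):
        # 1. Upload the input files, and the executable if it isn't
        #    on the grid already.
        for path in input_files + [executable]:
            client.upload(path)
        # 2. Describe the job in the grid's configuration language
        #    and submit it.
        job_id = client.submit(executable=executable,
                               inputs=input_files,
                               outputs=output_files)
        # 3. Poll for status until the job has finished.
        while client.status(job_id) not in ("FINISHED", "FAILED"):
            time.sleep(60)
        # 4. Download the output files and clean up after the job.
        for path in output_files:
            client.download(path)
        client.clean(job_id)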

Compare that to just running the program on my own computer.

I have used MiG for a couple of projects now, but for day-to-day work it is just too much of a hassle.

Does it have to be so hard?

Why shouldn’t it be just as easy to run programs on the grid as on the desktop computer?

I know that if I want a distributed system with all the bells and whistles, it is a more complicated problem than writing single-machine applications. But for the cases where I just want enough computing power to fire off a few independent computations in parallel, there shouldn’t be any problem.

There is, but there shouldn’t be!

To access files, why should I need to up- and download? I should just be able to mount a file system in some appropriate way, right?

To run a program on the grid, couldn’t I just distribute it to another node when loading the program?

It probably isn’t quite that easy, but by wrapping my programs in proxy executables, I should be able to achieve something very similar, at least. I’ve actually played with such a system for MiG — called MyMiG — so I know that at least something in that direction can be achieved. It just needs a bit more work (which is reasonable, since my solution took a weekend to cook up).
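To sketch the proxy idea: the proxy has exactly the same command line as the real program, but it ships the work off to the grid and delivers the results locally. This is not MyMiG’s actual code; the grid module and the executable name below are made up for the illustration:

    #!/usr/bin/env python
    # Hypothetical proxy executable: invoked exactly like the real
    # program, but runs it on the grid instead of locally.
    import os
    import sys
    import grid  # a made-up grid client library

    def main():
        args = sys.argv[1:]
        # Treat any argument naming an existing file as an input
        # that must be shipped along with the job.
        inputs = [arg for arg in args if os.path.isfile(arg)]
        job = grid.submit(executable="generecon", arguments=args,
                          files=inputs)
        job.wait()                      # block, just as a local run would
        sys.stdout.write(job.stdout())  # relay the program's output
        grid.fetch(job.output_files())  # pull result files back here
        sys.exit(job.exit_code())

    if __name__ == "__main__":
        main()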

I realize that more complex distributed applications will need more work, but with XML-RPC and SOAP and whatnot, it shouldn’t be that much of a problem to get there.
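XML-RPC, at least, really is lightweight. Just as an illustration, a complete remote call takes only a few lines with Python’s standard library (the SimpleXMLRPCServer and xmlrpclib modules):

    # server.py: expose a function over XML-RPC.
    from SimpleXMLRPCServer import SimpleXMLRPCServer

    def add(x, y):
        return x + y

    server = SimpleXMLRPCServer(("localhost", 8000))
    server.register_function(add, "add")
    server.serve_forever()

    # client.py: call the remote function almost as if it were local.
    import xmlrpclib
    proxy = xmlrpclib.ServerProxy("http://localhost:8000/")
    print proxy.add(2, 3)  # prints 5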

With a proper grid setup, I could get the computing resources I need, and my desktop computer would only be needed to interact with my programs, not to run them. Actually, with a proper setup, I should be able to access my computations from any computer (desktop, laptop or even smart phone) everywhere.

Can we get there?

What will it take to get to that point? Do systems that work this way already exist? I’ve heard Xgrid mentioned, but I don’t really know anything about it. Does anyone know how it works?

Last week, Google announced that they would offer free storage of scientific data. Would it be too optimistic to think that within a year, some company will offer free grid computation resources? It doesn’t even have to be completely free: you could imagine a setup where you provide your “screensaver” CPU cycles (like SETI@home and similar projects) in exchange for access to the grid for your own tasks. With an open platform for this, shouldn’t the open source community then be able to build a great interface to it?