I wonder when they’ll tell me when to teach

The new term starts next week. I will teach a course on systems biology. I have no idea when! I don’t even know if it has been scheduled yet.

Classes at the computer science department are scheduled there. I teach two courses in that department, but neither of them this coming term. When the classes are scheduled, usually a week or two before the term starts, they are put on a web page: http://www.daimi.au.dk/courses/schedules/, so the schedule is always easy to find.

My remaining classes are scheduled somewhere else, but I don’t know where. I thought it was at the faculty of science, or at least that is what I’ve been told, but when I emailed the students’ office I was told that the classes were scheduled at the various departments. I know that we do not schedule the courses at BiRC, so now I wonder where my courses are scheduled, if at all.

On the faculty web-pages you have to be a bit inventive when searching for classes in bioinformatics. They put them under different (but I suspect random) departments. There is a heading called Bioinformatics, but that only contains the course descriptions. The class schedules are put all over the place. My last class was under biology, the one before under statistics, and so on.

I guess I should be grateful that the course descriptions are at least under bioinformatics. Previously they (but only they) were labelled “interdisciplinary”, and only on the Danish pages; bioinformatics wasn’t mentioned at all on the English pages.

Anyway, it usually takes a bit of web searching to find out when to teach (and forget about using the search feature on the faculty web pages; it has never managed to find what I’ve been searching for).

This time around, my search ended up on an empty page. Does that mean that the schedule hasn’t been made yet, or that I’ve found the wrong page? Who knows?

Until I find out when I’ll be teaching, I cannot plan my time for the coming weeks, and I do have a few meetings to schedule.

This blows!

Let’s kill desktop computing

I don’t want to get rid of the desktop computer. I like it. It is a nice interface for communicating with my computation tasks, not to mention messaging, emailing, blogging, image editing etc. It is just that I don’t want my desktop computer to run most of my computation tasks. The desktop computer should be for interacting with computations, but there is really no reason why it should also carry out those computations.

A typical situation for me is that either I have very light computational requirements — what is needed for text processing or maybe compiling a TeX document — or I need a lot of computer power — when I am running my data analysis in my research. I don’t think that this is atypical for scientists, at least it is a situation I share with most of the people at BiRC.

My desktop computers are not powerful enough to deal with my scientific computing — well, they are, but it takes ages to run on them — but they have plenty of power to spare when I am just doing “office work”.

Grid computing

The solution has been around for years and is called grid computing. Even way back when I was teaching networks and distributed systems at the computer science department, we would cover the ideas behind grid computing (back then it was really just client/server architectures, RPC and later RMI, distributed file systems etc., but the ideas that are now called grid computing were around).

What we really want is to connect all the computers on the net into one big honking system where we can get the computer power we need, when we need it, from all those machines that are idle anyway. On the rare occasions where we need a super computer, we want to be connected to that as well. Of course, we do not want to pay for having a whole network of computers just standing around waiting for us to need them — much less having a super computer sitting idle waiting for us — but when we need the computer power, we want to be connected to it.

SUN tried to sell the idea with the slogan “the network is the computer”. I don’t really know how well that went, but I haven’t heard the slogan for years, so it can’t have been that successful.

The grid is such a great idea, so why isn’t it widespread already? Why am I still using my personal desktop computer to run my computations?

Personal experiences

I’ve had a bit of experience with grid computing myself. While developing GeneRecon, I needed a lot of computers to test the software (testing was pretty time-consuming in itself, and I needed to explore a large parameter space), so I got access to NorduGrid. It was a horrible experience. Setting up the grid to run my own software was such a hassle, and it was never really worth the (limited) CPU cycles I got out of it.

Then I got access to the new Minimal intrusion Grid (MiG) developed by Brian Vinter’s group. That was an improvement over NorduGrid, and good enough to finish the GeneRecon experiments. See

Experiences with GeneRecon on MiG
T. Mailund, C.N.S. Pedersen, J. Bardino, B. Vinter, and H.H. Karlsen
Future Generation Computer Systems 2007 23 580–586. doi:10.1016/j.future.2006.09.003.

for details.

It was an improvement, but it wasn’t a great experience.

Running programs on MiG requires a lot of extra work. First, the input files must be uploaded to the grid, along with the executable for the program if I haven’t uploaded it already (and I’m ignoring the problems of figuring out where the executable can actually run). Then the job must be specified in a configuration language and submitted to the grid. While the job is executing, I have to poll it from time to time to get its status. When it is done, I have to download the output files and clean up after the job.
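Written out as code, that workflow looks roughly like the sketch below. I’m using a fake, in-memory grid client here; none of the class or method names come from the real MiG interface, they just mark how many steps stand between me and my results.

```python
import time

class FakeGrid:
    """Stand-in for a grid client. The real MiG has its own interface,
    which I am not reproducing here; this only mimics the steps."""
    def __init__(self):
        self.files = {}
        self.jobs = {}

    def upload(self, name, data=b""):
        self.files[name] = data

    def submit(self, job_spec):
        job_id = len(self.jobs)
        self.jobs[job_id] = "FINISHED"   # pretend the job ran instantly
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]

    def download_outputs(self, job_id):
        return ["output.dat"]

    def cleanup(self, job_id):
        del self.jobs[job_id]

def run_on_grid(client, executable, inputs):
    for name in inputs:                          # 1. upload every input file
        client.upload(name)
    client.upload(executable)                    # 2. upload the executable too
    spec = {"exe": executable, "inputs": inputs} # 3. write the job description
    job_id = client.submit(spec)                 #    ... and submit it
    while client.status(job_id) != "FINISHED":   # 4. poll until the job is done
        time.sleep(60)
    outputs = client.download_outputs(job_id)    # 5. download the output files
    client.cleanup(job_id)                       # 6. clean up after the job
    return outputs
```

Six steps, plus a polling loop, for what is a single command on my own machine.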

Compare that to just running the program on my own computer.

I have used MiG for a couple of projects now, but for day to day work, it is just too much of a hassle.

Does it have to be so hard?

Why shouldn’t it be just as easy to run programs on the grid as on the desktop computer?

I know, if I want a distributed system with all the bells and whistles, then it is a more complicated problem than writing single machine applications, but for the cases where I just want sufficient computer power to fire off a few independent computations in parallel, there shouldn’t be any problem.

There is, but there shouldn’t be!
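Locally, at least, firing off independent computations in parallel already is no problem; with Python’s standard library it is essentially a one-liner (`analyse` here is just a stand-in for a real analysis). This is the level of convenience I would like from the grid:

```python
from multiprocessing import Pool

def analyse(parameter):
    # stand-in for a real, CPU-heavy analysis task
    return parameter * parameter

if __name__ == "__main__":
    with Pool() as pool:                        # one worker per CPU core
        results = pool.map(analyse, range(10))  # run the tasks in parallel
```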

To access files, why should I need to up- and download? I should just mount a file system in some appropriate way, right?

To run a program on the grid, couldn’t I just distribute it to another node when loading the program?

It probably isn’t quite that easy, but by wrapping my programs in proxy executables, I should be able to achieve something very similar, at least. I’ve actually played with such a system for MiG — called MyMiG — so I know that at least something in that direction can be achieved. It just needs a bit more work (which is reasonable, since my solution took a weekend to cook up).

I realize that more complex distributed applications will need more work, but with XML-RPC and SOAP and whatnot, it shouldn’t be that much of a problem to get there.

With a proper grid setup, I could get the computer resources I need, and my desktop computer would only be needed to interact with my programs, not run them. Actually, with a proper setup, I should be able to access my computations from any computer (desktop, laptop or even smart phone), everywhere.

Can we get there?

What will it take to get to that point? Do systems already exist that work this way? I’ve heard Xgrid mentioned, but I don’t really know anything about it. Does anyone know how it works?

Last week, Google announced that they would offer free storage of scientific data. Would it be too optimistic to think that within a year some company will offer free grid computation resources? It doesn’t even have to be completely free: you could imagine a setup where you provide your “screensaver” CPU cycles (as with SETI@home and the like) in exchange for access to the grid for your own tasks. With an open platform for this, shouldn’t the open source community then be able to build a great interface to it?

Did insects kill the dinosaurs?

Here’s an interesting story: Insect Attack May Have Finished Off Dinosaurs.

Apparently, a lot of disease-carrying insects appeared around the time of the dinosaur mass extinction. If the dinosaurs’ immune systems were not up to the task of defending their hosts against these diseases, that might be what killed them all off.

“We can’t say for certain that insects are the smoking gun, but we believe they were an extremely significant force in the decline of the dinosaurs,” Poinar said. “Our research with amber shows that there were evolving, disease-carrying vectors in the Cretaceous, and that at least some of the pathogens they carried infected reptiles. This clearly fills in some gaps regarding dinosaur extinctions.”

Personally, I know nothing about dinos and cannot judge if this is a reasonable theory or not, but I did find it an interesting read.

A study of duplicate citations in Medline

In the latest issue of Bioinformatics, there’s a paper on duplicated publications:

Déjà vu—A study of duplicate citations in Medline

M. Errami et al.

Motivation: Duplicate publication impacts the quality of the scientific corpus, has been difficult to detect, and studies this far have been limited in scope and size. Using text similarity searches, we were able to identify signatures of duplicate citations among a body of abstracts.

Results: A sample of 62 213 Medline citations was examined and a database of manually verified duplicate citations was created to study author publication behavior. We found that 0.04% of the citations with no shared authors were highly similar and are thus potential cases of plagiarism. 1.35% with shared authors were sufficiently similar to be considered a duplicate. Extrapolating, this would correspond to 3500 and 117 500 duplicate citations in total, respectively.

They have gone text mining looking for significant (textual) overlap between papers, spotting both cases of plagiarism and of duplicated papers from the same authors.
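The paper’s text similarity searches are of course more sophisticated, but the basic idea can be illustrated with a toy word-overlap score. The Jaccard index below is my simplification, not the method from the paper:

```python
def jaccard(text_a, text_b):
    """Word-set overlap between two texts: 1.0 for identical word sets,
    0.0 for no words in common."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

jaccard("a b c", "a b d")  # 2 shared words out of 4 distinct: 0.5
```

Pairs of abstracts scoring above some threshold would then be flagged for manual inspection.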

Both situations are unethical. Plagiarism is plain old stealing. A scientist’s ideas are his or her most important contribution, so having someone else steal them is probably the worst thing that can happen to a scientist. At least there were only a few cases of plagiarism.

Duplicated publications are just annoying. It is annoying to discover, halfway through a paper, that I have already read it in a different journal. Of course, if it is an exact duplicate I will discover it earlier than halfway through, but on several occasions the paper was somewhat rewritten while the results were exactly the same as in a previous paper. The cited paper finds 1.35% duplicates among papers with shared authors, but how well their text mining spots duplicated results in slightly rewritten papers, I don’t know.

In any case, they only compare abstracts, and I don’t remember a case where I have spotted a duplication based on the abstract.

If you want to browse their discovered duplications, you can find their database here.

My own duplications

I have a few duplications myself, I must admit, but except for one case (which I’ll get back to below), they are journal special issue versions of conference contributions. What happens there is that a subset of the conference contributions is selected for journal publication (in most cases in a slightly extended version).

In such cases, where it is blindingly obvious that the paper is a duplication of a conference paper (the journal makes that very explicit), I don’t see any problem with duplications. The Bioinformatics paper agrees. Quoting from the paper:

While some duplications may be justified, arguably to promote wider dissemination or to provide important updates to clinical trials, surreptitious duplications that are covert and do not properly acknowledge the original work are unethical.

The last case of duplicated publications for me is the two papers

Algorithms for Computing the Quartet Distance between Trees of Arbitrary Degree
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of Workshop on Algorithms in Bioinformatics (WABI), 2005, LNBI 3692, pp. 77-88 © Springer-Verlag.

Quartet Distance between General Trees (extended abstract)
C. Christiansen, T. Mailund, C.N.S. Pedersen, and M. Randers
Proceedings of International Conference on Numerical Analysis and Applied Mathematics (ICNAAM) 2005, pp. 796-799 © Wiley-VCH Verlag GmbH & Co.

and there is a bit of a story behind this.

We first submitted to WABI, but then discovered an error in the paper that we couldn’t fix: it was in the time analysis of one of the algorithms, where we had claimed O(n²) but couldn’t get below O(n²d²). So we retracted the paper from WABI, fixed the analysis, and submitted to ICNAAM, where it was accepted as well. The retraction was ignored, however, despite several emails to the PC chairs, so in the end we had to submit a final version. Since the ICNAAM version is just an extended abstract and the WABI paper is full length, we figured we could justify this, but it is a bit borderline, I think.

Acceptable duplication?

Determining whether a paper is a duplicate based on text similarity alone is a bit unsafe, of course. I tend to describe the problems I am working on, related work, consequences, etc. in similar terms from paper to paper. I try to avoid phrasing them the same way, but it is hard not to, and I know several cases where the introduction sections of my papers read very similarly.

I personally don’t see a problem in this, if the results presented are novel, but I guess it is a bit borderline as well.

Smallest publishable increment

Something that annoys me more than duplicated publications, though, is papers describing tiny increments on existing results. These papers mean that you have to read four or five papers to get information that could easily be contained in a single one.

Now, some of this is unavoidable. If the authors get an idea after the original idea is published (this has happened to me a few times), the choice is either never to publish, or to publish a minor increment. But with some authors (I could name names, but I won’t), more than half the papers are tiny increments to previous ideas. This tells me that they either publish way too early, or that they willfully try to get as many publications out of as little thinking as they can.

Why bother?

Why would you publish the same results twice, or publish tiny increments?

It will boost the number of publications, but who cares about that? Even the silliest bureaucrats have figured out that what matters is impact.

If you want to boil impact down to a single number, so you can reduce the quality of research to something that is easily measured, you don’t use the number of publications. You pick something like the h-index. There, the number of publications matters, but only if people cite them. You are better off with 10 papers cited 10 times each than with 50 papers cited 5 times each.
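To make the comparison concrete, here is a minimal h-index computation: the h-index is the largest h such that h of your papers have at least h citations each.

```python
def h_index(citations):
    # Sort citation counts in decreasing order; h is the largest rank at
    # which the paper in that position still has at least that many citations.
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

h_index([10] * 10)  # 10 papers cited 10 times each: h = 10
h_index([5] * 50)   # 50 papers cited 5 times each: h = 5
```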

Duplicating publications doesn’t lead to greater impact. Citing the Bioinformatics paper again:

In the Duplicate/DA category, however, we observed that duplications were predominantly in journals with no impact factor and that these articles were rarely cited. If the primary value of a publication is to disseminate scientific findings and knowledge, it is not accomplished by publications in this category, so one must question the intent of the author of a Duplicate/DA publication.

In short, duplicated publications will not increase the impact, so why bother?

The citation, for Research Blogger:
Errami, M., Hicks, J.M., Fisher, W., Trusty, D., Wren, J.D., Long, T.C., Garner, H.R. (2007). Déjà vu: A study of duplicate citations in Medline. Bioinformatics, 24(2), 243-249. doi:10.1093/bioinformatics/btm574


Brian Vinter pointed this press release out to me: Denmark Creates Network for Gene Sequencing.

CLC bio and several prominent Danish research institutions have established SEQNET — a national network for developing a unique software platform for the analysis of data from the next generation sequencing technologies. The platform will integrate groundbreaking bioinformatics algorithms with a user-friendly and graphical user interface.

Apparently, my good old friend Roald Forsberg is involved:

Senior Scientific Officer at CLC bio, Dr. Roald Forsberg, states, “Next generation sequencing technologies, like 454, Solexa, or SOLiD are pushing a revolution in genetic analysis. Their massive throughput has given rise to a plethora of novel applications for DNA sequencing and has dramatically increased the ambitions of existing projects. However, handling the large amounts of fragmented data presents a great bioinformatics challenge to be dealt with before researchers can get the full value of these new technologies. Since DNA sequencing is becoming omnipresent in research we believe that the answer to this challenge is a unified next generation sequencing platform. In this network, we will make such a platform come together by combining our unique capacities for producing graphical user interfaces, algorithms and high performance computing solutions with the expertise of Denmark’s foremost researchers in the field.”

We have talked about the problems involved in dealing with data from the new high-throughput sequencing technologies a couple of times over lunch, but it seems Roald is moving faster than I am here. Good for him!

I look forward to seeing where this leads!