In a previous post I asked “Why are we teaching an introductory programming class for bioinformatics, when there is already an introductory programming class in the Dept. of Computer Science?” Below, I’ll try to answer that question.
A different approach to programming
The short answer is that the approach to programming is very different between computer science students and (real) science students. Computer science students consider programming something worth learning in itself, whereas other students often consider it a necessary evil they have to learn in order to work with the material they are really interested in.
This is perfectly understandable. If your interest is in biology, then it is the biological questions that you are interested in. Statistics and programming are necessary for analysing your data (more and more so as the types and the quantity of data change), but your main interest is not the statistics or the programming; it is the biology.
Bioinformatics students are probably somewhere in between computer science students and biology/medicine students. If you do not enjoy working with computers, bioinformatics is not the topic for you. If you do not care about the biological questions but only the algorithm design, software engineering, etc., you are better off in computer science than bioinformatics.
Anyway, in the class I will teach next term, about 60 of the students are neither bioinformatics students nor computer science students. They are studying medicine and just need some basic programming to be able to solve bioinformatics tasks in their “real” work.
Showing them “neat tricks” or clever design patterns is not the way to go.
One size doesn’t fit all
The kind of programming you need to learn depends a lot on what you want to do with your programs. If you are doing number crunching, you want to worry about numerical algorithms and such. If you are building real-time systems, time constraints and response times are everything. If you are building large software systems with millions of lines of code, the key thing is proper software engineering.
In Aarhus, we teach the computer science students to be a mix of “classical” computer scientists and software engineers / software designers. We have a lot of classes that are pure theoretical computer science — everything is done on blackboards and implementing anything is frowned upon — and we have a lot of classes concerning software architecture and such.
There isn’t really a market for pure theoretical computer science outside of academia here, so most of our students end up in jobs where designing and implementing large software systems is the main focus. The introductory programming class reflects this. There is the necessary basic programming, such as learning the control structures and a bit about data structures, and on top of that it is design patterns and the type system and such. The programming language is Java, probably because it is popular, statically typed and OO.
This is fine for computer science students. It is just their first programming class, and they will specialise in other classes.
I don’t think it is the right choice if it is the only programming class you take, and you want to use the programming for bioinformatics.
It isn’t the right choice for the physics or chemistry students either, who really should worry more about numerical algorithms (which are not covered in this class) and would probably be better off with a Matlab tutorial and some numerical analysis.
But physics and chemistry students are not my concern and not my problem…
Scripting and programming
Ignoring spreadsheets — which might be the most important tool for many analyses — I would guess that 90%+ of the programming tasks a bioinformatician needs to solve are what I would call “script programming”.
You write a program to automate a work-flow. You need to parse simple text files to extract relevant information. You combine programs in pipelines with small converter programs in between them, to translate the output format of one program into the input format of the next.
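To make this concrete, here is a minimal sketch of the kind of small converter I have in mind, written in Python. The input format is an assumption made up for the illustration: one record per line, with a sequence name and the sequence separated by a tab. The output is FASTA, and the function names are hypothetical.

```python
import sys


def tab_to_fasta(lines, width=60):
    """Turn 'name<TAB>sequence' lines into FASTA-formatted lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Assumed layout: the first two tab-separated fields are name and sequence.
        name, sequence = line.split("\t")[:2]
        yield ">" + name
        # Wrap the sequence so no output line is longer than `width`.
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]


def main(in_stream=sys.stdin, out_stream=sys.stdout):
    """Read records from in_stream and write FASTA to out_stream."""
    for out_line in tab_to_fasta(in_stream):
        out_stream.write(out_line + "\n")
```

Saved as a script with a call to `main()` at the bottom, something like this can sit between two programs in a pipeline, reading the output of the first on standard input and writing the input of the next on standard output. There is no class hierarchy and no clever algorithm, and that is the point.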
There is very little focus on this in the computer science programming class. There it is all about “proper” programming: designing the right class hierarchies, combining the right data structures, choosing the right algorithms for the task at hand… Worrying about IO is only a necessary evil, and one that is mostly ignored, and I doubt that there is any communication with other programs.
In scripting, the right data structure and the right algorithm are rarely much of a problem. If your scripts are much too slow, you worry about it, but more often than not, you are happy if they can do what they do in reasonable time. It is not worth the effort to speed them up.
The right structuring of the code isn’t that much of an issue either. Of course the code should be readable when you return to it after a few weeks or months, but you never worry about the grand design, since the program is pretty small anyway.
Sure, there are some applications where you need all the canons from computer science, but that is pretty rare in day-to-day life. If you need it, take a class at that time, or just give a computer scientist a Mars bar and a pizza to do it for you.
Learning it, just in case, is most likely just wasting your time.
The programming tasks in bioinformatics simply do not align with the skills taught in the introductory programming class in the Department of Computer Science, and that is why we need our own.
As for what goes into it, that is a topic for another day…