Automating scientific grunt-work

Monday, John Hawks asked: Will Wolfram make bioinformatics obsolete?

I was talking with a scientist last week who is in charge of a massive dataset. He told me he had heard complaints from many of his biologist friends that today’s students are trained to be computer scientists, not biologists. Why, he asked, would we want to do that when the amount of data we handle is so trivial?

Now personally I wouldn’t call the amount of data trivial, exactly, but it does pale compared to some physics experiments.

Yesterday, Daniel MacArthur (Genetic Future) responded with this:

I’d agree that biological data-sets can’t compete with particle physicists in terms of sheer scale, although the speed with which they are accumulating is alarming. Where biological data-sets really become intimidating is in their diversity, in the complexity of the underlying processes, and in the levels of noise and bias. I suspect a lot of people used to dealing with extremely large data-sets would still balk at the complexity of computational biology once they dug a little deeper, particularly in a few years’ time.

Which I fully agree with.

Anyway, back to John:

Now, you have to understand, to this person a dataset of 1000 whole genomes is trivial. He said, don’t these students understand that in a few years all the software they wrote to handle these data will be obsolete? They certainly aren’t solving interesting problems in computer science, and in a short time, they won’t be able to solve interesting problems in biology.

He then turns to Wolfram Alpha as an example of a computer system that could replace the need for programming skills with just plain English queries, thus alleviating the need for programming for biologists.

Now personally, I am very sceptical about this.  It sounds too much like a full AI to be true, but that is not the point I’m aiming at here.

Daniel makes the point that an expert system like this will only help so far:

That said, such tools and databases, however powerful, will always lag substantially behind the science. For young biologists who want to work right at the cutting edge – which will require dealing directly with rapidly changing technologies, generating biological data at an increasingly dizzying pace and in constantly evolving formats – solid informatic skills, including at least basic programming and sound statistical knowledge, will make you a far more productive scientist.

Of course programming languages will change and the scripts you write as a grad student will be forgotten within a year or two – that’s the nature of science (how many molecular biologists still run Southern blots?). The important thing is learning how to think about large-scale biological data: how to access, filter and manipulate it. Having basic programming expertise will make you more effective as a scientist right now, and it will also prepare you for a career in an increasingly data-driven field.

Yes, the important thing is to learn how to think about large-scale biological data! More importantly, how to think about it in a structured way.

And “in a structured way” essentially means with a healthy mix of biological insight, mathematical modelling and statistical evaluation of the data.

With large-scale data, this cannot be done “by hand” but requires computer support.

Getting a computer to analyse your data really requires structured thinking. Nothing punishes fuzzy thinking quite like a computer.

Of course, our computer systems improve year by year, and the kind of basic programming skills you might have learned five or ten years ago are now obsolete. If you attack basic statistical modelling with C or assembly programming, you are just doing it wrong.

This doesn’t mean that the basic skills you learn, when you learn how to program, are obsolete.  With improved computer systems and improved programming languages, you can work at a much higher level, but the essential structured thinking (plus basic testing and validation) is still just as important.
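To show what I mean by basic testing, here is a minimal sketch in Python. It is my own illustration, and the function and test names are made up for the example. The point is only that writing down what you expect, including the boring edge cases, forces the structured thinking I am talking about.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    # Any other character raises a KeyError: fuzzy input is rejected, not guessed at.
    return "".join(complement[base] for base in reversed(seq.upper()))


def test_reverse_complement():
    # Edge case: the empty sequence.
    assert reverse_complement("") == ""
    # A sequence that is its own reverse complement.
    assert reverse_complement("ACGT") == "ACGT"
    # A small hand-checked example.
    assert reverse_complement("AAGC") == "GCTT"
    # Validation by property: applying the function twice gives the input back.
    seq = "GATTACA"
    assert reverse_complement(reverse_complement(seq)) == seq


test_reverse_complement()
```

The language will change, but the habit of checking a hand-computed example and a general property carries over to whatever you use next.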

Just because we now have very powerful calculators doesn’t mean that it is a waste of time to study math.

I strongly feel that a little bit of computer science should be taught to all scientists, just as a bit of math and a bit of stats should be taught.  Not the low-level stuff.  Not C or Perl programming, but “essentials” of programming.  Just like you shouldn’t do hours and hours of sums to learn math.  That is just grunt work that should be left to our computer systems.

The basic computer science could be a bit on complexity (what can be done by a computer and what cannot; what can be efficiently done and what cannot); some basic programming (a single high-level language, just to get the feeling for programming; how to test programs); some numerical analysis (it doesn’t matter if your math is correct for real numbers if it is completely unstable when you work with floating point numbers); and some basic data structures and algorithms for everyday work.
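To make the floating point remark concrete, here is a minimal Python sketch. It is my own illustration with made-up function names and made-up data, not taken from any particular tool. The probability of many independent observations is a product, which is perfectly correct for real numbers but underflows to zero in double precision. Summing logarithms gives the same quantity in a form the computer can actually represent.

```python
import math


def likelihood_naive(probs):
    """Multiply probabilities directly; correct for real numbers."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_likelihood(probs):
    """Sum log-probabilities; the same quantity on a log scale."""
    return sum(math.log(p) for p in probs)


# Hypothetical data: 1000 independent observations, each with probability 0.01.
probs = [0.01] * 1000

# The true value is 1e-2000, far below the smallest positive double,
# so the naive product underflows to exactly 0.0.
naive = likelihood_naive(probs)

# The log-scale value is about 1000 * log(0.01) and is represented without trouble.
stable = log_likelihood(probs)

print(naive, stable)
```

Both functions implement the same math. Only one of them is usable once the data gets large, and you need a little numerical analysis to know which.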

If you already have a computer system that meets all your needs, you do not need this, of course. But what are the chances of having such a system available throughout your career? What happens when you get new types of data or new kinds of experiments?

With just a few computer skills, you can update your system and get back to your science. You can get the computer to do the grunt work again.

Without computer skills, it is all or nothing. Either you get all the answers you want from the system, or, if the system doesn’t quite meet your needs, you have to do it all manually.


Author: Thomas Mailund

My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.

4 thoughts on “Automating scientific grunt-work”

  1. You are very welcome :) It is all I use myself now… for the last couple of years, really.

    The Guile version was really a mistake … when I wrote the simulator, I just needed a very simple configuration language, but as the features kept creeping in it got more and more complicated, and (for me at least) the Python interface scales much better with that.
