Posts Tagged ‘script programming’

Python lectures

Saturday, September 27th, 2008

Thinking more about teaching Python, I googled for Python lectures and found these videos here and here.

Profiling Python

Tuesday, September 23rd, 2008

I’ve spent today writing some scripts to analyse a genome. I have about 2 gigabases of alignment, so extracting information from it can be pretty slow.  The cluster here at BiRC is also in heavy use these days, so I cannot gain much from parallelising the analysis.

Psyco and JIT compiling

I had hoped that I could gain the speed I needed using Psyco.  I’ve been lucky with that in the past, getting about a 10x speed-up.

Psyco is a JIT (just-in-time) compiler: it actually compiles the code (generates machine code) rather than interpreting it.

With languages such as C++ you need to compile the code more or less manually before you can run it.  That translates the high-level code into machine code that can actually run on the machine.

In Java you also compile the high-level code, but there you compile it into something called byte code that cannot actually run on the machine.  Instead the byte code can be interpreted in a Java Virtual Machine (JVM).  This is how you get their “compile once, run everywhere”.

This is actually also what Python does; you probably just haven’t noticed, because Python does it automatically. When it loads a script, it first translates it into byte code and then it interprets that byte code.  If you load a module, you might have noticed a .pyc file.  That is the byte code for that module.  But you only see these files for modules, ’cause Python doesn’t write the byte code to file for your scripts.
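The two steps can be made visible from within Python itself. The following is only a minimal sketch using the built-in `compile` function and the standard `dis` module; the one-line source string and the name `<example>` are made up for illustration, and it is written for a current Python 3 rather than the Python 2.5 used elsewhere in this post.

```python
import dis

# A tiny piece of source code; its content is only an illustration.
source = "total = sum(x * x for x in range(10))"

# Step 1: translate the source text into a code object holding byte code.
code = compile(source, "<example>", "exec")

# Step 2: let the interpreter execute that byte code.
namespace = {}
exec(code, namespace)
print(namespace["total"])  # prints 285

# List the byte code instructions the interpreter actually runs.
dis.dis(code)
```

The exact instructions printed by `dis.dis` differ between Python versions, so the listing is best read as a rough picture of what the interpreter works on.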

For modern virtual machines, like JVM, the byte code is not just interpreted.  The machine will analyse how the program executes and compile important functions — the hot spots — into machine code for faster execution.  This is what is called just in time compilation (sometimes just too late, but that is a bit cruel).

This, by the way, is also the technology running under the hood of Google’s Chrome browser.  There, the virtual machine runs JavaScript, but the idea is essentially the same as for the JVM.

Lars Bak, the lead engineer behind that virtual machine, explains it here:

To stray a bit from the main point I can mention that I had a class some years ago taught by Lars here in Aarhus.  Many of the people now working with Lars had the same class, so that was a successful class.  He is going to give it again this year, so don’t you just wish you were studying computer science at our university right now? ;-)

Anyway, back to Python.  Here you can also get JIT compiling using the module psyco.  All you have to do is add these two lines to your program:

import psyco
psyco.full()

You can get finer control over the JIT compilation by doing a little more, but this is the simplest way to get it going, and usually it pays off nicely.  It is a bit memory hungry, so it isn’t always the thing to do, but more often than not, it is.
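As a sketch of what “a little more” might look like: Psyco also offers `psyco.bind`, which compiles a single function instead of the whole program. The function `count_gc` below is a made-up example, not part of my scripts, and the import is guarded so the script still runs (just interpreted) where Psyco is not installed.

```python
def count_gc(sequence):
    """Count G and C bases with an explicit loop (hypothetical example function)."""
    count = 0
    for base in sequence:
        if base in "GC":
            count += 1
    return count

try:
    import psyco
    # Compile only the chosen function rather than everything.
    psyco.bind(count_gc)
    have_psyco = True
except ImportError:
    # Psyco is not available for this interpreter; run the plain Python version.
    have_psyco = False

print(count_gc("ACGTGGCA"))  # prints 5
```

Binding individual functions keeps the extra memory use down, at the cost of having to know in advance which functions are worth compiling.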

You get the most out of it when your code is actually doing something serious.  Lots of loops and branches and such.  Typical algorithmic code.

If you are mainly calling built in functions, you do not gain as much.  Those are already compiled, typically, so they are pretty optimised as it is.
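One way to see the difference is to time an explicit loop against the equivalent built-in. This is only a rough sketch with made-up names (`loop_sum`, `data`), written for a current Python 3; the actual timings depend entirely on the machine and interpreter, so none are shown here.

```python
import timeit

data = list(range(100000))

def loop_sum(values):
    """Add up the values with an explicit Python loop."""
    total = 0
    for v in values:
        total += v
    return total

# Both approaches must give the same answer before timing them is meaningful.
assert loop_sum(data) == sum(data)

# Time 20 repetitions of each; the built-in runs in compiled C code.
loop_time = timeit.timeit(lambda: loop_sum(data), number=20)
builtin_time = timeit.timeit(lambda: sum(data), number=20)
print("explicit loop: %.4f s" % loop_time)
print("built-in sum:  %.4f s" % builtin_time)
```

The explicit loop is the kind of code a JIT compiler can speed up; the built-in call leaves it little to work with.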

For my scripts, I am mainly scanning through the alignment and updating statistics in a bunch of tables, and I didn’t gain more than maybe a factor of two from psyco.  Not really enough for the speed I need.

Profiling Python

Before you go ahead and paint “go faster stripes” on your code, you should always profile it.  You will be surprised at where the majority of time is spent.

In Python you have three modules for profiling: profile, cProfile and hotshot.  The first two have the same functionality, but cProfile is written in C and is faster.

I typically use cProfile.  The hotshot module is not maintained (according to the Python documentation) and the documentation certainly isn’t, so I have never quite figured out how to use it.  I don’t have to either, though, ’cause cProfile does all I need it to.

With cProfile you can load the module in and have fine control over what you profile, but I find it the easiest just to profile the entire script.  You can do this by loading the module directly when you call your script, like this:

python -m cProfile your-script.py

After your program has finished, cProfile will write a summary of where the time was spent in the run.
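For the finer control mentioned above, a minimal sketch could look like this. It assumes a current Python 3, and the function `scan` is just a hypothetical stand-in for the expensive part of a script. Only the code between `enable()` and `disable()` is profiled.

```python
import cProfile
import io
import pstats

def scan(n):
    """Stand-in for the expensive part of a script (hypothetical)."""
    return sum(i % 7 == 0 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()       # start collecting timings here ...
result = scan(200000)
profiler.disable()      # ... and stop here

# Collect the summary in a string buffer: sorted by internal time, top 5 entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("time").print_stats(5)
print(buffer.getvalue())
```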

If you want to analyse the result in more detail, you can write the profiler’s analysis to a file

python -m cProfile -o profile.report your-script.py

that you can then read into Python using the module pstats

>>> import pstats
>>> p = pstats.Stats('profile.report')

and you can then sort the report using various criteria.  For example, to get the top 10 runtime sinners, you can do:

>>> p.sort_stats('time').print_stats(10)
Tue Sep 23 15:09:35 2008    profile.report
         7932274 function calls in 23.290 CPU seconds
   Ordered by: internal time
   List reduced from 119 to 10 due to restriction <10>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.303   10.303   23.276   23.276 extract-statistics.py:2(<module>)
  1314684    7.036    0.000   11.282    0.000 extract-statistics.py:47(is_singleton)
  5258736    4.246    0.000    4.246    0.000 extract-statistics.py:50(<genexpr>)
   377622    1.187    0.000    1.215    0.000 /usr/lib/python2.5/site-packages/bx_python-0.5.0-py2.5-linux-i686.egg/bx/align/core.py:125(column_iter)
   862137    0.391    0.000    0.391    0.000 extract-statistics.py:40(is_informative)
     9991    0.028    0.000    0.028    0.000 {range}
    39960    0.022    0.000    0.022    0.000 /usr/lib/python2.5/site-packages/bx_python-0.5.0-py2.5-linux-i686.egg/bx/align/core.py:193(__init__)
        1    0.014    0.014   23.290   23.290 {execfile}
    10989    0.014    0.000    0.014    0.000 extract-statistics.py:53(windows)
     4995    0.010    0.000    0.013    0.000 extract-statistics.py:28(<genexpr>)

Here the topmost line is the entire script, and that is using all the time.  Not really a surprise.  The next three or four lines tell me where I should put my optimisation efforts.
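The report has two time columns: tottime is the time spent in the function itself, and cumtime also includes the functions it calls. A sketch of a few other `pstats` calls that can help here, assuming a current Python 3: since the report file from my run is not available to the reader, the example first creates a small report of its own from a made-up function `work`.

```python
import cProfile
import os
import pstats
import tempfile

def work():
    """Hypothetical workload, only here to produce something to profile."""
    return sorted(str(i) for i in range(50000))

# Create a small report file (a stand-in for profile.report).
path = os.path.join(tempfile.mkdtemp(), "profile.report")
cProfile.runctx("work()", globals(), locals(), filename=path)

p = pstats.Stats(path)
p.strip_dirs()                              # shorten long paths in the output
p.sort_stats("cumulative").print_stats(5)   # order by time including callees
p.print_callers("work")                     # show who called functions matching "work"
```

Sorting by cumulative time tends to point at the high-level function to rethink, while sorting by internal time points at the inner loop to tune.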

(If you are now wondering why I bother optimising something that takes 23 CPU seconds, then congratulations on paying attention to the report and being sceptical about when to bother optimising.  I am only running my scripts on a very tiny fraction of my data for the profiling, since anything else would take much too long.  For the real data, optimising really is needed.)

Why do we need a separate class for programming in bioinformatics?

Friday, September 19th, 2008

In a previous post I asked “Why are we teaching an introductory programming class for bioinformatics, when there is already an introductory programming class in the Dept. of Computer Science?”  Below, I’ll try to answer that question.

A different approach to programming

The short answer is that the approach to programming is very different between computer science students and (real) science students. Computer science students consider programming something worth learning in itself, whereas other students often consider it a necessary evil they have to learn in order to work with the material they are really interested in.

This is perfectly understandable. If your interest is in biology, then it is the biological questions that you are interested in.  Statistics and programming are necessary for analysing your data, more and more so as the types and the quantity of data change, but your main interest is not the statistics or the programming; it is the biology.

Bioinformatics students are probably somewhere in between computer science students and biology/medicine students.  If you do not enjoy working with computers, bioinformatics is not the topic for you.  If you do not care about the biological questions but only the algorithm design, software engineering, etc. you are better off in computer science than bioinformatics.

Anyway, in the class I will teach next term, about 60 of the students are neither bioinformatics students nor computer science students.  They are studying medicine and just need some basic programming to be able to solve bioinformatics tasks in their “real” work.

Showing them “neat tricks” or clever design patterns is not the way to go.

One size doesn’t fit all

The kind of programming you need to learn depends a lot on what you want to do with your programs. If you are doing number crunching, you want to worry about numerical algorithms and such. If you are building real-time systems, time constraints and response time are everything.  If you are building large software systems with millions of lines of code, the key thing is proper software engineering.

In Aarhus, we teach the computer science students to be a mix of “classical” computer scientists and software engineers / software designers.  We have a lot of classes that are pure theoretical computer science — everything is done on blackboards and implementing anything is frowned upon — and we have a lot of classes concerning software architecture and such.

There isn’t really a market for pure theoretical computer science outside of academia here, so most of our students end up in jobs where designing and implementing large software systems is the main focus.  The introductory programming class reflects this.  There is the necessary basic programming, such as learning the control structures and a bit about data structures, and on top of that it is design patterns and the type system and such.  The programming language is Java, probably because it is popular, statically typed and OO.

This is fine for computer science students.  It is just their first programming class, and they will specialise in other classes.

I don’t think it is the right choice if it is the only programming class you take, and you want to use the programming for bioinformatics.

It isn’t the right choice for the physics or chemistry students either. They really should worry more about numerical algorithms (which are not covered in this class) and would probably be better off with a Matlab tutorial and some numerical analysis.

But physics and chemistry students are not my concern and not my problem…

Scripting and programming

Ignoring spreadsheets — which might be the most important tool for many analyses — I would guess that 90%+ of the programming tasks a bioinformatician needs to solve are what I would call “script programming”.

You write a program to automate a work-flow.  You need to parse simple text files to extract relevant information.  You combine programs in pipelines with small converter programs in between them, to translate the output format of one program into the input format of the next.
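As an illustration of such a small converter, here is a sketch that turns a tab-separated table of names and sequences into FASTA. The input layout (name, tab, sequence, with `#` comment lines) is an assumption made up for the example, as is the function name `table_to_fasta`.

```python
def table_to_fasta(lines):
    """Yield FASTA lines from tab-separated 'name<TAB>sequence' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, sequence = line.split("\t")[:2]
        yield ">" + name
        yield sequence

# A small in-memory example of the assumed input format.
example = ["# name\tsequence\n", "seq1\tACGT\n", "seq2\tGGCA\n"]
for out in table_to_fasta(example):
    print(out)
```

Passing `sys.stdin` instead of the example list would turn this into a filter that can sit between two programs in a pipeline.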

There is very little focus on this in the computer science programming class.  There it is all about “proper” programming: designing the right class hierarchies, combining the right data structures, choosing the right algorithms for the task at hand… Worrying about IO is only a necessary evil, and one that is mostly ignored, and I doubt that there is any communication with other programs.

In scripting, the right data structure and the right algorithm are rarely much of a problem.  If your scripts are much too slow, you worry about it, but more often than not, you are happy if they can do what they do in reasonable time.  It is not worth the effort to speed them up.

The right structuring of the code isn’t that much of an issue either.  Of course the code should be readable when you return to it after a few weeks or months, but you never worry about the grand design, since the program is pretty small anyway.

Sure, there are some applications where you need all the cannons from computer science, but it is pretty rare in day-to-day life.  If you need it, take a class at that time, or just give a computer scientist a Mars bar and a pizza to do it for you.

Learning it, just in case, is most likely just wasting your time.

The programming tasks in bioinformatics simply do not align with the skills taught in the introductory programming class in the Department of Computer Science, and that is why we need our own.

As for what goes into it, that is a topic for another day…

Teaching programming in bioinformatics

Tuesday, September 16th, 2008

Next term, I am teaching a class called applied programming, which is just a fancy title for script programming.

I finally got out of teaching programming a year or so ago.  I don’t really like to teach this topic.  I love programming, but I find it very hard to teach.  It is something you only learn by doing, and the typical university setup here is lectures.  That simply doesn’t work with programming classes. Period.

Anyway, the university has started a new bachelor’s program, molecular medicine, and its students need to learn a lot of bioinformatics.  For that, they need some basic programming, and I have to teach them that together with Søren Besenbacher.

Teaching programming

In some sense, all skills you learn, you learn by “doing”.  Even if what you are doing is just thinking about it.

Asking students “to think about it” is not the way to go, though.  Trust me on that.

You need some techniques to start thinking.  If you discuss a topic with your friends — construct arguments for your case — you are forced to think about the topic.  If you have to present a topic to your class, you are forced to think about the topic.

When you teach a topic, you are really forced to think about it.

Programming, you really have to think about. You cannot fool the computer.  If you don’t know how to solve a problem, there is no bullshitting the computer into believing that you can.

So in a sense, it should be easy to teach programming. Give people a problem and let them solve it.  There is no cheating, and as long as the problem highlights the points you wish to teach, the students will learn it.

Why do I still find it hard to teach programming then?

Lecturing on programming

One problem is the way we typically teach here.  Teaching is very much based on lecturing followed by practical exercises with TAs.

It is mostly a practical matter. For practical exercises, you need small groups, and we would spend all our time on teaching if we had to do all the exercises ourselves.  Thus the TAs.  But you cannot leave the entire class to TAs and small groups, so we have the lectures to cover the broad topics and make sure that all the groups are keeping up with the teaching plan.

In many classes, this setup works fine, but I don’t think it helps much to lecture on programming.  The lectures in such a class can easily end up being a waste of time.

I can show you all the language constructs of Python on a PowerPoint slide, but that alone does nothing for teaching you how to use them.  It is absolutely worthless if you need to solve a problem using Python.

With 50+ students, I’m probably stuck with the lectures, but how do you structure lectures so you actually teach something useful for programming?

Teaching problem solving

Programming is, more than anything else, problem solving. So how do you teach people how to solve problems?  You show them how you do it yourself!

This is where it gets tricky, I think.  You are not showing people how you solve problems by writing down the problem and then the solution.  You need to show them how you think when you are making your way from problem to solution.  And you need to do it very slowly!

When you are experienced, you make leaps of intuition when you solve a problem.  You recognize something you have solved before and — probably without thinking about it — you immediately think of a solution that worked before.  It is very hard to avoid this.

You need to slow down when solving the problems.  You don’t want to dumb down, though!  You need to solve the problem without your experience, not without your wits. The students are less experienced, they are not stupid, and you won’t keep them interested if you are thinking slower than they are.

Ideally, you want the solutions you come up with this way to be the same you would come up with if you were using all your experience.  I personally hate it when textbooks come up with solutions you would never see in the real world, just because they haven’t introduced all the techniques you need to get the “right” solution; only enough to actually solve the problem, but in a roundabout and non-idiomatic way.

To avoid this, you need to come up with problems that not only can be solved with the techniques seen so far, but where the ideal solution only uses those techniques. I’m still trying to figure out how to do this…

Anyway, these are the ideas I am throwing around right now for how to approach lecturing on programming again.

Refusing to help (or “give them what they need, not what they want”)

So you show the students how to solve a problem, and then you give them some problems to solve themselves.  This is where the TAs come into play.  The idea is that the students try to solve the problems themselves or in small groups, and then can meet with the other students in larger groups, maybe 15-20 people, together with a TA to discuss problems and solutions.

For small classes, I’ve been doing both the lecturing and the solution discussions.  My experiences here are not so pleasant.

First of all, it is very hard to get the students to actually try to solve the problems.  The computer scientists are the exception. In the previous script programming classes there were always a few computer scientists.  Those are already interested in programming and, to be fair, already know how to do it.  The non-computer science students rarely made an effort.

There is, of course, a major difference in being interested in programming and in considering programming a necessary tool for some other problem that is the real problem.

So there is a bigger problem in motivating the problems, but I don’t think that is the full story.

Once the students figure out that they can just show up at the meetings with the TA and then the TA will show them how to solve the problems they got stuck on, they will simply stop trying to solve anything themselves.  At the first sign of trouble, they just stop.  They do not try to work around the problems, or try to figure out why there is a problem in the first place.

This is probably the worst thing they can possibly do.  Programming is all about solving a larger problem by working around a big pile of smaller problems.  You get stuck all the time, and need to get yourself “unstuck”.  With experience you will do this faster and faster, and often manage to avoid a lot of the small problems in the first place, but you need to get this experience by actually solving the small problems.

When people ask me how I would solve a given problem, I tend to tell them.  This makes me the worst TA ever in a situation like this.  I have tried and tried, with more success the older I’m getting, not to help too much.

It is essential that you don’t help a student who could actually solve the problem himself if he just worked a little harder and a little longer.  We learn from our own experience, not from that of others.

What’s so special about programming in Bioinformatics?

What are we doing teaching programming anyway?

The Department of Computer Science teaches an introduction to programming class that is mandatory for half the student programs at the Faculty of Sciences.  So why do we have our own introductory programming class for bioinformatics (and related student programs)?

This is a good question, and there has been some debate over it. If only it wasn’t me who is supposed to teach the class, I would be a strong proponent of it ;)

Kidding aside, I do think there are strong arguments for having a different introductory programming class.

It is getting late, and I have a paper to read before going to bed, so I will leave those arguments for another post, though.

Applied programming?

Well, I didn’t choose this title for the class.  I wanted script programming, but apparently that didn’t sound serious. Go figure.