Posts Tagged ‘programming’

My first dashboard widget

Tuesday, May 25th, 2010

These days I find myself converting between “coalescence units” (a time unit of 2Ne generations) and “substitution units” (time measured in expected number of substitutions) several times a day.   It is not really much of a problem to do, but my brain is simply not wired to do arithmetic in my head, so I usually fire up R or Python for this.

I am learning Objective-C and Cocoa these days, more for fun than anything else, but I figured that I could write a small unit conversion application to get some use out of it.  I mentioned this to Kasper, but he suggested that I write a dashboard widget instead.

I have never used Dashcode before, but I fired it up this morning before I headed for the office, to get a feeling for how it works, with the intention of writing a widget during the week.  It turns out it is extremely easy to work with, though, and within half an hour I had a complete conversion widget coded up.

The only thing I don’t quite like about it yet is that when I do the conversions it doesn’t present the results in the text field in scientific notation, so I get stuff like 1000000 instead of 1e6 or 0.0001 instead of 1e-4, which makes the conversions somewhat harder to read.

I’m sure there is a way to format the numbers, but my Javascript-fu is not up to it.  This widget is the first Javascript I have ever written.

Boost and Xcode, united through CMake

Friday, May 7th, 2010

A while back I complained about my problems with developing applications using Boost in Xcode.  Between then and now I had a long period where I couldn’t even get Boost compiled on my Mac.

Now, however, I’ve found that there is a CMake based distribution of Boost.  This is great since CMake can Xcode projects directly, which makes it easy to make sure the libraries build will work with projects I make directly in Xcode.  Which incidentally is also why I was happy to see Bio++ move to CMake.

My only problem now is that I have a bunch of old software based on Automake that uses Boost.  They don’t seem to play well together on my Mac.  I should port those to CMake, but first I need to learn how to use CMake, and time is not something I have plenty of at the moment.  Maybe this should be a project for the summer holiday.

Does anyone know of any scripts that can assist me in converting Autoconf/Automake configurations to CMake?

Testing an O(n log n) running time

Tuesday, May 4th, 2010

This morning I noticed that this blog had been hit by this search term:

I’m not sure why it hit my blog since I have never written about that (as far as I recall), but I know that my current students have to worry about an O(n \log n) running time algorithm for their current project.

They have to implement an algorithm for finding tandem repeats, and the essential part of the algorithm is locating “branching” tandem repeats in time O(n \log n).  To validate that they get it right, I ask them to run some profiling and describe how they test this.  Perhaps it is my students googling to figure out how?

Well, depending on how you think about this, it is either very simple or very hard to validate an O(n \log n) running time.

What does the big-O notation really mean?

When we have an algorithm that runs in time O(n \log n), what we mean is that the worst case running time on input of size n is (eventually) dominated by \alpha \cdot n \log n for some constant \alpha. So if we consider the running time for all input of size n and take the maximum of that, that is the worst case running time.  Let t(n) denote this.  Saying t(n)\in O(n\log n) simply means that there exists an \alpha such that \exists N\,:\,\forall n > N\,:\, t(n) \leq \alpha \cdot n \log n.

Sometimes we don’t want to argue about worst case running time. For example sometimes we worry more about the expected running time, then the t(n) function would take the expectation rather than the maximum over all input of size n.  If this is the case, we make it explicit, so whenever we talk about O(n \log n) without any qualifier, we are talking worst case running time.

By this definition, t(n)\in O(n \log n) doesn’t mean that t(n) is some function t(n) = \beta n \log n or anything like that.  The big-O notation only puts an upper bound on t(n), it doesn’t say that this upper bound is tight or anything.  We have other notations for that, but you don’t see it so much out of algorithmics circles.

Anyway, to simplify matters, I am going to assume for the following that n \log n is a tight bound, and by this I mean that t(n) = \beta \cdot n \log n + o(n \log n) for some constant \beta.  Here the little-O notation means that there might be some extra time consumption in t(n) but that \beta \cdot n \log n - t(n) tends to zero as n tends to infinity.

Can you prove a certain running time with experiments?

For an O(n \log n) algorithm, we mathematically prove that the worst case running time is dominated by some \alpha n \log n.  We don’t experiment our way out of it.  But if we wanted to, could we?

Let’s assume for a second that we don’t have to worry about all input of size n to compute t(n) – and that’s a big assumption in many cases, since many algorithms have different running time depending on what the actual input is, so the running time isn’t really a function of just the size of the input.  If we really did have to worry about this we would need to run our program on all possible input of size n for each n we want to examine, and the combinatorial complexity of this would soon make it impossible to even attempt.

So let’s assume that the running time only depends on the size of the input. Could we then distinguish between, say t(n) = \beta \cdot n \log n and t(n) = \beta \cdot n \log n + \delta \cdot n^2?

If \delta \ll \beta it is pretty hard!

ns <- seq(1,1000)
t1s <- 5 * ns * log(ns)
t2s <- 5 * ns * log(ns) + 1e-4 * ns**2

plot(ns, t1s, type='l', col='blue', xlab='n', ylab='t(n)')
lines(ns, t2s, col='red')

The two are pretty close!  And that is ignoring measurement errors that would muddy the picture further.

Of course, eventually, for some large n, the \delta\cdot n^2 term will dominate the \beta\cdot n \log n term, but for the range you measure it might not.

With your experiments you are best just validating that the algorithm, mathematically proved to be O(n \log n) was implemented correct and appears to be running in time \beta \cdot n \log n (but again remember that this is assuming that t(n) only depends on n and not the actual input and that t(n) = \beta \cdot n \log n + o(n \log n) i.e. that the bound is tight.

Validating an n log n running time

So how do you validate that t(n) = \beta \cdot n \log n?

Well, if t(n) = \beta \cdot n \log n then t(n)/\log n = \beta \cdot n, so just plot t(n)/\log n and check if it looks like a line!

ns <- seq(1,1000)
ts <- 5 * ns * log(ns)

plot(ns, ts / log(ns), type='l', col='blue', xlab='n', ylab='t(n)/log(n)')

Simple, eh?

What about the base of the logarithm?

In the algorithm my students are implementing, the \log n in O(n \log n) isn’t the natural logarithm.  It rarely is in computer science.  It isn’t even the base-2 logarithm as it often is.  The base actually depends on the kind of input (but let’s keep ignoring that).

What do you do when you don’t know the base of the logarithm and you want to validate an O(n \log n) algorithm?  Do you need to divide t(n) with the right base log to get a line?

Of course not.  Remember \log_b x = \frac{\log_k x}{\log_k b}, so if the running time is with a log base b: t(n) = \beta \cdot n \log_b n = \frac{\beta}{\log_k b} \cdot n \log_k n, and you plot it with a log base k, then t(n)/\log_k n = \frac{\beta}{\log_k b} \cdot n is still a line.

ns <- seq(1,1000)
ts <- 5 * ns * log10(ns)                # base 10 logarithm

## plotting with base 2 logarithm
plot(ns, ts / log2(ns), type='l', col='blue', xlab='n', ylab=expression(t(n)/log[2](n)))

That is why, when we write the complexity as O(n \log n) we don’t bother with the base.  All logs are the same under a big-O.

In practice, of course, the base of the logarithm matters as it affects the constant in front of the n \log n, but for the big-O notation it doesn’t, and for validating that your program runs in time O(n\log n) it doesn’t either.

Is R an ‘epic fail’?

Monday, April 26th, 2010

Is R an ‘epic fail’?

Something as popular and widespread as R can hardly be called a ‘failure’ in any meaningful sense, so of course the question is really in which aspects R is inferior to alternatives.

For most users who need a bit of data analysis, it is probably a poor first choice. R is a programming language with a lot of statistical and data visualisation support, but it is a programming language.  If you don’t want to do any programming, don’t muck about with R!  There are lots of visualisation tools and statistical tools that are much easier to use.

Of course, without a bit of programming, you are limited to what those tools can do, so if you need analysis that is not provided, you need to either find a programmer or learn how to program, and for the latter, R isn’t a bad choice.

You can get pretty far with very little effort in R, once you have learned how to program. Now learning how to program does require quite a bit of effort, but if you need to there really isn’t any way around it.  Just like there isn’t any Royal Road to mathematics (as Euclid is supposed to have said).

Sure, as a programming language R has its idiosyncrasies, but which programming languages do not?

On code and comments…

Wednesday, April 21st, 2010

I’ve never been a big fan of comments in code.  Mainly because I too often have seen comments explaining the trivial and ignoring the complex…

In most cases, clear code eliminates the need for comments, as discussed here.

I used to think commenting my code was the responsible thing to do. I used to think that I should have a comment for just about every line of code that I wrote. After my first read of Code Complete, my views changed pretty drastically.

I began to value good names over comments. As my experience has increased, I have realized more and more that comments are actually bad.

Actually, Code Complete has a more nuanced discussion on commenting code, but still…

Comments are often not needed, because they just rephrase what you can already read in the code. If at all possible, make the code easier to read rather than explain it in code.

When comments are needed, they explain design decisions that are not obvious from the code. Then there is too often the risk that the design has changed since the comment was written and that is really worse than no comment.

Still, it is when it comes to design decisions that I often miss documentation. Especially when it comes to complex class hierarchies and object interactions where there is clearly some underlying design decisions about how the objects are suppose to interact and how new classes should be added to the hierarchy to extend the code.

I rarely find that stuff documented, though.  At best I am told that for function add(a,b), “a and b are input” and “add(a,b) returns a+b” or something obvious like that…  or that the class “AbstractVisitor” is an abstract visitor class.  Duh!

I would love it if people would stop commenting the obvious but start explaining their design decisions…