Testing an O(n log n) running time

This morning I noticed that this blog had been hit by this search term:

I'm not sure why it hit my blog since I have never written about that (as far as I recall), but I know that my current students have to worry about an \(O(n \log n)\) running time algorithm for their current project.

They have to implement an algorithm for finding tandem repeats, and the essential part of the algorithm is locating "branching" tandem repeats in time \(O(n \log n)\).  To validate that they get it right, I ask them to run some profiling and describe how they test this.  Perhaps it is my students googling to figure out how?

Well, depending on how you think about this, it is either very simple or very hard to validate an \(O(n \log n)\) running time.

What does the big-O notation really mean?

When we have an algorithm that runs in time \(O(n \log n)\), what we mean is that the worst case running time on input of size \(n\) is (eventually) dominated by \(\alpha \cdot n \log n\) for some constant \(\alpha\). So if we consider the running time for all input of size \(n\) and take the maximum of that, that is the worst case running time.  Let \(t(n)\) denote this.  Saying \(t(n)\in O(n\log n)\) simply means that there exists an \(\alpha\) such that \(\exists N\,:\,\forall n > N\,:\, t(n) \leq \alpha \cdot n \log n\).

Sometimes we don't want to argue about worst case running time. For example sometimes we worry more about the expected running time, then the \(t(n)\) function would take the expectation rather than the maximum over all input of size \(n\).  If this is the case, we make it explicit, so whenever we talk about \(O(n \log n)\) without any qualifier, we are talking worst case running time.

By this definition, \(t(n)\in O(n \log n)\) doesn't mean that \(t(n)\) is some function \(t(n) = \beta n \log n\) or anything like that.  The big-O notation only puts an upper bound on \(t(n)\), it doesn't say that this upper bound is tight or anything.  We have other notations for that, but you don't see it so much out of algorithmics circles.

Anyway, to simplify matters, I am going to assume for the following that \(n \log n\) is a tight bound, and by this I mean that \(t(n) = \beta \cdot n \log n + o(n \log n)\) for some constant \(\beta\).  Here the little-O notation means that there might be some extra time consumption in \(t(n)\) but that \(\beta \cdot n \log n - t(n)\) tends to zero as \(n\) tends to infinity.

Can you prove a certain running time with experiments?

For an \(O(n \log n)\) algorithm, we mathematically prove that the worst case running time is dominated by some \(\alpha n \log n\).  We don't experiment our way out of it.  But if we wanted to, could we?

Let's assume for a second that we don't have to worry about all input of size \(n\) to compute \(t(n)\) - and that's a big assumption in many cases, since many algorithms have different running time depending on what the actual input is, so the running time isn't really a function of just the size of the input.  If we really did have to worry about this we would need to run our program on all possible input of size \(n\) for each \(n\) we want to examine, and the combinatorial complexity of this would soon make it impossible to even attempt.

So let's assume that the running time only depends on the size of the input. Could we then distinguish between, say \(t(n) = \beta \cdot n \log n\) and \(t(n) = \beta \cdot n \log n + \delta \cdot n^2\)?

If \(\delta \ll \beta\) it is pretty hard!

ns <- seq(1,1000)
t1s <- 5 * ns * log(ns)
t2s <- 5 * ns * log(ns) + 1e-4 * ns**2

plot(ns, t1s, type='l', col='blue', xlab='n', ylab='t(n)')
lines(ns, t2s, col='red')

The two are pretty close!  And that is ignoring measurement errors that would muddy the picture further.

Of course, eventually, for some large \(n\), the \(\delta\cdot n^2\) term will dominate the \(\beta\cdot n \log n\) term, but for the range you measure it might not.

With your experiments you are best just validating that the algorithm, mathematically proved to be \(O(n \log n)\) was implemented correct and appears to be running in time \(\beta \cdot n \log n\) (but again remember that this is assuming that \(t(n)\) only depends on \(n\) and not the actual input and that \(t(n) = \beta \cdot n \log n + o(n \log n)\) i.e. that the bound is tight.

Validating an n log n running time

So how do you validate that \(t(n) = \beta \cdot n \log n\)?

Well, if \(t(n) = \beta \cdot n \log n\) then \(t(n)/\log n = \beta \cdot n\), so just plot \(t(n)/\log n\) and check if it looks like a line!

ns <- seq(1,1000)
ts <- 5 * ns * log(ns)

plot(ns, ts / log(ns), type='l', col='blue', xlab='n', ylab='t(n)/log(n)')

Simple, eh?

What about the base of the logarithm?

In the algorithm my students are implementing, the \(\log n\) in \(O(n \log n)\) isn't the natural logarithm.  It rarely is in computer science.  It isn't even the base-2 logarithm as it often is.  The base actually depends on the kind of input (but let's keep ignoring that).

What do you do when you don't know the base of the logarithm and you want to validate an \(O(n \log n)\) algorithm?  Do you need to divide \(t(n)\) with the right base log to get a line?

Of course not.  Remember \(\log_b x = \frac{\log_k x}{\log_k b}\), so if the running time is with a log base \(b\): \(t(n) = \beta \cdot n \log_b n = \frac{\beta}{\log_k b} \cdot n \log_k n\), and you plot it with a log base \(k\), then \(t(n)/\log_k n = \frac{\beta}{\log_k b} \cdot n\) is still a line.

ns <- seq(1,1000)
ts <- 5 * ns * log10(ns)                # base 10 logarithm

## plotting with base 2 logarithm
plot(ns, ts / log2(ns), type='l', col='blue', xlab='n', ylab=expression(t(n)/log[2](n)))

That is why, when we write the complexity as \(O(n \log n)\) we don't bother with the base.  All logs are the same under a big-O.

In practice, of course, the base of the logarithm matters as it affects the constant in front of the \(n \log n\), but for the big-O notation it doesn't, and for validating that your program runs in time \(O(n\log n)\) it doesn't either.

Tags: , ,

2 Responses to “Testing an O(n log n) running time”

  1. Ras Says:

    > just plot t(n)/log n and check if it looks like a line!
    Well what if the plot behaved like a sqrt-function or a log-function .. would that mean that the system was not O(nlogn)? Big-O is an upper bound, so a line or anything asymptotically smaller than a line would be a 'validation'.

  2. Thomas Mailund Says:

    Absolutely. I also say that above, before simplifying the example to make it a tight bound, but you are of course right that any O(n) plot of t(n)/log(n) would validate t(n) in O(n log n)

Leave a Reply