Posts Tagged ‘R’

Testing an O(n log n) running time

Tuesday, May 4th, 2010

This morning I noticed that this blog had been hit by this search term:

I'm not sure why it hit my blog since I have never written about that (as far as I recall), but I know that my current students have to worry about an \(O(n \log n)\) running time algorithm for their current project.

They have to implement an algorithm for finding tandem repeats, and the essential part of the algorithm is locating "branching" tandem repeats in time \(O(n \log n)\).  To validate that they get it right, I ask them to run some profiling and describe how they test this.  Perhaps it is my students googling to figure out how?

Well, depending on how you think about this, it is either very simple or very hard to validate an \(O(n \log n)\) running time.

What does the big-O notation really mean?

When we have an algorithm that runs in time \(O(n \log n)\), what we mean is that the worst case running time on input of size \(n\) is (eventually) dominated by \(\alpha \cdot n \log n\) for some constant \(\alpha\). So if we consider the running time for all input of size \(n\) and take the maximum of that, that is the worst case running time.  Let \(t(n)\) denote this.  Saying \(t(n)\in O(n\log n)\) simply means that there exists an \(\alpha\) such that \(\exists N\,:\,\forall n > N\,:\, t(n) \leq \alpha \cdot n \log n\).

Sometimes we don't want to argue about worst case running time. For example sometimes we worry more about the expected running time, then the \(t(n)\) function would take the expectation rather than the maximum over all input of size \(n\).  If this is the case, we make it explicit, so whenever we talk about \(O(n \log n)\) without any qualifier, we are talking worst case running time.

By this definition, \(t(n)\in O(n \log n)\) doesn't mean that \(t(n)\) is some function \(t(n) = \beta n \log n\) or anything like that.  The big-O notation only puts an upper bound on \(t(n)\), it doesn't say that this upper bound is tight or anything.  We have other notations for that, but you don't see it so much out of algorithmics circles.

Anyway, to simplify matters, I am going to assume for the following that \(n \log n\) is a tight bound, and by this I mean that \(t(n) = \beta \cdot n \log n + o(n \log n)\) for some constant \(\beta\).  Here the little-O notation means that there might be some extra time consumption in \(t(n)\) but that \(\beta \cdot n \log n - t(n)\) tends to zero as \(n\) tends to infinity.

Can you prove a certain running time with experiments?

For an \(O(n \log n)\) algorithm, we mathematically prove that the worst case running time is dominated by some \(\alpha n \log n\).  We don't experiment our way out of it.  But if we wanted to, could we?

Let's assume for a second that we don't have to worry about all input of size \(n\) to compute \(t(n)\) - and that's a big assumption in many cases, since many algorithms have different running time depending on what the actual input is, so the running time isn't really a function of just the size of the input.  If we really did have to worry about this we would need to run our program on all possible input of size \(n\) for each \(n\) we want to examine, and the combinatorial complexity of this would soon make it impossible to even attempt.

So let's assume that the running time only depends on the size of the input. Could we then distinguish between, say \(t(n) = \beta \cdot n \log n\) and \(t(n) = \beta \cdot n \log n + \delta \cdot n^2\)?

If \(\delta \ll \beta\) it is pretty hard!

ns <- seq(1,1000)
t1s <- 5 * ns * log(ns)
t2s <- 5 * ns * log(ns) + 1e-4 * ns**2

plot(ns, t1s, type='l', col='blue', xlab='n', ylab='t(n)')
lines(ns, t2s, col='red')

The two are pretty close!  And that is ignoring measurement errors that would muddy the picture further.

Of course, eventually, for some large \(n\), the \(\delta\cdot n^2\) term will dominate the \(\beta\cdot n \log n\) term, but for the range you measure it might not.

With your experiments you are best just validating that the algorithm, mathematically proved to be \(O(n \log n)\) was implemented correct and appears to be running in time \(\beta \cdot n \log n\) (but again remember that this is assuming that \(t(n)\) only depends on \(n\) and not the actual input and that \(t(n) = \beta \cdot n \log n + o(n \log n)\) i.e. that the bound is tight.

Validating an n log n running time

So how do you validate that \(t(n) = \beta \cdot n \log n\)?

Well, if \(t(n) = \beta \cdot n \log n\) then \(t(n)/\log n = \beta \cdot n\), so just plot \(t(n)/\log n\) and check if it looks like a line!

ns <- seq(1,1000)
ts <- 5 * ns * log(ns)

plot(ns, ts / log(ns), type='l', col='blue', xlab='n', ylab='t(n)/log(n)')

Simple, eh?

What about the base of the logarithm?

In the algorithm my students are implementing, the \(\log n\) in \(O(n \log n)\) isn't the natural logarithm.  It rarely is in computer science.  It isn't even the base-2 logarithm as it often is.  The base actually depends on the kind of input (but let's keep ignoring that).

What do you do when you don't know the base of the logarithm and you want to validate an \(O(n \log n)\) algorithm?  Do you need to divide \(t(n)\) with the right base log to get a line?

Of course not.  Remember \(\log_b x = \frac{\log_k x}{\log_k b}\), so if the running time is with a log base \(b\): \(t(n) = \beta \cdot n \log_b n = \frac{\beta}{\log_k b} \cdot n \log_k n\), and you plot it with a log base \(k\), then \(t(n)/\log_k n = \frac{\beta}{\log_k b} \cdot n\) is still a line.

ns <- seq(1,1000)
ts <- 5 * ns * log10(ns)                # base 10 logarithm

## plotting with base 2 logarithm
plot(ns, ts / log2(ns), type='l', col='blue', xlab='n', ylab=expression(t(n)/log[2](n)))

That is why, when we write the complexity as \(O(n \log n)\) we don't bother with the base.  All logs are the same under a big-O.

In practice, of course, the base of the logarithm matters as it affects the constant in front of the \(n \log n\), but for the big-O notation it doesn't, and for validating that your program runs in time \(O(n\log n)\) it doesn't either.

Is R an 'epic fail'?

Monday, April 26th, 2010

Is R an 'epic fail'?

Something as popular and widespread as R can hardly be called a 'failure' in any meaningful sense, so of course the question is really in which aspects R is inferior to alternatives.

For most users who need a bit of data analysis, it is probably a poor first choice. R is a programming language with a lot of statistical and data visualisation support, but it is a programming language.  If you don't want to do any programming, don't muck about with R!  There are lots of visualisation tools and statistical tools that are much easier to use.

Of course, without a bit of programming, you are limited to what those tools can do, so if you need analysis that is not provided, you need to either find a programmer or learn how to program, and for the latter, R isn't a bad choice.

You can get pretty far with very little effort in R, once you have learned how to program. Now learning how to program does require quite a bit of effort, but if you need to there really isn't any way around it.  Just like there isn't any Royal Road to mathematics (as Euclid is supposed to have said).

Sure, as a programming language R has its idiosyncrasies, but which programming languages do not?

Because the professionals are using it!

Thursday, April 22nd, 2010

Why use R?  Because that is what the professionals use!

The moral of the story is not that one should blindly use professionals’ tools; that can be just as bad as ignoring them. You don’t need to use Google or Facebook’s infrastructure to run your personal blog, just as developers shouldn’t employ the GPL simply because it’s the same license Linux or MySQL uses. But if you want a no-bullshit take on which technologies actually deliver, you could do a lot worse than watching what the professionals use.

R flashmob project

Friday, September 4th, 2009

The R Flashmob Project:

You are invited to take part in R Flashmob, the project that makes the world a better place by posting helpful questions and answers about the R statistical language to the programmer’s Q & A site stackoverflow.com

The event takes place on September the 8th at stackoverflow.com, follow the link above to find the time in your time zone.

--

247-255=-8

Uncertainty and randomness, part II

Monday, August 10th, 2009

Ok, so in my previous post I considered tossing a (biased) coin and trying to figure out the probability of heads over tails.  The probability of tossing heads, \(p\) is a property of the coin and fixed, it is just unknown, so there will be some randomness in estimating it due to the randomness in coin tosses we use to estimate.

In the Bayesian analysis I used, I treated the parameter \(p\) as stochastic, because that is how that kind of analysis works, but it never really was.  In this post I will consider a random \(p\).

I am going to consider an experiment where I am picking coins out of a bag, tossing them and recording whether they are heads or tails, and then throwing away the coin to pick another one.  I think that the coins are biased, but I don't know how.  Can I figure that out?

Formalizing this mathematically, I am assuming that I am drawing coins each with a \(p\) probability of coming up head, where the \(p\)s are beta distributed with (unknown) parameters \(\alpha\) and \(\beta\).  I get to toss each coin once to see if it turns up heads or tails, and then I have to throw the coin away, so for coin \(i\) I have a Bernoulli distribution with parameter \(p_i\).

a <- 3 ; b <- 4
n <- 100
outcomes <- rbinom(n,1,rbeta(1,a,b))
s <- sum(outcomes)

The task is to figure out the distribution of \(p\)s, which means estimating hyper parameters \(\alpha\) and \(\beta\).

Any ideas?

Doing the same as last time?

In the setup in my last post, we modelled the situation very similarly.  We drew a \(p\) at random and then tossed coins and got a distribution for \(p\) that way.  Could that work again?

We could pick a prior for \(p\), say \(\mathrm{beta}(2,2)\) as last time, calculate the posterior based on the number of heads out of \(n\) tosses and see if that tends toward the real distribution of \(p\)s.

plot.analysis <- function(prior.a,prior.b ,real.a,real.b, n) {
  x <- seq(0,1,length.out=1000)

  # real p distribution
  real <- dbeta(x,real.a,real.b)

  # get prior
  prior <- dbeta(x,prior.a,prior.b)

  # perform experiment
  s <- sum(rbinom(n,1,rbeta(1,real.a,real.b)))
  # get posterior
  post <- dbeta(x,prior.a+s,prior.b+n-s)

  # plot results...
  plot(x, post, main='Prior, likelihood and posterior',
       type='l', xlab='', ylab='')
  lines(x, real, col='green')
  lines(x, prior, col='red')
}

opar <- par(mfrow=c(2,2), mar=c(2,2,2,2)+0.1)
plot.analysis(2,2, 3,4, 5)
plot.analysis(2,2, 3,4, 10)
plot.analysis(2,2, 3,4, 50)
plot.analysis(2,2, 3,4, 100)
par(opar)

(Ups, the title for these plots are wrong ... there is the real distribution of \(p\)s but no likelihood ... but I really cannot be bothered to re-do the plot right now...)

Ok, clearly that didn't work.  Why?

For one thing, even if \(p\) is stochastic in this setup, there is only one \(p\), but in the setup we are modelling there is one for each coin toss.

By the very nature of the beta distribution, the more heads and tails we connect, the "pointy'er" it will become, so if the real distribution of \(p\)s is relatively flat, it will look less and less like it.

A simple maximum likelihood solution

Since all we have to do to estimate the distribution of \(p\)s is to estimate the hyper parameters \(\alpha\) and \(\beta\), we could just take the maximum likelihood approach.  We just need to work out the likelihood of our model and then maximize it with respect to the two.

If \(\pi(p;\alpha,\beta)\) is the distribution of \(p\)s, then the likelihood of \(p\)s and heads and tails is

\[\mathrm{lhd}(\alpha,\beta,p_1,\ldots,p_n) = \prod_i p_i^{h_i}(1-p_i)^{1-h_i} \pi(p_i)\]

where \(h_i\) is 1 if the ith coin toss is head and 0 otherwise.

We don't want to maximize with respect to all the \(p_i\)s - and with one coin toss for each we don't have a lot of information to do that anyway - but we can just integrate them out, and since they are independent that gives us

\[\mathrm{lhd}(\alpha,\beta) = \prod_i \int p_i^{h_i}(1-p_i)^{1-h_i} \pi(p_i;\alpha,\beta) \,\mathrm{d}p_i\]

or

\[\log\mathrm{lhd}(\alpha,\beta) = s \log \int p\pi(p;\alpha,\beta) \,\mathrm{d}p + (n-s)\log \int (1-p)\pi(p;\alpha,\beta) \,\mathrm{d}p\]

where \(n\) is the number of coin tosses and \(s\) the number of heads.

With a beta prior for \(p\)

\[\pi(p;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}\]

where \(B(\alpha,\beta)=\int p^{\alpha-1}(1-p)^{\beta-1} \,\mathrm{d}p\) we get

\[\int p\pi(p;\alpha,\beta) \,\mathrm{d}p = B(\alpha+1,\beta)/B(\alpha,\beta)\]

and

\[\int (1-p)\pi(p;\alpha,\beta) \,\mathrm{d}p = B(\alpha,\beta+1)/B(\alpha,\beta)\]

so the log-likelihood is

\[\log\mathrm{lhd}(\alpha,\beta) = s\log B(\alpha+1,\beta) + (n-s)\log B(\alpha,\beta+1) - n\log B(\alpha,\beta).\]

You can probably optimize this with respect to \(\alpha\) and \(\beta\) analytically, but I'm just going to use a numerical optimisation to try it out:

a <- 3 ; b <- 4
n <- 100
s <- sum(rbinom(n,1,rbeta(1,a,b)))
logL <- function(a,b, n,s) {
  s*lbeta(a+1,b) + (n-s)*lbeta(a,b+1) - n*lbeta(a,b)
}
search <- nlm(function (x) -logL(x[1],x[2],n,s), c(3,4))
mle.a <- search$estimate[1]
mle.b <- search$estimate[2]
opar <- par(mfrow=c(1,2))
xs <- seq(1,10,length.out=500)
ys <- sapply(xs, function(x) logL(x,4,n,s))
plot(xs,ys,type='l',xlab=expression(alpha),main='Likelihood',ylab='logL')
abline(v=mle.a)
mtext(mle.a,at=mle.a)
ys <- sapply(xs, function(x) logL(3,x,n,s))
plot(xs,ys,type='l',xlab=expression(beta),main='Likelihood',ylab='logL')
abline(v=mle.b)
mtext(mle.b,at=mle.b)
par(opar)

and we can plot the true \(p\) density to compare it with the inferred:

x <- seq(0,1,length.out=100)
true <- dbeta(x,3,4)
inferred <- dbeta(x,mle.a,mle.b)
plot(x,true,type='l',col='red',
     main='True and inferred distribution',
     xlab='', ylab='',
     ylim=range(true,inferred))
lines(x,inferred, col='blue')

You might get different results here; when I experimented with it I didn't always get results this close to the true values, but I picked a nice result for the plots here...

--

222-223=-1