Posts Tagged ‘R’

Uncertainty and randomness, part I

Sunday, August 9th, 2009

When I wrote about confidence intervals the other day, I was actually inspired to think about it from two papers from here:

The topic of those two papers is two kinds of uncertainty: epistemic uncertainty and aleatory uncertainty, where the former is uncertainty due to lack of knowledge while the latter is uncertainty due to randomness.

If we’re trying to estimate a parameter as in the confidence interval post, we are dealing with lack of knowledge.  The parameter is assumed fixed and not random at all, but we don’t know what it is so there is some uncertainty in our estimate.  The more data we have available, the better we can estimate the parameter, so it is possible for us to reduce the uncertainty by increasing our knowledge.

If we want to predict if the toss of a fair coin comes up head, we are dealing with randomness.  If it is a fair coin, then there is a 50/50 chance of the next flip coming up head, and no matter how many times we repeat the experiment the uncertainty about the outcome is the same.

I’m not going to write much more about this, since you are better off reading the papers anyway, but it got me thinking about uncertainty in flipping (un-)fair coins and that is what this post and the next is about.

The stuff below doesn’t really relate to epistemic or aleatory uncertainty, so you don’t have to read the papers to read on here.  It is just two different models of coin tossing and having some fun with that…

Unknown bias

First let us consider a scenario where we have a coin that we do not necessarily know if is fair or not, but we want to figure out the probability of tossing head.  So the model of the system is a simple Bernoulli distribution where we have a single parameter p and where we toss head with probability p and tail with probability 1-p.

If we toss the coin n times the likelihood of getting s heads is binomially distributed and the likelihood is proportional to p^s(1-p)^{n-s}.  The maximum likelihood estimate of p is s/n.

You can play around with that yourself with a bit of R code as below:

p <- 0.39
ns <- seq(1,100)
est <- sapply(ns, function(n) rbinom(1,n,p)/n)

opar <- par(mar=c(5,4,2,4)+0.1)
plot(ns, est, ylim=c(0,1),
     ylab='Maximum likelihood estimates (s/n)',
     xlab='Number of tosses')
abline(h=p)
mtext(paste(" p = ", p), at=p, side=4, las=1)
par(opar)

The parameter is fixed – in the example above as 0.39 – and the entire uncertainty is in the estimate.  That uncertainty decreases with the number of tosses – if you want to you could play around with confidence intervals and see this.  I won’t do this here but instead use credible intervals since I used confidence intervals in the previous post.

Anyway, as I said, the parameter is fixed but unknown.  In a “frequentist” analysis we would really consider the parameter fixed and consider the estimate random.  In a “Bayesian” analysis, on the other hand, we would still consider the parameter random.  We would just learn more and more of the distribution of it as we toss the coin.

Whether you can consider an unknown but fixed parameter (uncertainty due to lack of knowledge) as a stochastic variable (uncertainty due to randomness) is a philosophical question more than a mathematical one, and you are welcome to join in on the frequentist vs Bayesianist fight any day, I am sure.

I’m going to do a Bayesian analysis here, since I did a frequentist analysis in the confidence interval post, but I’ll be back to a frequentist analysis in the next post.

Priors and posteriors

Ok, so to analyse this system in a Bayesian way, we need to model the parameter p as a stochastic variable.  We do this by giving it a prior distribution \pi(p).  This should capture the uncertainty we have about p before we start tossing the coin.

In a frequentist analysis we have no way of quantifying our intuition about p, such as we expect it to be closer to 0.5 than to 0 or 1.

On the other hand, because we need to provide a prior, we have no way in a pure Bayesian analysis of claiming complete prior ignorance.  There is a concept of “uninformative priors” or “reference priors”, but claiming ignorance in Bayesian analysis is not as easy as you might think.  Anyway, that is way beyond the scope of this blog post, so I’ll be on with the analysis.

For mathematical convinience we prefer something called conjugate priors, and for Bernoulli distributions that is the beta distribution.  Don’t worry about what a conjugate prior is, it is not essential for the analysis here.

The beta distribution is parameterised with two “hyper parameters” \alpha and \beta and has density \pi(p)\propto p^{\alpha-1}(1-p)^{\beta-1} with normalisation constant 1/B(\alpha,\beta) where of course

B(\alpha,\beta)=\int_0^1 p^{\alpha-1}(1-p)^{\beta-1}\,\mathrm{d}p.

Depending on \alpha and \beta the beta distribution can take various shapes.  Whenever \alpha=\beta the density is symmetric and (when \alpha=\beta\geq 1) with mode at 0.5, so if your prior belief is that coins are likely to be unbiased, then you want equal \alpha and \beta.  The greater the parameters, the more “pointy” the density is around the mode, so the higher \alpha and \beta the stronger you make your prior belief about the distribution of p.  With \alpha=\beta=1 you get a uniform distribution.

You can try it out with this R code

plot.beta <- function(a,b) {
  p <- seq(0,1,length.out=100)
  d <- dbeta(p,a,b)
  plot(p,d,
       main=substitute(paste(alpha==a, " and ", beta==b),
                       list(a=a,b=b)),
       type='l', xlab='', ylab='')
}
opar <- par(mfrow=c(2,2), mar=c(2,2,2,2)+0.1)
plot.beta(1,1)
plot.beta(2,2)
plot.beta(4,3)
plot.beta(2,6)
par(opar)

The prior distribution is capturing the knowledge we have about p (or think we have about it) before we do any experiments by tossing the coin.  Once you’ve picked one that captures your intuition about how you would expect p to be, the trick is to use the experiments to change the distribution of p to now capture both your prior expectation and the knowledge you have gained about p from doing your experiments.

Let \mathrm{lhd}(s,n;p)\propto p^s(1-p)^{n-s} denote the likelihood of p when we observe s heads out of n tosses, i.e. \mathrm{lhd}(s,n;p) is the probability of s heads out of n tosses conditional on p.  The joint probability of s, n and p is then \mathrm{lhd}(s,n;p)\pi(p) and the posterior probability for p, P(p) is proportional to this P(p)\propto \mathrm{lhd}(s,n;p)\pi(p).  This is always the case; the posterior is always proportional to the prior times the likelihood, a result you might know as Bayes’ rule.

The posterior probability is interesting because it gives you a distribution for p that depends on the results of your experiment.

In many way this posterior does what the confidence intervals does in frequentist analysis.  It captures the parameter estimate and the uncertainty in it.  The difference here is that in the Bayesian approach the parameter really is considered a random parameter, distributed as the prior before the experiment and as the posterior after the experiment.

Okay, since \mathrm{lhd}(s,n;p)\propto p^s(1-p)^{n-s} and \pi(p)\propto p^{\alpha-1}(1-p)^{\beta-1} the posterior is P(p) \propto p^s(1-p)^{n-s}p^{\alpha-1}(1-p)^{\beta-1}=p^{s+\alpha-1}(1-p)^{n-s+\beta-1}.

So if we start with a \mathrm{beta}(\alpha,\beta) prior and observe s heads in n trials, we end up with a \mathrm{beta}(\alpha+s,\beta+n-s) posterior.  It is this nice relationship that makes the beta distribution a conjugate prior, in case you are wondering.

When coming up with the prior, you can therefore think of \alpha and \beta as “pseudo counts” of heads and tails.  Results you never actually saw, but they behave just as counts of head and tail.

We can try out the Bayesian analysis like this:

plot.analysis <- function(a,b,p,n) {

  # get prior
  x <- seq(0,1,length.out=1000)
  prior <- dbeta(x,a,b)

  # perform "experiment"
  s <- rbinom(1,n,p)

  # get likelihood
  lik <- dbeta(x,s,n-s)

  # get posterior
  post <- dbeta(x,a+s,b+n-s)

  # plot results...
  plot(x, post, main='Prior, likelihood and posterior',
       type='l', xlab='', ylab='')
  lines(x, lik, col='blue')
  lines(x, prior, col='red')
  abline(v=p, lty='dashed')
}

opar <- par(mfrow=c(2,2), mar=c(2,2,2,2)+0.1)
plot.analysis(2,2,0.39,5)
plot.analysis(2,2,0.39,10)
plot.analysis(2,2,0.39,50)
plot.analysis(2,2,0.39,100)
par(opar)

Here I’m plotting the prior – where I’ve chosen a beta(2,2) prior – in red, the likelihood in blue and the posterior in black.

When we have only done a few experiments, the prior will dominate and we see the posterior as somewhere in between the likelihood and the prior.  With many experiments, the likelihood dominates and the posterior resembles it more and more.  This is as it should be, considering the the more data we have from experimenting, the more we should trust the experiments over our previous intuition.

Credible intervals

With a posterior in hand, we have all the information about our parameter that we can get in a Bayesian framework, but the posterior is not always easy summarise (or visualise in high dimensions) so we can also reduce it to an interval in which the parameter is likely to be.

This is like the confidence intervals in frequentist analysis, but this time the parameter really is stochastic so we have a fixed interval, the credible interval, where there is a, say, 95% chance that the parameter falls in.

To get the 95% credible interval, we just get the 0.025 and 0.975 quantiles of the posterior.

qbeta(c(0.025,0.975),a+s,b+n-s)

If the experimental results were really fixed in stone (so the number of heads and number of tails were fixed) and the coin bias random, this would be a fixed interval where the parameter p would fall within in 95% of the cases.

# data  -- assumed fixed forever...
n <- 100
s <- 47

# prior
a <- 2 ; b <- 2

# credible interval
CI <- qbeta(c(0.025,0.975), a+s, b+n-s)

# samples of p
ps <- rbeta(100, a+s, b+n-s)

plot(ps, ylim=c(0,1), main='Credible interval and samples of p',
     ylab='Samples')
abline(h=CI, lty='dashed')

Of course, in this case this is really only something we pretend.  It is in fact p that is fixed and we can re-do the experiment whenever we want, and in that case we would see a fixed p and a random credible interval, so there really is very little difference between credible intervals and confidence intervals in practice…

# experiments
p <- 0.39
n <- 100
s <-
ns <- seq(1,100, by=5)
heads <- sapply(ns, function(n) rbinom(1,n,p))

# prior
a <- 2 ; b <- 2

# point estimate
modes <- mapply(function(n,s) (a+s)/(n+a+b), ns, heads)

# credible interval
lower <- mapply(function(n,s) qbeta(0.025,a+s,b+n-s), ns, heads)
upper <- mapply(function(n,s) qbeta(0.975,a+s,b+n-s), ns, heads)

plot(ns, modes, ylim=c(0,1),
     main='Point estimates and credible intervals',
     xlab='Number of tosses', ylab='Estimate')
lines(ns, lower)
lines(ns, upper)
abline(h=p, lty='dashed')

Ok, enough about this setup.

In a later post I will consider an actual random p parameter and see how we can estimate its distribution from coin tosses.  Stay tuned!

221-222=-1

True positives and false positives

Thursday, July 2nd, 2009

Following up on my rant on p-values, I want to say something about true and false positives in a classical statistical hypothesis test.

Such a test works as follows: we observe a value – assumed to be from some stochastic distribution – and calculate a p-value.  We then check if that p-value is below some significance level – typically 5% – and if it is we say that it is “significant” (or a positive) and if it is not we say that it is “not significant” (or a negative).

That is all there is to it.

Now, the p-values says something about the probability of false positives, those observations that are really sampled from the null distribution but are judged positives by the test.  With a 5% significance level, we expect that 5% of values truly sampled from the null distribution will be significant.  In other words, if we make a number of observations, and all our observations are from the null distribution, then we should get positives for about 5% of our tests.

Of course we do not really believe that all our observations are from the null distribution.  If we did, we wouldn’t make any tests.  What we actually believe is that we have some mixture of values from either the null distribution, P_0 or some alternative distribution, P_1, so our values are drawn from \pi_0P_0 + \pi_1P_1 where \pi_0 is the probability that the sample is from the null distribution and \pi_1=1-\pi_0 is the probability that it is from the alternative distribution.

An example: association mapping

In association mapping we test genetic markers, to figure out if any marker is associated with some a disease.  We do not believe that all markers are associated with the disease, nor do we believe that none of them are.  In the mixture above, \pi_0 would be the (prior) probability that any given marker is not associated with the disease, while \pi_1 would be the probability that it is associated with the disease.

We will simplify the setup a bit.  We will consider a marker where one allele is found in 40% of the population, say that those without that marker have a fixed disease risk while those with the marker has an increased risk, a genetic relative risk, GRR, such that if the wild types have risk r then the mutants (those with the 40% allele) have the risk GRR\cdot r.  We sample N individuals, categorise them according to allele and according to disease, and then make a \chi^2 test for association.

create.sample <- function(N, mutant.freq, wt.risk, GRR) {
  # Sample N individuals where N*mutant.freq are mutants.  Assign
  # cases and controls status based on the wildtype risk wt.risk and
  # the mutant risk based on the genetic relative risk GRR*wt.risk.
  mutant.status <- runif(N) < mutant.freq
  disease.status <- sapply(mutant.status,
                           function (m)
                           ifelse(m,
                                  runif(1) < GRR*wt.risk,
                                  runif(1) < wt.risk))
  ftable(mutant.status, disease.status)
}

data <- create.sample(2000,0.4,0.05,2)
chisq.test(data)

So here I sample 2000 individuals, say that the wildtype risk is 5% and that the mutants have a GRR of 2, so twice the risk of the disease.

Since this sample has a GRR of 2 – twice the risk for mutants – we are sampling from the alternative distribution.  So if the p-value is significant we have a true positive – a signal where there should be one – while if the p-value is not significant we have a false negative – no signal where there should be one.

If, instead, we sample with a GRR of 1 – the mutant risk is the same as the wild type – a significant value would be a false negative while a non-significant p-value would be a true negative.

With me so far?  Good.

Now we can set up the mixture from above.

sample.pvalue <- function(N,mutant.freq,wt.risk,GRR) {
  chisq.test(create.sample(N,mutant.freq,wt.risk,GRR))$p.value
}

sample.mixture <- function(true.risk, n, N, mutant.freq, wt.risk, GRR) {
  associated <- runif(n) < true.risk
  p.values <- sapply(associated,
                     function(t)
                     ifelse(t,
                            sample.pvalue(N,mutant.freq,wt.risk,GRR),
                            sample.pvalue(N,mutant.freq,wt.risk,1)))
  significant <- p.values < 0.05
  ftable(significant, true.or.false)
}

The real setup for association mapping is somewhat more complex, since there we never sample new individuals for each marker – so the tests are not independent – and the allele frequency varies for each marker and so on.  No matter, this setup is good enough for a blog.

Sampling true and false positives and negatives

Ok, now we can try sampling from the mixture.

First, we can try \pi_1=0.  That is, we can try sampling from a distribution of markers that are guaranteed not to be associated with the disease.

> sample.mixture(0, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE
significant
FALSE                       964
TRUE                         36

We only get “false” signals – since none of the markers are associated with the disease, but we still get some positives.  False positives, of course.  We expect about 5% – so 50 since we sample 1000 markers.  We get 36 instead of 50, since this is a random process, but you get the point.

If, instead, we sample with \pi_1=1 we only see associated markers, but we still test them, so some of them might end up being classified as negatives.  False negatives, of course.

> sample.mixture(1, 1000, 2000, 0.4, 0.05, 2)
            associated TRUE
significant
FALSE                       10
TRUE                       990

The significance level doesn’t tell you anything about the expected number of false negatives under the alternative distribution.  In a classical hypothesis test, the alternative distribution is completely ignored.  Only the null distribution matters.  This is one of the reasons I don’t particularly like this type of tests, but that is a different story.

The point is just, that even with all values “true” we still see some negatives.

With a mixture of associated and not associated markers, say 50%/50%, we expect both associated and not associated markers and both positives and negatives, of course.

> sample.mixture(0.5, 1000, 2000, 0.4, 0.05, 2)

 associated FALSE TRUE
significant
FALSE                       490    5
TRUE                         20  485

We want to find as many true positives as possible, while getting as few false positives as possible.  We can only see the signifcance status, of course, so we cannot tell which of the positives are true or false.

In the case above, we find as significant 485 of the associated markes – and miss five of them – and we get only 20 false positives.  Not bad, but then, 50% of the markers were true associations to begin with.

What if \pi_1=0.1?

> sample.mixture(0.1, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    876    2
TRUE                      38   84

or \pi_1=0.05?

> sample.mixture(0.05, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    893    2
TRUE                      49   56

We still do okay at picking up the associated markers.  In both cases we only miss two of them (but of course the actual number depends on the randomness in the sample).  We just see fewer of the true associations in the tests.

The ratio of true to false positives changes as well.  We still only expect 5% of the non associated markers to be significant, but there are just more of them now, and if we reduce the frequency of associated markers to 1% we now see more false than true positives:

> sample.mixture(0.01, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    953    0
TRUE                      43    4

In a real association mapping study, we don’t expect 1% of the markers to be associated with the disease.  If we test a million markers, we don’t expect more than a few tens to a few hundreds to really be true associations, so \pi_1 is pretty low indeed.

With a significance value of 5% the true positives would completely drown in the sea of false positives.

Power

Formally, statistical power is the probability of getting a significant p-value when the sample is from the alternative distribution.  It is the dual to the significance value.

The significance value doesn’t say anything about what happens to observations from the alternative distribution.  The power doesn’t say anything about what happens to values from the null distribution.

It just tells you how likely it is that you detect values from the alternative distribution as significant.

In our example above, the power is pretty good.  We recognize pretty much all of the associated markers as significant.

With a sample size of 2000, a high risk allele frequency and a relative risk of 2, we get a pretty strong signal.

Our only real problem is that we still detect a lot more false associations than true associations.

Boosting the power, which essentially amounts to increasing the sample size since that is the only variable we are in control of, doesn’t help on this at all!

If you increase the sample size, you still get 5% of the false associations as significant.  That is just how the significance value works.

The only way to reduce the number of false positives is to lower the significance value.  If you pick 1% you only get 1% of the false associations as significant.  With 0.1% you only get 0.1%.

This is why we do multiple test correction.

Multiple test correction

Multiple test correction really just means lowering the significance threshold.  There are different ways of doing it, but it pretty much all amounts to figuring out how much to reduce the significance threshold down to a level where you expect few false positives.

Let’s try it out.  We update our code

sample.mixture <- function(true.risk, n, sig.level, N, mutant.freq, wt.risk, GRR) {
  associated <- runif(n) < true.risk
  p.values <- sapply(associated,
                     function(t)
                     ifelse(t,
                            sample.pvalue(N,mutant.freq,wt.risk,GRR),
                            sample.pvalue(N,mutant.freq,wt.risk,1)))
  significant <- p.values < sig.level
  ftable(significant, associated)
}

and sample again.  Say with a significance level of 0.01%:

> sample.mixture(0.01, 1000, 0.0001, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    989    6
TRUE                       0    5

We reduce the number of false positives – which is good – but we also reduce the number of true positives – which is bad.

If we reduce the significance value, we also make it harder for samples from the alternative distribution to get p-values below the threshold.  We reduce the power.

The example above is not actually that bad.  We still find half of the true associations.  But then, a significance level of 0.01% is actually a bit high for a genome wide association study.  If, there, we test 1 million markers, and are willing to get around 1 false positives, we should use a significance value of one in a million.

> sample.mixture(0.01, 1000, 1e-6, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    986   12
TRUE                       0    2

The more tests you make, the lower a significance value you need, if you want to keep the expected number of false positives constant.

Of course, what you really want is not to keep the number of false positives fixed but rather to get a good ratio of false to true positives, but I’ll leave that for another post.

The point here is just that if you want to keep the number of false positives down, you will end up reducing your power as well.

Of course, now you have a reason for increasing the sample size.  That will make the alternative distribution less similar to the null distribution and therefor more likely to get small p-values.  It still won’t do anything to the distribution of p-values from the null distribution, but it will change the distribution of p-values from the alternative distribution.

I wanted to say something about the case where the “actual” null distribution is not really the mathematical null distribution, but this post is getting pretty long already, so I think I’ll leave that for another post, so stay tuned…

183-186=-3

Minimisation

Sunday, June 7th, 2009

Just for something to do on a lazy Sunday afternoon – where the only “excitement” was voting for the EU parliament (and the law concerning the succession to the Danish throne that won’t be relevant for the next two generations) – I played with R a little bit.

I should be reading three Master’s theses (I’m censor on three defenses later this month) but I got to bed too late yesterday and feel too tired and lazy to do a proper job on that…

Anyway, I decided to play with the gradient descent method for function minimisation. In my machine learning class I told my students that they should find a library for minimising functions rather than the simple gradient descent method, ’cause that is what I would normally do, but never being too keen on practising what I preach, I decided to experiment a bit with it.

Just for a function of a single value, to keep it simple, but still.

I picked the function f(x) = -x^3 + 10x^2 + 3x – that’s as good as any and I can analytically find the minima so I can easily check the results.  Well, one minimum is minus infinity, but I am not interested in that.

To get that minima, just set the derivative f'(x) = -3x^2 + 20x + 3 to zero and pick the negative solution (the positive is the local maximum).

> f <- function(x) -x**3 + 10*x**2 + 3*x
> df <- function(x) -3*x**2 + 20*x + 3
> roots <- polyroot(c(3,20,-3))
> roots
[1] -0.1467688+0i  6.8134355+0i

So with that out of the way, we can get started.

The Newton-Raphson method

The gradient descent method is a simplification of the Newton-Raphson method. The idea behind the latter is to do a Taylor expansion of f(x) around some local minimum x', f(x') \approx f(x)+f'(x)(x'-x) + \frac{1}{2}f''(x)(x'-x)^2 where we cut the expansion short after the quadratic term.

Take the derivative, with respect to x’-x, on both sides 0 \approx f'(x) + f''(x)(x'-x) and we get x' \approx x - \frac{f'(x)}{f''(x)}.  The goal now is to replace \approx with =, that is find x=x', and we do that iteratively, starting in some candidate x_0 and then at iteration i set x_i = x_{i-1}-\frac{f'(x_{i-1})}{f''(x_{i-1})}.

Repeat until you reach a fixed point, or at least until x_i is reasonably close to x_{i-1} or f'(x_i) is reasonably close to zero.

In R:

NewtonRaphson <- function(x0, df, ddf, n=100) {
  steps <- matrix(0, nr=n, nc=3)
  steps[1,] <- c(x0, df(x0), ddf(x0))
  x <- x0
  for (i in 2:n) {
    x <- x - df(x)/ddf(x)
    steps[i,] <- c(x, df(x), ddf(x))
  }
  steps
}

where I just run for n iterations and keep track of the derivatives along the way so I can inspect the results later.

This method has the nice property that if you start close to a minimum, the distance to that minimum drops off quadratically as a function of iterations, so it converges pretty quickly.

> ddf <- function(x) -6*x + 20
> nr.steps <- NewtonRaphson(0, df, ddf)
> head(nr.steps)
           [,1]          [,2]     [,3]
[1,]  0.0000000  3.000000e+00 20.00000
[2,] -0.1500000 -6.750000e-02 20.90000
[3,] -0.1467703 -3.129221e-05 20.88062
[4,] -0.1467688 -6.737721e-12 20.88061
[5,] -0.1467688  4.440892e-16 20.88061
[6,] -0.1467688 -8.881784e-16 20.88061

In this case, we are converged after only four steps if we start in x_0=0.

The gradient descent method

The gradient descent method doesn’t use the second derivative, but replaces \frac{1}{f''(x)} with a constant \lambda. Useful if you don’t have an expression for the second derivative (or if it is costly to compute).

The iteration goes like x_i = x_{i-1}-\lambda f'(x_{i-1}) and the R code looks like this:

GradientDescent <- function(x0, df, lambda, n=100) {
  steps <- matrix(0, nr=n, nc=2)
  steps[1,] <- c(x0, df(x0))
  x <- x0
  for (i in 2:n) {
    x <- x - lambda*df(x)
    steps[i,] <- c(x, df(x))
  }
  steps
}

From the Newton-Raphson run above, we saw that f''(x) was pretty close to 20, so we can try running with \lambda=1/20:

> gd.steps <- GradientDescent(0, df, 1/20)
> head(gd.steps)
           [,1]          [,2]
[1,]  0.0000000  3.000000e+00
[2,] -0.1500000 -6.750000e-02
[3,] -0.1466250  3.003328e-03
[4,] -0.1467752 -1.321765e-04
[5,] -0.1467686  5.819939e-06
[6,] -0.1467688 -2.562555e-07

where again we converge pretty quickly.

Of course, if we don’t know f''(x) close to the minimum, or if it varies significantly around the minimum, we might not be so lucky that we can pick a good \lambda.

and if we take too large steps, say \lambda=1/10 we might not converge at all but keep over-shooting in the update.

One possible solution is to use a small \lambda but add a momentum to the iteration step, so it becomes x_i=x_{i-1}+m_i where m_i=-\lambda f'(x) + \mu m_{i-1} and where 0\leq\mu\leq 1.  For \mu=0 this is just the plain old gradient descent method, but for \mu>0 the idea is that if we take a large step, the next one should probably also be large so the momentum increases, and when we start making smaller steps it decreases again.

GradientDescentMomentum <- function(x0, df, lambda, mu, n=50) {
  steps <- matrix(0, nr=n, nc=3)
  steps[1,] <- c(x0, df(x0), 0)
  x <- x0
  m <- 0
  for (i in 2:n) {
    m <- -lambda*df(x) + mu*m
    x <- x + m
    steps[i,] <- c(x, df(x), m)
  }
  steps
}

Of course, now we are left with two rather than one parameter to the method that we need to come up with a guess for.

We still have the problem that we won’t converge if we make a guess for \lambda that is too large, and in my experiments we do not gain a whole lot from the momentum if we pick it too small either.

Plus the \mu term has to be chosen carefully as well

so I’m a bit underwhelmed by this idea.

Ok, enough computer time for me today, I need to rest my RSI, so I’ll stop here…

158-161=-3

Playing some more with Bayesian linear regression

Monday, May 25th, 2009

Ok, so I decided to play a bit more with Bayesian linear regression.  I was aiming for something that would give me a smaller variance around x values where I have a lot of samples, and since I could not get that with a “proper linear model” I hacked it about a bit.

I’m using the same data as before

year <- 1971:1980
nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7)
dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0,  17.6,  46.6, 57.0)
rainfall <- data.frame(cbind(year, nov, dec))

but a different model.  One that looks a bit more like a kernel method (that is how I get the desired feature), but is still formulated as a linear regression model.

The notation and math is taken from Bishop Pattern recognition and machine learning.

The model

The main difference is the way the relationship between the explanatory variables, x_i, and dependent variables, y_i, is modelled.  The explanatory variables are not used directly, but mapped into a vector of “features”, \phi(x_i) that is then used in the linear model y_i \sim N( \mathbf{w}^T \phi(x_i), 1/\beta).  For convenience we work with the precision \beta rather than variance \sigma^2=1/\beta.

This is still a linear model, only it is linear in the weights we train, \mathbf{w}, and in the features \phi(x_i), but not in the explanatory variables x_i.

I’ve used five features: constant 1 to get an intercept, and four normal densities, all with the same variance but with means in 0, 50, 100, and 150, respectively.  That covers the range of x_i values nicely, so that should give me an okay feature set.

Implemented in R, the \phi(x) function is then:

phi <- function(x)
     c(1, dnorm(x, 0, 10), dnorm(x,50,10), dnorm(x,100,10), dnorm(x,150,10))

I guess there is no point in trying to interpret this model.  For the linear model yesterday, we can use the parameters to reason about the underlying system.  Not so for this arbitrary set of features.  This is a model that is only aimed at predicting new dependent variables; it is not trying to model some underlying physical reality we can reason about from the model.

Maximum likelihood predictions

Let \Phi be the matrix with a row per (x_i,y_i) in our data set, where row i is \phi(x_i).

Phi <- t(sapply(rain$nov, phi))

The maximum likelihood estimate of the weights is given by \mathbf{w}_{\mathrm{ML}} = \left(\Phi^T\Phi\right)^{-1}\Phi^T\mathbf{y} and the maximum likelihood of the precision is given by 1/\beta_{\mathrm{ML}}=\frac{1}{N}\sum_{i=1}^N (y_i-\mathbf{w}_{\mathrm{ML}}\phi(x_i))^2.

wML <- solve(t(Phi)%*%Phi) %*% t(Phi) %*% rain$dec
predictions <- sapply(rain$nov, function(x) t(wML) %*% phi(x))
beta = 1/(sum( (rain$dec - predictions)**2 )/N)

Given a new x value, calculating the best predictions for the corresponding y value or the 95% HDR is straight forward.

xs <- seq(0, 150, length.out=100)
ys <- sapply(xs, function(x) t(wML) %*% phi(x))
lines(xs, ys, col='darkgreen')
lines(xs, qnorm(0.025, ys, sqrt(1/beta)), col='green')
lines(xs, qnorm(0.975, ys, sqrt(1/beta)), col='green')

Below I’ve plotted this for five of the data points and the full data with 10 data points.  (There’s some numerical problems when I use the first five data points, so I’ve used the last 5, but you get the idea).

Notice that here the width of the HDR does not depend on the x value.  The variance is estimated from the entire data set and just used in each position.

Bayesian predictions

For the Bayesian approach I’ll assume that the precision is know (and equal to the \beta I get from the maximum likelihood estimate).  This is just because it makes it easier, since then I have all the equations I need from my text book.  Otherwise I’d have to integrate it out and get some t density instead of the normal, but I won’t bother.

With that assumption, I only need to worry about the weights, which will get the conjugate prior p(\mathbf{w}) \sim N(\mathbf{m}_0, \mathbf{S}_0).

I use a prior mean of zero, \mathbf{m}_0=\mathbf{0}, and the covariance matrix \mathbf{S}_0=\alpha^{-1}\mathbf{I}.

m0 <- c(0,0,0,0,0)
alpha <- 1e-10
S0 <- diag(1/alpha,5)

So \alpha is the precision of the prior.  The smaller the precision, the less influence the prior has on the estimate and for \alpha\to 0 the maximum posterior estimate will tend towards the maximum likelihood estimate.  I’ve picked a pretty small \alpha to get a prediction close to the ML estimate, since I am mainly interested in the HDRs here and not the point estimates.  You can play around with it yourself to see how \alpha affects the predictions.

The posterior distribution of the weights is p(\mathbf{w}\,|\,\mathbf{x},\mathbf{y}) = N(\mathbf{m}_N,\mathbf{S}_N) where

\mathbf{S}^{-1}_N = \mathbf{S}^{-1}_0 + \beta\Phi^T\Phi

\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}^{-1}_0\mathbf{m}_0 + \beta\Phi^T\mathbf{y}\right)

SN <- solve(solve(S0) + beta * t(Phi)%*%Phi)
mN <-SN %*% (solve(S0) %*% m0 + beta * t(Phi) %*% rain$dec)

This can be simplified a bit with the prior I’ve used, but let’s just use the general equations here.

For a new explanatory value, x, the corresponding dependent variable will be distributed as y \sim N(\mathbf{m}_N^T\phi(x), \sigma_N^2(x)) where \sigma^2_N=\frac{1}{\beta}+\phi(x)^T\mathbf{S}_N\phi(x), and from this we can get the maximum posterior prediction (the mode of the posterior, which is the mean for the normal distribution) and the 95% HDR

predict.mean <- function(x) t(mN) %*% phi(x)
predict.sd <- function(x) sqrt(1/beta + t(phi(x)) %*% SN %*% phi(x))
qpredict <- function(x,p) qnorm(p, predict.mean(x), predict.sd(x))

lines(xs, sapply(xs, predict.mean), col='blue')
lines(xs, sapply(xs, qpredict, p = 0.025), col='lightblue')
lines(xs, sapply(xs, qpredict, p = 0.975), col='lightblue')

Results

I’ve plotted both the ML and the Bayesian results below:

You’ll notice that I got what I wanted: the HDRs are wider away from the dense x regions.

Of course, this is a bit of a cheat.  I get this because of the way I map my x values to the features more than from a proper modelling of the certainty/uncertainty of the values in various x regions.  (With a proper kernel approach I would get it without cheating).

The cheat is obvious if you notice, on the bottom left plot, that the HDR width varies dramatically in the range from 60 to 100 where there really are no x values to cause this.

Worse still, because of the way I pick the features, the two points to the right are predicted exactly which of course is just overfitting.  The way the features combine, the densities with mean 100 and 150 are pretty much determined by these two points only and the information from the other points does not alleviate the overfitting.

Oh well, cheat or no cheat, it was a fun little exercise.

144-156=-12

Bayesian linear regression

Sunday, May 24th, 2009

I’m currently reading Bayesian Statistics: An introduction by Peter M. Lee, and today I’m reading chapter 6 on linear regression.

I decided to play around with one of the examples in R.

The model is just linear regression as we know and love it, that is we have explanatory variables x_i and dependent variables y_i where the latter are assumed normally distributed around a mean that depends linearly on x_i:

y_i \sim N(\alpha + \beta(x_i-\bar{x}), \phi)

where \alpha (intercept), \beta (slope), and \phi (variance) are model parameters.

Oh, \bar{x} is the mean of the x_i values, in case you are wondering, so the line is “centered” around \bar{x}.

The joint prior of the model parameters is the reference prior: p(\alpha,\beta,\phi)\propto 1/\phi and since we assume that the x_i values are always known and thus not stochastic, the posterior is

 p(\alpha,\beta,\phi\,|\,\mathbf{x},\mathbf{y}) \propto p(\alpha,\beta,\phi) p(\mathbf{y}\,|\,\mathbf{x},\alpha,\beta,\phi)

The slope

After a lot of arithmetic (buy the book if you want to check it) we get a marginal posterior distribution for the slope given by

\frac{\beta-b}{s/\sqrt{S_{xx}}} \sim \mathbf{t}_{n-2}

where n is the number of observations, b=S_{xy}/S_{xx}, S_{xx}=\sum(x_i-\bar{x})^2, S_{yy}=\sum(y_i-\bar{y})^2, S_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y}), s^2=S_{ee}/(n-2), S_{ee}=S_{yy}-S_{xy}^2/S_{xx}, and \mathbf{t}_{n-2} is the t-distribution with n-2 degrees of freedom.

Phewh.

It’s easy enough to code up in R, though.  The function

qslope <- function(p) b + qt(p,n-2)*(s/sqrt(Sxx))

computes the quantile corresponding to the p percentile, i.e. the q satisfying P(\beta \leq q) = p.

There’s an example in the book about predicting the rainfall in December based on the rainfall in November.  The data looks like this (when I type it into R):

year <- 1971:1980
nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7)
dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0,  17.6,  46.6, 57.0)
rainfall <- data.frame(cbind(year=year, nov=nov, dec=dec))

and the data points are plotted below:

In the book he just calculates the 50% high density region (HDR) for the slope for the entire data set, but I just tried calculating it for subsets (prefixes) to see how it changes as the data set grows.  I start with the three first years, since for n\leq 2 the division by n-2 is problematic.

The numbers I get are these:

n HDR.low      HDR.high
3 -1.408329    1.995358
4 -0.4224851   1.67113
5 -0.3280555   0.7114888
6 -0.4495798   0.4083625
7 -0.3618486   0.429319
8 -0.3159196   -0.1145006
9 -0.2189725   -0.03743869
10 -0.2454263  -0.07713291

or plottet:

So the HDR gets narrower as we get more data.  Nice.  And it seems that there is a negative correlation between rainfall in November and December, since the HDR is negative (below the dashed line in the plot).

A plain old linear model also sees a negative slope, but not significantly negative:

> summary(lm(dec ~ nov, rainfall))

Call:
lm(formula = dec ~ nov, data = rainfall)

Residuals:
    Min      1Q  Median      3Q     Max
-25.578  -8.542   3.682  10.231  14.628 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  50.1320     8.1620   6.142 0.000276 ***
nov          -0.1613     0.1191  -1.354 0.212773   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 13.86 on 8 degrees of freedom
Multiple R-squared: 0.1864,    Adjusted R-squared: 0.08472
F-statistic: 1.833 on 1 and 8 DF,  p-value: 0.2128

Whether the Bayesian analysis is “significant” depends on how you choose to do hypothesis testing; the HDR above just tells you that the 50% most likely slope is negative.  With a 95% HDR the interval contains zero:

> qslope(.025)
[1] -0.4359772
> qslope(.975)
[1] 0.1134179

so the interval is [-0.44,0.11].

Predictions

Given the observed data (\mathbf{x},\mathbf{y}) and a new predictor x_0 we can compute the density for the corresponding dependent variable.  From the model, of course, it must be y_0 \sim N(\gamma, \phi) with \gamma = \alpha+\beta(x_0-\bar{x}).  It is just that \gamma is unknown since \alpha and \beta are.

A bit of arithmetic, once again, gives us the following distribution for \gamma:

\gamma \sim N(a+b(x_0-\bar{x}), \phi\{1/n + (x_0-\bar{x})^2/S_{xx})

which becomes

\frac{\gamma-a-b(x_0-\bar{x})}{s\sqrt{1/n+(x_0-\bar{x})^2/S_{xx}}} \sim \mathbf{t}_{n-2}

or

qgamma <- function(x0, p=0.5)
    (a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1/n+(x0-mean.x)**2/Sxx))

in R, if we integrate \phi out.

Combining this \gamma with the normal distribution around it for y_0 just adds a little variance (one \phi to be exact) to the distribution

y_0 \sim N(\gamma, \phi\{1 + 1/n + (x_0-\bar{x})^2/S_{xx}\})

which, if we integrate over \phi, becomes

\frac{y_0-a-b(x_0-\bar{x})}{s\sqrt{1+1/n+(x_0-\bar{x})^2/S_{xx}}} \sim \mathbf{t}_{n-2}

or

qy0 <- function(x0, p=0.5)
    (a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1+1/n+(x0-mean.x)**2/Sxx))

in R.

Below, I’ve plotted the mean \gamma (black line), the 50% HDR for \gamma (blue lines) and the 50% HDR for y_0 (red lines) for the first 5 years and the full 10 years.

One thing that is obvious from the plots, but also if you look carefully at the expression for the distribution, is that the prediction distribution is more certain around the mean of the x_i values, \bar{x}.  It comes into the expression through the (x_0-\bar{x})^2 part of the variance.

I must admit that I am not particularly happy about that.

I mean, I can see that this is how the math goes, but I still don’t like it.

Since x_i is not stochastic we don’t really know how it should center around \bar{x}, so exactly how many observations do we expect to make around that point?  We cannot say!

Is there any particular reason that we should be more certain around \bar{x}?  Really only through how the model was constructed.  It is not something we get from the data.

I would much prefer if the variance was small in regions where we have observed a lot of x_i values and then larger in more sparse areas of the x axis.  Somehow that would capture the uncertainty in interpolating between areas where we have much information to areas where we have little.

This model is just not doing that, though.  It cannot, since the entire information we have about observed x_i values is captured only in their mean, \bar{x}.  It has to be that way from the model we use to related x_i to y_i.

This makes for nice simple mathematics, but maybe I would prefer something like a kernel method instead, for the reasons given above.  There, at least, the density of x_i values would have some impact on the certainty in my predictions.

143-154=-11