Playing some more with Bayesian linear regression
Monday, May 25th, 2009Ok, so I decided to play a bit more with Bayesian linear regression. I was aiming for something that would give me a smaller variance around
values where I have a lot of samples, and since I could not get that with a “proper linear model” I hacked it about a bit.
I’m using the same data as before
year <- 1971:1980 nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7) dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0, 17.6, 46.6, 57.0) rainfall <- data.frame(cbind(year, nov, dec))

but a different model. One that looks a bit more like a kernel method (that is how I get the desired feature), but is still formulated as a linear regression model.
The notation and math is taken from Bishop Pattern recognition and machine learning.
The model
The main difference is the way the relationship between the explanatory variables,
, and dependent variables,
, is modelled. The explanatory variables are not used directly, but mapped into a vector of “features”,
that is then used in the linear model
. For convenience we work with the precision
rather than variance
.
This is still a linear model, only it is linear in the weights we train,
, and in the features
, but not in the explanatory variables
.
I’ve used five features: constant 1 to get an intercept, and four normal densities, all with the same variance but with means in 0, 50, 100, and 150, respectively. That covers the range of
values nicely, so that should give me an okay feature set.
Implemented in R, the
function is then:
phi <- function(x)
c(1, dnorm(x, 0, 10), dnorm(x,50,10), dnorm(x,100,10), dnorm(x,150,10))
I guess there is no point in trying to interpret this model. For the linear model yesterday, we can use the parameters to reason about the underlying system. Not so for this arbitrary set of features. This is a model that is only aimed at predicting new dependent variables; it is not trying to model some underlying physical reality we can reason about from the model.
Maximum likelihood predictions
Let
be the matrix with a row per
in our data set, where row
is
.
Phi <- t(sapply(rain$nov, phi))
The maximum likelihood estimate of the weights is given by
and the maximum likelihood of the precision is given by
.
wML <- solve(t(Phi)%*%Phi) %*% t(Phi) %*% rain$dec predictions <- sapply(rain$nov, function(x) t(wML) %*% phi(x)) beta = 1/(sum( (rain$dec - predictions)**2 )/N)
Given a new
value, calculating the best predictions for the corresponding
value or the 95% HDR is straight forward.
xs <- seq(0, 150, length.out=100) ys <- sapply(xs, function(x) t(wML) %*% phi(x)) lines(xs, ys, col='darkgreen') lines(xs, qnorm(0.025, ys, sqrt(1/beta)), col='green') lines(xs, qnorm(0.975, ys, sqrt(1/beta)), col='green')
Below I’ve plotted this for five of the data points and the full data with 10 data points. (There’s some numerical problems when I use the first five data points, so I’ve used the last 5, but you get the idea).
Notice that here the width of the HDR does not depend on the
value. The variance is estimated from the entire data set and just used in each position.
Bayesian predictions
For the Bayesian approach I’ll assume that the precision is know (and equal to the
I get from the maximum likelihood estimate). This is just because it makes it easier, since then I have all the equations I need from my text book. Otherwise I’d have to integrate it out and get some t density instead of the normal, but I won’t bother.
With that assumption, I only need to worry about the weights, which will get the conjugate prior
.
I use a prior mean of zero,
, and the covariance matrix
.
m0 <- c(0,0,0,0,0) alpha <- 1e-10 S0 <- diag(1/alpha,5)
So
is the precision of the prior. The smaller the precision, the less influence the prior has on the estimate and for
the maximum posterior estimate will tend towards the maximum likelihood estimate. I’ve picked a pretty small
to get a prediction close to the ML estimate, since I am mainly interested in the HDRs here and not the point estimates. You can play around with it yourself to see how
affects the predictions.
The posterior distribution of the weights is
where


SN <- solve(solve(S0) + beta * t(Phi)%*%Phi) mN <-SN %*% (solve(S0) %*% m0 + beta * t(Phi) %*% rain$dec)
This can be simplified a bit with the prior I’ve used, but let’s just use the general equations here.
For a new explanatory value,
, the corresponding dependent variable will be distributed as
where
, and from this we can get the maximum posterior prediction (the mode of the posterior, which is the mean for the normal distribution) and the 95% HDR
predict.mean <- function(x) t(mN) %*% phi(x) predict.sd <- function(x) sqrt(1/beta + t(phi(x)) %*% SN %*% phi(x)) qpredict <- function(x,p) qnorm(p, predict.mean(x), predict.sd(x)) lines(xs, sapply(xs, predict.mean), col='blue') lines(xs, sapply(xs, qpredict, p = 0.025), col='lightblue') lines(xs, sapply(xs, qpredict, p = 0.975), col='lightblue')
Results
I’ve plotted both the ML and the Bayesian results below:

You’ll notice that I got what I wanted: the HDRs are wider away from the dense
regions.
Of course, this is a bit of a cheat. I get this because of the way I map my
values to the features more than from a proper modelling of the certainty/uncertainty of the values in various
regions. (With a proper kernel approach I would get it without cheating).
The cheat is obvious if you notice, on the bottom left plot, that the HDR width varies dramatically in the range from 60 to 100 where there really are no
values to cause this.
Worse still, because of the way I pick the features, the two points to the right are predicted exactly which of course is just overfitting. The way the features combine, the densities with mean 100 and 150 are pretty much determined by these two points only and the information from the other points does not alleviate the overfitting.
Oh well, cheat or no cheat, it was a fun little exercise.
–
144-156=-12