Ok, so I decided to play a bit more with Bayesian linear regression. I was aiming for something that would give me a smaller variance around $$x$$ values where I have a lot of samples, and since I could not get that with a “proper linear model” I hacked it about a bit.

I’m using the same data as before

year <- 1971:1980
nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7)
dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0, 17.6, 46.6, 57.0)
rainfall <- data.frame(cbind(year, nov, dec))

but a different model. One that looks a bit more like a kernel method (that is how I get the desired feature), but is still formulated as a linear regression model.

The notation and math is taken from Bishop *Pattern recognition and machine learning*.

### The model

The main difference is the way the relationship between the explanatory variables, $$x_i$$, and dependent variables, $$y_i$$, is modelled. The explanatory variables are not used directly, but mapped into a vector of “features”, $$\phi(x_i)$$ that is then used in the linear model $$y_i \sim N( \mathbf{w}^T \phi(x_i), 1/\beta)$$. For convenience we work with the precision $$\beta$$ rather than variance $$\sigma^2=1/\beta$$.

This is still a linear model, only it is linear in the weights we train, $$\mathbf{w}$$, and in the features $$\phi(x_i)$$, but *not* in the explanatory variables $$x_i$$.

I’ve used five features: constant 1 to get an intercept, and four normal densities, all with the same variance but with means in 0, 50, 100, and 150, respectively. That covers the range of $$x_i$$ values nicely, so that should give me an okay feature set.

Implemented in R, the $$\phi(x)$$ function is then:

phi <- function(x)

c(1, dnorm(x, 0, 10), dnorm(x,50,10), dnorm(x,100,10), dnorm(x,150,10))

I guess there is no point in trying to interpret this model. For the linear model yesterday, we can use the parameters to reason about the underlying system. Not so for this arbitrary set of features. This is a model that is only aimed at predicting new dependent variables; it is not trying to model some underlying physical reality we can reason about from the model.

### Maximum likelihood predictions

Let $$\Phi$$ be the matrix with a row per $$(x_i,y_i)$$ in our data set, where row $$i$$ is $$\phi(x_i)$$.

Phi <- t(sapply(rain$nov, phi))

The maximum likelihood estimate of the weights is given by $$\mathbf{w}_{\mathrm{ML}} = \left(\Phi^T\Phi\right)^{-1}\Phi^T\mathbf{y}$$ and the maximum likelihood of the precision is given by $$1/\beta_{\mathrm{ML}}=\frac{1}{N}\sum_{i=1}^N (y_i-\mathbf{w}_{\mathrm{ML}}\phi(x_i))^2$$.

wML <- solve(t(Phi)%*%Phi) %*% t(Phi) %*% rain$dec
predictions <- sapply(rain$nov, function(x) t(wML) %*% phi(x))
beta = 1/(sum( (rain$dec - predictions)**2 )/N)

Given a new $$x$$ value, calculating the best predictions for the corresponding $$y$$ value or the 95% HDR is straight forward.

xs <- seq(0, 150, length.out=100)
ys <- sapply(xs, function(x) t(wML) %*% phi(x))
lines(xs, ys, col='darkgreen')
lines(xs, qnorm(0.025, ys, sqrt(1/beta)), col='green')
lines(xs, qnorm(0.975, ys, sqrt(1/beta)), col='green')

Below I’ve plotted this for five of the data points and the full data with 10 data points. (There’s some numerical problems when I use the first five data points, so I’ve used the last 5, but you get the idea).

Notice that here the width of the HDR does *not* depend on the $$x$$ value. The variance is estimated from the entire data set and just used in each position.

### Bayesian predictions

For the Bayesian approach I’ll assume that the precision is know (and equal to the $$\beta$$ I get from the maximum likelihood estimate). This is just because it makes it easier, since then I have all the equations I need from my text book. Otherwise I’d have to integrate it out and get some t density instead of the normal, but I won’t bother.

With that assumption, I only need to worry about the weights, which will get the conjugate prior $$p(\mathbf{w}) \sim N(\mathbf{m}_0, \mathbf{S}_0)$$.

I use a prior mean of zero, $$\mathbf{m}_0=\mathbf{0}$$, and the covariance matrix $$\mathbf{S}_0=\alpha^{-1}\mathbf{I}$$.

m0 <- c(0,0,0,0,0)
alpha <- 1e-10
S0 <- diag(1/alpha,5)

So $$\alpha$$ is the precision of the prior. The smaller the precision, the less influence the prior has on the estimate and for $$\alpha\to 0$$ the maximum posterior estimate will tend towards the maximum likelihood estimate. I’ve picked a pretty small $$\alpha$$ to get a prediction close to the ML estimate, since I am mainly interested in the HDRs here and not the point estimates. You can play around with it yourself to see how $$\alpha$$ affects the predictions.

The posterior distribution of the weights is $$p(\mathbf{w}\,|\,\mathbf{x},\mathbf{y}) = N(\mathbf{m}_N,\mathbf{S}_N)$$ where

$$!\mathbf{S}^{-1}_N = \mathbf{S}^{-1}_0 + \beta\Phi^T\Phi$$

$$!\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}^{-1}_0\mathbf{m}_0 + \beta\Phi^T\mathbf{y}\right)$$

SN <- solve(solve(S0) + beta * t(Phi)%*%Phi)
mN <-SN %*% (solve(S0) %*% m0 + beta * t(Phi) %*% rain$dec)

This can be simplified a bit with the prior I’ve used, but let’s just use the general equations here.

For a new explanatory value, $$x$$, the corresponding dependent variable will be distributed as $$y \sim N(\mathbf{m}_N^T\phi(x), \sigma_N^2(x))$$ where $$\sigma^2_N=\frac{1}{\beta}+\phi(x)^T\mathbf{S}_N\phi(x)$$, and from this we can get the maximum posterior prediction (the mode of the posterior, which is the mean for the normal distribution) and the 95% HDR

predict.mean <- function(x) t(mN) %*% phi(x)
predict.sd <- function(x) sqrt(1/beta + t(phi(x)) %*% SN %*% phi(x))
qpredict <- function(x,p) qnorm(p, predict.mean(x), predict.sd(x))
lines(xs, sapply(xs, predict.mean), col='blue')
lines(xs, sapply(xs, qpredict, p = 0.025), col='lightblue')
lines(xs, sapply(xs, qpredict, p = 0.975), col='lightblue')

### Results

I’ve plotted both the ML and the Bayesian results below:

You’ll notice that I got what I wanted: the HDRs are wider away from the dense $$x$$ regions.

Of course, this is a bit of a cheat. I get this because of the way I map my $$x$$ values to the features more than from a proper modelling of the certainty/uncertainty of the values in various $$x$$ regions. (With a proper kernel approach I would get it without cheating).

The cheat is obvious if you notice, on the bottom left plot, that the HDR width varies dramatically in the range from 60 to 100 where there really are no $$x$$ values to cause this.

Worse still, because of the way I pick the features, the two points to the right are predicted exactly which of course is just overfitting. The way the features combine, the densities with mean 100 and 150 are pretty much determined by these two points only and the information from the other points does not alleviate the overfitting.

Oh well, cheat or no cheat, it was a fun little exercise.

—

144-156=-12