Bayesian linear regression
Sunday, May 24th, 2009I’m currently reading Bayesian Statistics: An introduction by Peter M. Lee, and today I’m reading chapter 6 on linear regression.
I decided to play around with one of the examples in R.
The model is just linear regression as we know and love it, that is we have explanatory variables
and dependent variables
where the latter are assumed normally distributed around a mean that depends linearly on
:

where
(intercept),
(slope), and
(variance) are model parameters.
Oh,
is the mean of the
values, in case you are wondering, so the line is “centered” around
.
The joint prior of the model parameters is the reference prior:
and since we assume that the
values are always known and thus not stochastic, the posterior is

The slope
After a lot of arithmetic (buy the book if you want to check it) we get a marginal posterior distribution for the slope given by

where
is the number of observations,
,
,
,
,
,
, and
is the t-distribution with
degrees of freedom.
Phewh.
It’s easy enough to code up in R, though. The function
qslope <- function(p) b + qt(p,n-2)*(s/sqrt(Sxx))
computes the quantile corresponding to the
percentile, i.e. the
satisfying
.
There’s an example in the book about predicting the rainfall in December based on the rainfall in November. The data looks like this (when I type it into R):
year <- 1971:1980 nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7) dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0, 17.6, 46.6, 57.0) rainfall <- data.frame(cbind(year=year, nov=nov, dec=dec))
and the data points are plotted below:

In the book he just calculates the 50% high density region (HDR) for the slope for the entire data set, but I just tried calculating it for subsets (prefixes) to see how it changes as the data set grows. I start with the three first years, since for
the division by
is problematic.
The numbers I get are these:
n HDR.low HDR.high 3 -1.408329 1.995358 4 -0.4224851 1.67113 5 -0.3280555 0.7114888 6 -0.4495798 0.4083625 7 -0.3618486 0.429319 8 -0.3159196 -0.1145006 9 -0.2189725 -0.03743869 10 -0.2454263 -0.07713291
or plottet:

So the HDR gets narrower as we get more data. Nice. And it seems that there is a negative correlation between rainfall in November and December, since the HDR is negative (below the dashed line in the plot).
A plain old linear model also sees a negative slope, but not significantly negative:
> summary(lm(dec ~ nov, rainfall)) Call: lm(formula = dec ~ nov, data = rainfall) Residuals: Min 1Q Median 3Q Max -25.578 -8.542 3.682 10.231 14.628 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 50.1320 8.1620 6.142 0.000276 *** nov -0.1613 0.1191 -1.354 0.212773 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 13.86 on 8 degrees of freedom Multiple R-squared: 0.1864, Adjusted R-squared: 0.08472 F-statistic: 1.833 on 1 and 8 DF, p-value: 0.2128
Whether the Bayesian analysis is “significant” depends on how you choose to do hypothesis testing; the HDR above just tells you that the 50% most likely slope is negative. With a 95% HDR the interval contains zero:
> qslope(.025) [1] -0.4359772 > qslope(.975) [1] 0.1134179
so the interval is [-0.44,0.11].
Predictions
Given the observed data
and a new predictor
we can compute the density for the corresponding dependent variable. From the model, of course, it must be
with
. It is just that
is unknown since
and
are.
A bit of arithmetic, once again, gives us the following distribution for
:

which becomes

or
qgamma <- function(x0, p=0.5)
(a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1/n+(x0-mean.x)**2/Sxx))
in R, if we integrate
out.
Combining this
with the normal distribution around it for
just adds a little variance (one
to be exact) to the distribution

which, if we integrate over
, becomes

or
qy0 <- function(x0, p=0.5)
(a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1+1/n+(x0-mean.x)**2/Sxx))
in R.
Below, I’ve plotted the mean
(black line), the 50% HDR for
(blue lines) and the 50% HDR for
(red lines) for the first 5 years and the full 10 years.


One thing that is obvious from the plots, but also if you look carefully at the expression for the distribution, is that the prediction distribution is more certain around the mean of the
values,
. It comes into the expression through the
part of the variance.
I must admit that I am not particularly happy about that.
I mean, I can see that this is how the math goes, but I still don’t like it.
Since
is not stochastic we don’t really know how it should center around
, so exactly how many observations do we expect to make around that point? We cannot say!
Is there any particular reason that we should be more certain around
? Really only through how the model was constructed. It is not something we get from the data.
I would much prefer if the variance was small in regions where we have observed a lot of
values and then larger in more sparse areas of the
axis. Somehow that would capture the uncertainty in interpolating between areas where we have much information to areas where we have little.
This model is just not doing that, though. It cannot, since the entire information we have about observed
values is captured only in their mean,
. It has to be that way from the model we use to related
to
.
This makes for nice simple mathematics, but maybe I would prefer something like a kernel method instead, for the reasons given above. There, at least, the density of
values would have some impact on the certainty in my predictions.
–
143-154=-11