Bayesian linear regression

I’m currently reading Bayesian Statistics: An introduction by Peter M. Lee, and today I’m reading chapter 6 on linear regression.

I decided to play around with one of the examples in R.

The model is just linear regression as we know and love it, that is we have explanatory variables x_i and dependent variables y_i where the latter are assumed normally distributed around a mean that depends linearly on x_i:

y_i \sim N(\alpha + \beta(x_i-\bar{x}), \phi)

where \alpha (intercept), \beta (slope), and \phi (variance) are model parameters.

Oh, \bar{x} is the mean of the x_i values, in case you are wondering, so the line is “centered” around \bar{x}.

The joint prior of the model parameters is the reference prior: p(\alpha,\beta,\phi)\propto 1/\phi and since we assume that the x_i values are always known and thus not stochastic, the posterior is

 p(\alpha,\beta,\phi\,|\,\mathbf{x},\mathbf{y}) \propto p(\alpha,\beta,\phi) p(\mathbf{y}\,|\,\mathbf{x},\alpha,\beta,\phi)

The slope

After a lot of arithmetic (buy the book if you want to check it) we get a marginal posterior distribution for the slope given by

\frac{\beta-b}{s/\sqrt{S_{xx}}} \sim \mathbf{t}_{n-2}

where n is the number of observations, b=S_{xy}/S_{xx}, S_{xx}=\sum(x_i-\bar{x})^2, S_{yy}=\sum(y_i-\bar{y})^2, S_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y}), s^2=S_{ee}/(n-2), S_{ee}=S_{yy}-S_{xy}^2/S_{xx}, and \mathbf{t}_{n-2} is the t-distribution with n-2 degrees of freedom.

Phewh.

It’s easy enough to code up in R, though.  The function

qslope <- function(p) b + qt(p,n-2)*(s/sqrt(Sxx))

computes the quantile corresponding to the p percentile, i.e. the q satisfying P(\beta \leq q) = p.

There’s an example in the book about predicting the rainfall in December based on the rainfall in November.  The data looks like this (when I type it into R):

year <- 1971:1980
nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7)
dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0,  17.6,  46.6, 57.0)
rainfall <- data.frame(cbind(year=year, nov=nov, dec=dec))

and the data points are plotted below:

In the book he just calculates the 50% high density region (HDR) for the slope for the entire data set, but I just tried calculating it for subsets (prefixes) to see how it changes as the data set grows.  I start with the three first years, since for n\leq 2 the division by n-2 is problematic.

The numbers I get are these:

n HDR.low      HDR.high
3 -1.408329    1.995358
4 -0.4224851   1.67113
5 -0.3280555   0.7114888
6 -0.4495798   0.4083625
7 -0.3618486   0.429319
8 -0.3159196   -0.1145006
9 -0.2189725   -0.03743869
10 -0.2454263  -0.07713291

or plottet:

So the HDR gets narrower as we get more data.  Nice.  And it seems that there is a negative correlation between rainfall in November and December, since the HDR is negative (below the dashed line in the plot).

A plain old linear model also sees a negative slope, but not significantly negative:

> summary(lm(dec ~ nov, rainfall))

Call:
lm(formula = dec ~ nov, data = rainfall)

Residuals:
    Min      1Q  Median      3Q     Max
-25.578  -8.542   3.682  10.231  14.628 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  50.1320     8.1620   6.142 0.000276 ***
nov          -0.1613     0.1191  -1.354 0.212773   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 13.86 on 8 degrees of freedom
Multiple R-squared: 0.1864,    Adjusted R-squared: 0.08472
F-statistic: 1.833 on 1 and 8 DF,  p-value: 0.2128

Whether the Bayesian analysis is “significant” depends on how you choose to do hypothesis testing; the HDR above just tells you that the 50% most likely slope is negative.  With a 95% HDR the interval contains zero:

> qslope(.025)
[1] -0.4359772
> qslope(.975)
[1] 0.1134179

so the interval is [-0.44,0.11].

Predictions

Given the observed data (\mathbf{x},\mathbf{y}) and a new predictor x_0 we can compute the density for the corresponding dependent variable.  From the model, of course, it must be y_0 \sim N(\gamma, \phi) with \gamma = \alpha+\beta(x_0-\bar{x}).  It is just that \gamma is unknown since \alpha and \beta are.

A bit of arithmetic, once again, gives us the following distribution for \gamma:

\gamma \sim N(a+b(x_0-\bar{x}), \phi\{1/n + (x_0-\bar{x})^2/S_{xx})

which becomes

\frac{\gamma-a-b(x_0-\bar{x})}{s\sqrt{1/n+(x_0-\bar{x})^2/S_{xx}}} \sim \mathbf{t}_{n-2}

or

qgamma <- function(x0, p=0.5)
    (a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1/n+(x0-mean.x)**2/Sxx))

in R, if we integrate \phi out.

Combining this \gamma with the normal distribution around it for y_0 just adds a little variance (one \phi to be exact) to the distribution

y_0 \sim N(\gamma, \phi\{1 + 1/n + (x_0-\bar{x})^2/S_{xx}\})

which, if we integrate over \phi, becomes

\frac{y_0-a-b(x_0-\bar{x})}{s\sqrt{1+1/n+(x_0-\bar{x})^2/S_{xx}}} \sim \mathbf{t}_{n-2}

or

qy0 <- function(x0, p=0.5)
    (a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1+1/n+(x0-mean.x)**2/Sxx))

in R.

Below, I’ve plotted the mean \gamma (black line), the 50% HDR for \gamma (blue lines) and the 50% HDR for y_0 (red lines) for the first 5 years and the full 10 years.

One thing that is obvious from the plots, but also if you look carefully at the expression for the distribution, is that the prediction distribution is more certain around the mean of the x_i values, \bar{x}.  It comes into the expression through the (x_0-\bar{x})^2 part of the variance.

I must admit that I am not particularly happy about that.

I mean, I can see that this is how the math goes, but I still don’t like it.

Since x_i is not stochastic we don’t really know how it should center around \bar{x}, so exactly how many observations do we expect to make around that point?  We cannot say!

Is there any particular reason that we should be more certain around \bar{x}?  Really only through how the model was constructed.  It is not something we get from the data.

I would much prefer if the variance was small in regions where we have observed a lot of x_i values and then larger in more sparse areas of the x axis.  Somehow that would capture the uncertainty in interpolating between areas where we have much information to areas where we have little.

This model is just not doing that, though.  It cannot, since the entire information we have about observed x_i values is captured only in their mean, \bar{x}.  It has to be that way from the model we use to related x_i to y_i.

This makes for nice simple mathematics, but maybe I would prefer something like a kernel method instead, for the reasons given above.  There, at least, the density of x_i values would have some impact on the certainty in my predictions.

143-154=-11

Tags: , ,

3 Responses to “Bayesian linear regression”

  1. Egon Willighagen Says:

    Many linear regression methods share the feature that the are more accurate around the center of the model. OLS has this too. SVR basically does the same, but the model focuses not on a single line (which just is most accurate around the mean), but focuses on the support vectors.

    Back the linear regression… think about it like this: the variance in the model is due primary due to changes in the slope and intercept; make random small changes and average that. You’ll notice that they all go through the same small region in the middle, hence being more accurate there.

    Alternatively, consider that estimating the middle of the model takes advantage of all observed points, while the slope depends on the difference between the left and right group of points, each with less points, hence less accurate.

    Sorry, for the lack of math, but if I remember correctly, you can back up these statements with hard-core statistics, but not so fluent in that.

    SVR really does the same, but has a different concept of model center.

  2. Thomas Mailund Says:

    Thanks Egon, your point makes absolutely sense to me. I hadn’t thought about it that way.

  3. Mailund on the Internet » Blog Archive » Playing some more with Bayesian linear regression Says:

    [...] so I decided to play a bit more with Bayesian linear regression.  I was aiming for something that would give me a smaller variance around values where I have a [...]

Leave a Reply