Archive for May, 2009

Bayesian linear regression

Sunday, May 24th, 2009

I’m currently reading Bayesian Statistics: An introduction by Peter M. Lee, and today I’m reading chapter 6 on linear regression.

I decided to play around with one of the examples in R.

The model is just linear regression as we know and love it, that is we have explanatory variables x_i and dependent variables y_i where the latter are assumed normally distributed around a mean that depends linearly on x_i:

y_i \sim N(\alpha + \beta(x_i-\bar{x}), \phi)

where \alpha (intercept), \beta (slope), and \phi (variance) are model parameters.

Oh, \bar{x} is the mean of the x_i values, in case you are wondering, so the line is “centered” around \bar{x}.

The joint prior of the model parameters is the reference prior: p(\alpha,\beta,\phi)\propto 1/\phi and since we assume that the x_i values are always known and thus not stochastic, the posterior is

 p(\alpha,\beta,\phi\,|\,\mathbf{x},\mathbf{y}) \propto p(\alpha,\beta,\phi) p(\mathbf{y}\,|\,\mathbf{x},\alpha,\beta,\phi)

The slope

After a lot of arithmetic (buy the book if you want to check it) we get a marginal posterior distribution for the slope given by

\frac{\beta-b}{s/\sqrt{S_{xx}}} \sim \mathbf{t}_{n-2}

where n is the number of observations, b=S_{xy}/S_{xx}, S_{xx}=\sum(x_i-\bar{x})^2, S_{yy}=\sum(y_i-\bar{y})^2, S_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y}), s^2=S_{ee}/(n-2), S_{ee}=S_{yy}-S_{xy}^2/S_{xx}, and \mathbf{t}_{n-2} is the t-distribution with n-2 degrees of freedom.

Phewh.

It’s easy enough to code up in R, though.  The function

qslope <- function(p) b + qt(p,n-2)*(s/sqrt(Sxx))

computes the quantile corresponding to the p percentile, i.e. the q satisfying P(\beta \leq q) = p.

There’s an example in the book about predicting the rainfall in December based on the rainfall in November.  The data looks like this (when I type it into R):

year <- 1971:1980
nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7)
dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0,  17.6,  46.6, 57.0)
rainfall <- data.frame(cbind(year=year, nov=nov, dec=dec))

and the data points are plotted below:

In the book he just calculates the 50% high density region (HDR) for the slope for the entire data set, but I just tried calculating it for subsets (prefixes) to see how it changes as the data set grows.  I start with the three first years, since for n\leq 2 the division by n-2 is problematic.

The numbers I get are these:

n HDR.low      HDR.high
3 -1.408329    1.995358
4 -0.4224851   1.67113
5 -0.3280555   0.7114888
6 -0.4495798   0.4083625
7 -0.3618486   0.429319
8 -0.3159196   -0.1145006
9 -0.2189725   -0.03743869
10 -0.2454263  -0.07713291

or plottet:

So the HDR gets narrower as we get more data.  Nice.  And it seems that there is a negative correlation between rainfall in November and December, since the HDR is negative (below the dashed line in the plot).

A plain old linear model also sees a negative slope, but not significantly negative:

> summary(lm(dec ~ nov, rainfall))

Call:
lm(formula = dec ~ nov, data = rainfall)

Residuals:
    Min      1Q  Median      3Q     Max
-25.578  -8.542   3.682  10.231  14.628 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  50.1320     8.1620   6.142 0.000276 ***
nov          -0.1613     0.1191  -1.354 0.212773   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 13.86 on 8 degrees of freedom
Multiple R-squared: 0.1864,    Adjusted R-squared: 0.08472
F-statistic: 1.833 on 1 and 8 DF,  p-value: 0.2128

Whether the Bayesian analysis is “significant” depends on how you choose to do hypothesis testing; the HDR above just tells you that the 50% most likely slope is negative.  With a 95% HDR the interval contains zero:

> qslope(.025)
[1] -0.4359772
> qslope(.975)
[1] 0.1134179

so the interval is [-0.44,0.11].

Predictions

Given the observed data (\mathbf{x},\mathbf{y}) and a new predictor x_0 we can compute the density for the corresponding dependent variable.  From the model, of course, it must be y_0 \sim N(\gamma, \phi) with \gamma = \alpha+\beta(x_0-\bar{x}).  It is just that \gamma is unknown since \alpha and \beta are.

A bit of arithmetic, once again, gives us the following distribution for \gamma:

\gamma \sim N(a+b(x_0-\bar{x}), \phi\{1/n + (x_0-\bar{x})^2/S_{xx})

which becomes

\frac{\gamma-a-b(x_0-\bar{x})}{s\sqrt{1/n+(x_0-\bar{x})^2/S_{xx}}} \sim \mathbf{t}_{n-2}

or

qgamma <- function(x0, p=0.5)
    (a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1/n+(x0-mean.x)**2/Sxx))

in R, if we integrate \phi out.

Combining this \gamma with the normal distribution around it for y_0 just adds a little variance (one \phi to be exact) to the distribution

y_0 \sim N(\gamma, \phi\{1 + 1/n + (x_0-\bar{x})^2/S_{xx}\})

which, if we integrate over \phi, becomes

\frac{y_0-a-b(x_0-\bar{x})}{s\sqrt{1+1/n+(x_0-\bar{x})^2/S_{xx}}} \sim \mathbf{t}_{n-2}

or

qy0 <- function(x0, p=0.5)
    (a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1+1/n+(x0-mean.x)**2/Sxx))

in R.

Below, I’ve plotted the mean \gamma (black line), the 50% HDR for \gamma (blue lines) and the 50% HDR for y_0 (red lines) for the first 5 years and the full 10 years.

One thing that is obvious from the plots, but also if you look carefully at the expression for the distribution, is that the prediction distribution is more certain around the mean of the x_i values, \bar{x}.  It comes into the expression through the (x_0-\bar{x})^2 part of the variance.

I must admit that I am not particularly happy about that.

I mean, I can see that this is how the math goes, but I still don’t like it.

Since x_i is not stochastic we don’t really know how it should center around \bar{x}, so exactly how many observations do we expect to make around that point?  We cannot say!

Is there any particular reason that we should be more certain around \bar{x}?  Really only through how the model was constructed.  It is not something we get from the data.

I would much prefer if the variance was small in regions where we have observed a lot of x_i values and then larger in more sparse areas of the x axis.  Somehow that would capture the uncertainty in interpolating between areas where we have much information to areas where we have little.

This model is just not doing that, though.  It cannot, since the entire information we have about observed x_i values is captured only in their mean, \bar{x}.  It has to be that way from the model we use to related x_i to y_i.

This makes for nice simple mathematics, but maybe I would prefer something like a kernel method instead, for the reasons given above.  There, at least, the density of x_i values would have some impact on the certainty in my predictions.

143-154=-11

Not mistakes, just not idiomatic!

Sunday, May 24th, 2009

Here’s a list of “common Python mistakes”:

if a == None:
Really common. None is a singleton, so you can (well, actually, have) to compare to it like ‘if a is None‘ or ‘if a is not None‘.

if (condition):
This is not really pythonic, if doesn’t take () around the condition in python, () purpose is grouping, you can use them in an if but only to clarify a really complex condition or similar stuff. Please don’t clutter code with unneeded characters.

if len(list):
This is a misunderstanding of Python way of things, any empty sequence (string, list, tuple) is False. So if you want to make sure a list is empty, just check if it’s False. Same for tuples, and in some ocassions for strings.

if type(obj) is type(1):
This is killing kittens, there’s a builtin function for this purpose, it’s called -surprise- instance(object, type). For example instance(123, int)

Or are they really mistakes?

They will all work exactly as intended, and essentially with the same runtime complexity.

The only problem is that they are not the idiomatic way of doing things.

Of course, you should follow the language conventions if you want other people to be able to read your code without continuously wondering why you are doing this or that.  Anything that breaks conventions have other programmers stop and wonder what is going on.

Still, I wouldn’t call these mistakes as such.  I’d prefer to reserve that word for constructions that does not have the desired effect — like a = None instead of a == None which at least will give you a syntax error in Python but wouldn’t in C — or does the right thing but at a significant runtime cost — like for (int i = 0; i < strlen(s); ++i) that gives you a quadratic rather than linear running time.

143-153=-10

Problems with boost::interval

Saturday, May 23rd, 2009

I need to work with exponentiation and logarithms with boost interval arithmetic, but nothing seems to work for me.

Here’s a simple little test program:

#include <iostream>
#include <boost/numeric/interval.hpp>
using namespace boost::numeric;
using namespace interval_lib;

int main()
{
  interval<double> a = 1, b = 1;
  std::cout << width(exp(a + b)) << std::endl;
  return 0;
}

If I try to compile this,I get the errors:

/usr/include/boost/numeric/interval/transc.hpp: In function ‘boost::numeric::interval<T, Policies> boost::numeric::exp(const boost::numeric::interval<T, Policies>&) [with T = double, Policies = boost::numeric::interval_lib::policies<boost::numeric::interval_lib::rounded_math<double>, boost::numeric::interval_lib::checking_strict<double> >]‘:
foo.cc:9:   instantiated from here
/usr/include/boost/numeric/interval/transc.hpp:34: error: ‘struct boost::numeric::interval_lib::rounded_math<double>’ has no member named ‘exp_down’
/usr/include/boost/numeric/interval/transc.hpp:34: error: ‘struct boost::numeric::interval_lib::rounded_math<double>’ has no member named ‘exp_up’

Why aren’t exp_down and exp_up defined?  According to the documentation, they should be defined for the built in types:

By default, it is interval_lib::rounded_math<T>. The class interval_lib::rounded_math is already specialized for the standard floating types (float , double and long double). So if the base type of your intervals is not one of these, a good solution would probably be to provide a specialization of this class. But if the default specialization of rounded_math<T> for float, double, or long double is not what you seek, or you do not want to specialize interval_lib::rounded_math<T> (say because you prefer to work in your own namespace) you can also define your own rounding policy and pass it directly to interval_lib::policies.

This is really annoying me, ’cause I have no idea what I am doing wrong or how to fix it.  No idea whatsoever, and I cannot continue with my project before figuring it out.

Sometimes, the boost library is just a little too complex for me, I guess…

142-152=-10

Danes and evolution

Saturday, May 23rd, 2009

Hat tip Framing Science.

142-151=-9

First Danish astronaut

Wednesday, May 20th, 2009

Today Denmark got its first astronaut.

The European Space Agency (ESA) has chosen Andreas Mogensen as the first Danish astronaut to join the European space programme.

ESA launched its search for a new corps of astronauts in May last year and received 8413 applicants, including 35 from Denmark. Of the thousands of eager applicants, six were chosen, including 32-year-old Mogensen.

Congratulations to him!

139-150=-11