Posts Tagged ‘bayesian statistics’

Playing some more with Bayesian linear regression

Monday, May 25th, 2009

Ok, so I decided to play a bit more with Bayesian linear regression.  I was aiming for something that would give me a smaller variance around x values where I have a lot of samples, and since I could not get that with a “proper linear model” I hacked it about a bit.

I’m using the same data as before

year <- 1971:1980
nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7)
dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0,  17.6,  46.6, 57.0)
rainfall <- data.frame(cbind(year, nov, dec))

but a different model.  One that looks a bit more like a kernel method (that is how I get the desired feature), but is still formulated as a linear regression model.

The notation and math is taken from Bishop Pattern recognition and machine learning.

The model

The main difference is the way the relationship between the explanatory variables, x_i, and dependent variables, y_i, is modelled.  The explanatory variables are not used directly, but mapped into a vector of “features”, \phi(x_i) that is then used in the linear model y_i \sim N( \mathbf{w}^T \phi(x_i), 1/\beta).  For convenience we work with the precision \beta rather than variance \sigma^2=1/\beta.

This is still a linear model, only it is linear in the weights we train, \mathbf{w}, and in the features \phi(x_i), but not in the explanatory variables x_i.

I’ve used five features: constant 1 to get an intercept, and four normal densities, all with the same variance but with means in 0, 50, 100, and 150, respectively.  That covers the range of x_i values nicely, so that should give me an okay feature set.

Implemented in R, the \phi(x) function is then:

phi <- function(x)
     c(1, dnorm(x, 0, 10), dnorm(x,50,10), dnorm(x,100,10), dnorm(x,150,10))

I guess there is no point in trying to interpret this model.  For the linear model yesterday, we can use the parameters to reason about the underlying system.  Not so for this arbitrary set of features.  This is a model that is only aimed at predicting new dependent variables; it is not trying to model some underlying physical reality we can reason about from the model.

Maximum likelihood predictions

Let \Phi be the matrix with a row per (x_i,y_i) in our data set, where row i is \phi(x_i).

Phi <- t(sapply(rain$nov, phi))

The maximum likelihood estimate of the weights is given by \mathbf{w}_{\mathrm{ML}} = \left(\Phi^T\Phi\right)^{-1}\Phi^T\mathbf{y} and the maximum likelihood of the precision is given by 1/\beta_{\mathrm{ML}}=\frac{1}{N}\sum_{i=1}^N (y_i-\mathbf{w}_{\mathrm{ML}}\phi(x_i))^2.

wML <- solve(t(Phi)%*%Phi) %*% t(Phi) %*% rain$dec
predictions <- sapply(rain$nov, function(x) t(wML) %*% phi(x))
beta = 1/(sum( (rain$dec - predictions)**2 )/N)

Given a new x value, calculating the best predictions for the corresponding y value or the 95% HDR is straight forward.

xs <- seq(0, 150, length.out=100)
ys <- sapply(xs, function(x) t(wML) %*% phi(x))
lines(xs, ys, col='darkgreen')
lines(xs, qnorm(0.025, ys, sqrt(1/beta)), col='green')
lines(xs, qnorm(0.975, ys, sqrt(1/beta)), col='green')

Below I’ve plotted this for five of the data points and the full data with 10 data points.  (There’s some numerical problems when I use the first five data points, so I’ve used the last 5, but you get the idea).

Notice that here the width of the HDR does not depend on the x value.  The variance is estimated from the entire data set and just used in each position.

Bayesian predictions

For the Bayesian approach I’ll assume that the precision is know (and equal to the \beta I get from the maximum likelihood estimate).  This is just because it makes it easier, since then I have all the equations I need from my text book.  Otherwise I’d have to integrate it out and get some t density instead of the normal, but I won’t bother.

With that assumption, I only need to worry about the weights, which will get the conjugate prior p(\mathbf{w}) \sim N(\mathbf{m}_0, \mathbf{S}_0).

I use a prior mean of zero, \mathbf{m}_0=\mathbf{0}, and the covariance matrix \mathbf{S}_0=\alpha^{-1}\mathbf{I}.

m0 <- c(0,0,0,0,0)
alpha <- 1e-10
S0 <- diag(1/alpha,5)

So \alpha is the precision of the prior.  The smaller the precision, the less influence the prior has on the estimate and for \alpha\to 0 the maximum posterior estimate will tend towards the maximum likelihood estimate.  I’ve picked a pretty small \alpha to get a prediction close to the ML estimate, since I am mainly interested in the HDRs here and not the point estimates.  You can play around with it yourself to see how \alpha affects the predictions.

The posterior distribution of the weights is p(\mathbf{w}\,|\,\mathbf{x},\mathbf{y}) = N(\mathbf{m}_N,\mathbf{S}_N) where

\mathbf{S}^{-1}_N = \mathbf{S}^{-1}_0 + \beta\Phi^T\Phi

\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}^{-1}_0\mathbf{m}_0 + \beta\Phi^T\mathbf{y}\right)

SN <- solve(solve(S0) + beta * t(Phi)%*%Phi)
mN <-SN %*% (solve(S0) %*% m0 + beta * t(Phi) %*% rain$dec)

This can be simplified a bit with the prior I’ve used, but let’s just use the general equations here.

For a new explanatory value, x, the corresponding dependent variable will be distributed as y \sim N(\mathbf{m}_N^T\phi(x), \sigma_N^2(x)) where \sigma^2_N=\frac{1}{\beta}+\phi(x)^T\mathbf{S}_N\phi(x), and from this we can get the maximum posterior prediction (the mode of the posterior, which is the mean for the normal distribution) and the 95% HDR

predict.mean <- function(x) t(mN) %*% phi(x)
predict.sd <- function(x) sqrt(1/beta + t(phi(x)) %*% SN %*% phi(x))
qpredict <- function(x,p) qnorm(p, predict.mean(x), predict.sd(x))

lines(xs, sapply(xs, predict.mean), col='blue')
lines(xs, sapply(xs, qpredict, p = 0.025), col='lightblue')
lines(xs, sapply(xs, qpredict, p = 0.975), col='lightblue')

Results

I’ve plotted both the ML and the Bayesian results below:

You’ll notice that I got what I wanted: the HDRs are wider away from the dense x regions.

Of course, this is a bit of a cheat.  I get this because of the way I map my x values to the features more than from a proper modelling of the certainty/uncertainty of the values in various x regions.  (With a proper kernel approach I would get it without cheating).

The cheat is obvious if you notice, on the bottom left plot, that the HDR width varies dramatically in the range from 60 to 100 where there really are no x values to cause this.

Worse still, because of the way I pick the features, the two points to the right are predicted exactly which of course is just overfitting.  The way the features combine, the densities with mean 100 and 150 are pretty much determined by these two points only and the information from the other points does not alleviate the overfitting.

Oh well, cheat or no cheat, it was a fun little exercise.

144-156=-12

Bayesian linear regression

Sunday, May 24th, 2009

I’m currently reading Bayesian Statistics: An introduction by Peter M. Lee, and today I’m reading chapter 6 on linear regression.

I decided to play around with one of the examples in R.

The model is just linear regression as we know and love it, that is we have explanatory variables x_i and dependent variables y_i where the latter are assumed normally distributed around a mean that depends linearly on x_i:

y_i \sim N(\alpha + \beta(x_i-\bar{x}), \phi)

where \alpha (intercept), \beta (slope), and \phi (variance) are model parameters.

Oh, \bar{x} is the mean of the x_i values, in case you are wondering, so the line is “centered” around \bar{x}.

The joint prior of the model parameters is the reference prior: p(\alpha,\beta,\phi)\propto 1/\phi and since we assume that the x_i values are always known and thus not stochastic, the posterior is

 p(\alpha,\beta,\phi\,|\,\mathbf{x},\mathbf{y}) \propto p(\alpha,\beta,\phi) p(\mathbf{y}\,|\,\mathbf{x},\alpha,\beta,\phi)

The slope

After a lot of arithmetic (buy the book if you want to check it) we get a marginal posterior distribution for the slope given by

\frac{\beta-b}{s/\sqrt{S_{xx}}} \sim \mathbf{t}_{n-2}

where n is the number of observations, b=S_{xy}/S_{xx}, S_{xx}=\sum(x_i-\bar{x})^2, S_{yy}=\sum(y_i-\bar{y})^2, S_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y}), s^2=S_{ee}/(n-2), S_{ee}=S_{yy}-S_{xy}^2/S_{xx}, and \mathbf{t}_{n-2} is the t-distribution with n-2 degrees of freedom.

Phewh.

It’s easy enough to code up in R, though.  The function

qslope <- function(p) b + qt(p,n-2)*(s/sqrt(Sxx))

computes the quantile corresponding to the p percentile, i.e. the q satisfying P(\beta \leq q) = p.

There’s an example in the book about predicting the rainfall in December based on the rainfall in November.  The data looks like this (when I type it into R):

year <- 1971:1980
nov <- c(23.9, 43.3, 36.3, 40.6, 57.0, 52.5, 46.1, 142.0, 112.6, 23.7)
dec <- c(41.0, 52.0, 18.7, 55.0, 40.0, 29.2, 51.0,  17.6,  46.6, 57.0)
rainfall <- data.frame(cbind(year=year, nov=nov, dec=dec))

and the data points are plotted below:

In the book he just calculates the 50% high density region (HDR) for the slope for the entire data set, but I just tried calculating it for subsets (prefixes) to see how it changes as the data set grows.  I start with the three first years, since for n\leq 2 the division by n-2 is problematic.

The numbers I get are these:

n HDR.low      HDR.high
3 -1.408329    1.995358
4 -0.4224851   1.67113
5 -0.3280555   0.7114888
6 -0.4495798   0.4083625
7 -0.3618486   0.429319
8 -0.3159196   -0.1145006
9 -0.2189725   -0.03743869
10 -0.2454263  -0.07713291

or plottet:

So the HDR gets narrower as we get more data.  Nice.  And it seems that there is a negative correlation between rainfall in November and December, since the HDR is negative (below the dashed line in the plot).

A plain old linear model also sees a negative slope, but not significantly negative:

> summary(lm(dec ~ nov, rainfall))

Call:
lm(formula = dec ~ nov, data = rainfall)

Residuals:
    Min      1Q  Median      3Q     Max
-25.578  -8.542   3.682  10.231  14.628 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  50.1320     8.1620   6.142 0.000276 ***
nov          -0.1613     0.1191  -1.354 0.212773   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 13.86 on 8 degrees of freedom
Multiple R-squared: 0.1864,    Adjusted R-squared: 0.08472
F-statistic: 1.833 on 1 and 8 DF,  p-value: 0.2128

Whether the Bayesian analysis is “significant” depends on how you choose to do hypothesis testing; the HDR above just tells you that the 50% most likely slope is negative.  With a 95% HDR the interval contains zero:

> qslope(.025)
[1] -0.4359772
> qslope(.975)
[1] 0.1134179

so the interval is [-0.44,0.11].

Predictions

Given the observed data (\mathbf{x},\mathbf{y}) and a new predictor x_0 we can compute the density for the corresponding dependent variable.  From the model, of course, it must be y_0 \sim N(\gamma, \phi) with \gamma = \alpha+\beta(x_0-\bar{x}).  It is just that \gamma is unknown since \alpha and \beta are.

A bit of arithmetic, once again, gives us the following distribution for \gamma:

\gamma \sim N(a+b(x_0-\bar{x}), \phi\{1/n + (x_0-\bar{x})^2/S_{xx})

which becomes

\frac{\gamma-a-b(x_0-\bar{x})}{s\sqrt{1/n+(x_0-\bar{x})^2/S_{xx}}} \sim \mathbf{t}_{n-2}

or

qgamma <- function(x0, p=0.5)
    (a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1/n+(x0-mean.x)**2/Sxx))

in R, if we integrate \phi out.

Combining this \gamma with the normal distribution around it for y_0 just adds a little variance (one \phi to be exact) to the distribution

y_0 \sim N(\gamma, \phi\{1 + 1/n + (x_0-\bar{x})^2/S_{xx}\})

which, if we integrate over \phi, becomes

\frac{y_0-a-b(x_0-\bar{x})}{s\sqrt{1+1/n+(x_0-\bar{x})^2/S_{xx}}} \sim \mathbf{t}_{n-2}

or

qy0 <- function(x0, p=0.5)
    (a+b*(x0-mean.x)) + qt(p,n-2)*(s*sqrt(1+1/n+(x0-mean.x)**2/Sxx))

in R.

Below, I’ve plotted the mean \gamma (black line), the 50% HDR for \gamma (blue lines) and the 50% HDR for y_0 (red lines) for the first 5 years and the full 10 years.

One thing that is obvious from the plots, but also if you look carefully at the expression for the distribution, is that the prediction distribution is more certain around the mean of the x_i values, \bar{x}.  It comes into the expression through the (x_0-\bar{x})^2 part of the variance.

I must admit that I am not particularly happy about that.

I mean, I can see that this is how the math goes, but I still don’t like it.

Since x_i is not stochastic we don’t really know how it should center around \bar{x}, so exactly how many observations do we expect to make around that point?  We cannot say!

Is there any particular reason that we should be more certain around \bar{x}?  Really only through how the model was constructed.  It is not something we get from the data.

I would much prefer if the variance was small in regions where we have observed a lot of x_i values and then larger in more sparse areas of the x axis.  Somehow that would capture the uncertainty in interpolating between areas where we have much information to areas where we have little.

This model is just not doing that, though.  It cannot, since the entire information we have about observed x_i values is captured only in their mean, \bar{x}.  It has to be that way from the model we use to related x_i to y_i.

This makes for nice simple mathematics, but maybe I would prefer something like a kernel method instead, for the reasons given above.  There, at least, the density of x_i values would have some impact on the certainty in my predictions.

143-154=-11

Simultaneous analysis of all SNPs in a genome-wide association study

Monday, September 15th, 2008

In our association mapping journal club a few weeks back, we discussed this paper (I just never got around to writing down my thoughts on it until now):

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies

Hoggard, Whittaker, De Iorio and Balding, PLoS Genetics 2008

Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.

I already heard about the method when I was visiting Imperial College to give a seminar last year, so I am happy that I can finally talk about it.

It is a pretty neat idea.

Regression analysis in association mapping

If you want to figure out which parameters are important for predicting some property, a good old statistical approach is regression analysis.

For a binary property, such as case or control in an association study, you could use logistic regression, but in general you construct some linear function of your parameters and transform them into the “property space” through a link function.

This setup gives you a “model” and depending on the link and the setup you have different ways of interpreting this as a statistical model with a corresponding likelihood function.

The coefficients in the linear combination of parameters are the parameters in the model, and you typically maximize the likelihood with respect to them to get your estimate for them.

In some cases you can directly interpret the parameters, but more often than not you are only interested in knowing whether there is strong evidence in the data that they should be non-zero, i.e. that the parameter in question actually has an effect on the property.

In an association study, you would use your SNPs as your parameters and you consider those SNPs with a non-zero coefficient associated with the disease.

Of course, it is never as simple as that.

Two things complicate matters: your best estimate of a coefficient will never actually be zero, so you want to test if they are significantly different from zero.  Another problem is that you have many more parameters (SNPs) than you have outcomes (individuals), so you will overfit from hell.

Strong “zero” priors

What they do in this paper is both simple and very clever.

They consider the problem in a Bayesian setting and put strong priors on the coefficients, that will tend to keep them at zero unless the signal in the data  is strong enough to pull them away from there.

They then test for association by testing if the mode of the posteriors for these parameters have moved away from zero.

A very nice consequence of this is that you can analyse the entire data at the same time, rather than testing markers individually, which means that if several markers are in LD with a causal marker, you will tend to only pick one of them and recognize that the signal in the others is essentially the same signal.

It also seems quite computationally feasible.  A few hours on a desktop computer to analyse a GWA data set.


Clive J. Hoggart, John C. Whittaker, Maria De Iorio, David J. Balding, Peter M. Visscher (2008). Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies PLoS Genetics, 4 (7) DOI: 10.1371/journal.pgen.1000130

Bayesian Graphical Models for Genomewide Association Studies

Thursday, May 15th, 2008

ResearchBlogging.org
Lately I have been interested in Bayesian approaches for genome wide association mapping. These have the benefit of often being able to consider all data in a single model and from that score markers with their probability of being related to the disease without the usual multiple testing problems.

So for our journal club today, I picked the following paper for us:

Bayesian Graphical Models for Genomewide Association Studies
Verzilli, Stallard and Whittaker
American Journal of Human Genetics 79(1) 100-112

Abstract

As the extent of human genetic variation becomes more fully characterized, the research community is faced with the challenging task of using this information to dissect the heritable components of complex traits. Genomewide association studies offer great promise in this respect, but their analysis poses formidable difficulties. In this article, we describe a computationally efficient approach to mining genotype-phenotype associations that scales to the size of the data sets currently being collected in such studies. We use discrete graphical models as a data-mining tool, searching for single- or multilocus patterns of association around a causative site. The approach is fully Bayesian, allowing us to incorporate prior knowledge on the spatial dependencies around each marker due to linkage disequilibrium, which reduces considerably the number of possible graphical structures. A Markov chain–Monte Carlo scheme is developed that yields samples from the posterior distribution of graphs conditional on the data from which probabilistic statements about the strength of any genotype-phenotype association can be made. Using data simulated under scenarios that vary in marker density, genotype relative risk of a causative allele, and mode of inheritance, we show that the proposed approach has better localization properties and leads to lower false-positive rates than do single-locus analyses. Finally, we present an application of our method to a quasi-synthetic data set in which data from the CYP2D6 region are embedded within simulated data on 100K single-nucleotide polymorphisms. Analysis is quick (<5 min), and we are able to localize the causative site to a very short interval.

The idea in the paper is to capture the joint distribution of all genotypes and phenotypes in a graph model — that explicitly capture which model variables are independent and which are not — and from this read off the probability that each genotype is independent from the phenotype or not.

By putting a few restrictions on the topology of the kind of graphs they consider, it is possible to calculate the likelihood of any given graph by, essentially, running through the graph and calculate conditional probabilities of all cliques in the graph, where each clique is modeled as a multinomial distribution over genotypes, possibly together with phenotypes. With a small re-write, this becomes a product of independent probabilities divided by another product of independent probabilities.

p(C1) x p(C2) x … x p(CL) / p(S1) x p(S2) x … x p(SR)

Using Dirichlet priors, the parameters of the multinomial distributions can be integrated out, and the resulting expression is very fast to compute, making it feasible to sample over graphs in an MCMC.

This integral, though, I don’t think is correct. It comes down to those products of independent probabilities. These guys depend on the multinomial parameters, and while the terms in the nominator depend on disjoint parameters and the same for the denominator, there is an overlap in the parameters used in the nominator and denominator that introduces dependencies.

p(C1|θ) x p(C2|θ) x … x p(CL|θ) / p(S1|θ) x p(S2|θ) x … x p(SR|θ)

= p(C1|θ1) x p(C2|θ2) x … x p(CL|θL) / p(S1|θ1) x p(S2|θ2) x … x p(SR|θR)

When integrating over the set of parameters, θ, if these could be split into disjoint parameters for each of the independent probabilities, the integral could be solved for each term individually — which is easy with a conjugate prior. When the parameters cannot be split in disjoint sets — and here they cannot — then that trick doesn’t work.

In the paper they do it anyway, though.

This just means that their model introduces more independence than it was really supposed to, and that it the model they describe is not really the model they run the MCMC on, but all that really matters is how well the method identifies genotype-phenotype association, and that it seems to be pretty good at.


VERZILLI, C. (2006). Bayesian Graphical Models for Genomewide Association Studies. The American Journal of Human Genetics, 79(1), 100-112. DOI: 10.1086/505313