Mouse genome completed (again?)

When I looked up some genome sequencing statistics a little while back, I was surprised to learn that only the human and mouse genomes are considered “completed”.  Maybe the mouse genome wasn’t after all, because this month there’s a PLoS Biology paper on the completed genome.

Church DM, Goodstadt L, Hillier LW, Zody MC, Goldstein S, et al. (2009) Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse. PLoS Biol 7(5): e1000112. doi:10.1371/journal.pbio.1000112

Oh well, the distinction between a draft and a complete genome sequence is a bit artificial anyway; it is really only a matter of degree.  The “complete” human genome sequence still has a lot of gaps and is still being updated (UCSC’s genome browser updated hg18 to hg19 in February this year).  And a genome sequence is really only a reference sequence; a lot of the interesting stuff is in genomic variation, so is a genome really “complete” before we have at least a map of the common variations?


Lots of links about commenting…

A Blog Around the Clock has a list of links to blog posts about commenting on (scientific) papers.

There have been quite a few posts over the last few days about commenting, in particular about posting comments, notes and ratings on scientific papers. But this is also related to commenting on blogs and social networks, commenting on newspaper online articles, the question of moderation vs. non-moderation, and the question of anonymity vs. pseudonymity vs. RL identity.

Read the post to get all the links.

I must admit that I have never left a comment on an online paper.  If I blog about a paper, I leave a trackback, but that is as far as it goes.

Since putting a review of a paper on my blog, just to add a comment, is a lot of work, I guess I should just get used to leaving comments instead.

Still, I am reluctant to comment on papers.  I don’t mind firing off a half-thought-through comment at a blog, but I feel that for a scientific paper I should make sure I understand all the details before I start commenting on it.  I guess I just have to overcome that feeling.


Are Dog Breeds Actually Different Species?

Great piece on evolution in action at Scientific American: Are dog breeds actually different species?

So here’s the idea you’ve been patiently waiting for: let’s simply say that dog breeds are different species. Take two that Coyne highlights for their differences—the 180-pound English Mastiff and the two-pound Chihuahua. They’re both considered members of Canis lupus familiaris, and in principle artificial insemination could produce some sort of mix or possibly an exploding Chihuahua. But face it, the only shot a male Chihuahua has with a female Mastiff involves rock climbing or spelunking equipment.

Biologists clearly continue to include the two types of dogs within the same species out of modesty. But with creationists fighting evolution education throughout the country, the time calls for bold action. Let’s reassign the trembling, bug-eyed Chihuahua to its own species. Voilà, humans have observed speciation. We could call the new dog C. nervosis. Or C. cantseetheparadis. Or C. canyoupresstwelveformepleasis. Amazingly, right now Chihuahuas are still considered C. lupus familiaris, a subspecies of wolf. And calling a Chihuahua a wolf is like calling someone at the Discovery Institute a scientist.


Playing some more with Bayesian linear regression

OK, so I decided to play a bit more with Bayesian linear regression.  I was aiming for something that would give me a smaller variance around $$x$$ values where I have a lot of samples, and since I could not get that with a “proper linear model”, I hacked it about a bit.

I’m using the same data as before, but a different model: one that looks a bit more like a kernel method (that is how I get the desired feature), but is still formulated as a linear regression model.

The notation and math are taken from Bishop’s Pattern Recognition and Machine Learning.

The model

The main difference is the way the relationship between the explanatory variables, $$x_i$$, and dependent variables, $$y_i$$, is modelled.  The explanatory variables are not used directly, but mapped into a vector of “features”, $$\phi(x_i)$$ that is then used in the linear model $$y_i \sim N( \mathbf{w}^T \phi(x_i), 1/\beta)$$.  For convenience we work with the precision $$\beta$$ rather than variance $$\sigma^2=1/\beta$$.

This is still a linear model, only it is linear in the weights we train, $$\mathbf{w}$$, and in the features $$\phi(x_i)$$, but not in the explanatory variables $$x_i$$.

I’ve used five features: a constant 1 to get an intercept, and four normal densities, all with the same variance but with means at 0, 50, 100, and 150, respectively.  That covers the range of $$x_i$$ values nicely, so it should give me an okay feature set.

Implemented in R, the $$\phi(x)$$ function then looks something like this (the shared standard deviation of the densities is arbitrary; anything wide enough to cover the range will do):
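    # Feature map: an intercept plus four normal densities with means at
    # 0, 50, 100 and 150.  The shared sd (20 here) is an assumed value.
    phi <- function(x) {
      c(1,
        dnorm(x, mean = 0,   sd = 20),
        dnorm(x, mean = 50,  sd = 20),
        dnorm(x, mean = 100, sd = 20),
        dnorm(x, mean = 150, sd = 20))
    }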

I guess there is no point in trying to interpret this model.  For yesterday’s linear model, we can use the parameters to reason about the underlying system.  Not so for this arbitrary set of features.  This model is only aimed at predicting new dependent variables; it is not trying to capture some underlying physical reality we can reason about from the model.

Maximum likelihood predictions

Let $$\Phi$$ be the matrix with a row per $$(x_i,y_i)$$ in our data set, where row $$i$$ is $$\phi(x_i)$$.

The maximum likelihood estimate of the weights is given by $$\mathbf{w}_{\mathrm{ML}} = \left(\Phi^T\Phi\right)^{-1}\Phi^T\mathbf{y}$$ and the maximum likelihood estimate of the precision by $$1/\beta_{\mathrm{ML}}=\frac{1}{N}\sum_{i=1}^N \left(y_i-\mathbf{w}_{\mathrm{ML}}^T\phi(x_i)\right)^2$$.
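In R, and assuming the observations sit in vectors x and y (names picked for illustration), this is just a couple of lines:

    # Design matrix: row i is phi(x_i)
    Phi <- t(sapply(x, phi))
    # ML weights: (Phi^T Phi)^{-1} Phi^T y
    w_ML <- solve(t(Phi) %*% Phi, t(Phi) %*% y)
    # ML precision: the inverse of the mean squared residual
    beta_ML <- 1 / mean((y - Phi %*% w_ML)^2)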

Given a new $$x$$ value, calculating the best prediction for the corresponding $$y$$ value or the 95% HDR is straightforward.
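For instance, as a small helper built on the estimates above:

    # ML prediction with 95% HDR; note the width is the same for every x
    predict_ML <- function(x_new) {
      mu <- sum(w_ML * phi(x_new))
      half_width <- qnorm(0.975) * sqrt(1 / beta_ML)
      c(fit = mu, lower = mu - half_width, upper = mu + half_width)
    }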

Below I’ve plotted this for five of the data points and for the full data set of 10 data points.  (There are some numerical problems when I use the first five data points, so I’ve used the last five, but you get the idea.)
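The bands can be drawn with something along these lines (a sketch; the actual plotting code may differ):

    # Prediction curve with the 95% band over a grid of x values
    xs <- seq(0, 160, length.out = 200)
    preds <- t(sapply(xs, predict_ML))
    plot(x, y)
    lines(xs, preds[, "fit"])
    lines(xs, preds[, "lower"], lty = 2)
    lines(xs, preds[, "upper"], lty = 2)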

Notice that here the width of the HDR does not depend on the $$x$$ value.  The variance is estimated from the entire data set and just used in each position.

Bayesian predictions

For the Bayesian approach I’ll assume that the precision is known (and equal to the $$\beta$$ I get from the maximum likelihood estimate).  This is just because it makes things easier, since then I have all the equations I need from my textbook.  Otherwise I’d have to integrate it out and get a t density instead of the normal, but I won’t bother.

With that assumption, I only need to worry about the weights, which get the conjugate prior $$\mathbf{w} \sim N(\mathbf{m}_0, \mathbf{S}_0)$$.

I use a prior mean of zero, $$\mathbf{m}_0=\mathbf{0}$$, and the covariance matrix $$\mathbf{S}_0=\alpha^{-1}\mathbf{I}$$.

So $$\alpha$$ is the precision of the prior.  The smaller the precision, the less influence the prior has on the estimate and for $$\alpha\to 0$$ the maximum posterior estimate will tend towards the maximum likelihood estimate.  I’ve picked a pretty small $$\alpha$$ to get a prediction close to the ML estimate, since I am mainly interested in the HDRs here and not the point estimates.  You can play around with it yourself to see how $$\alpha$$ affects the predictions.

The posterior distribution of the weights is $$p(\mathbf{w}\,|\,\mathbf{x},\mathbf{y}) = N(\mathbf{m}_N,\mathbf{S}_N)$$ where

$$!\mathbf{S}^{-1}_N = \mathbf{S}^{-1}_0 + \beta\Phi^T\Phi$$

$$!\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}^{-1}_0\mathbf{m}_0 + \beta\Phi^T\mathbf{y}\right)$$

This can be simplified a bit with the prior I’ve used, but let’s just use the general equations here.
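In R, with the prior above, this becomes (the $$\alpha$$ here is just an illustrative small value, not the one behind the plots):

    # Posterior covariance and mean of the weights (m0 = 0, S0 = (1/alpha) I)
    alpha <- 1e-3  # assumed small prior precision; tune to taste
    S0_inv <- alpha * diag(ncol(Phi))
    SN <- solve(S0_inv + beta_ML * t(Phi) %*% Phi)
    mN <- SN %*% (beta_ML * t(Phi) %*% y)  # the S0^{-1} m0 term is zero here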

For a new explanatory value, $$x$$, the corresponding dependent variable will be distributed as $$y \sim N(\mathbf{m}_N^T\phi(x), \sigma_N^2(x))$$ where $$\sigma^2_N(x)=\frac{1}{\beta}+\phi(x)^T\mathbf{S}_N\phi(x)$$, and from this we can get the maximum posterior prediction (the mode of the predictive distribution, which for a normal is just the mean) and the 95% HDR.
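Sketched in R, mirroring predict_ML above:

    # Posterior predictive mean and 95% HDR; unlike the ML version,
    # the width now depends on where x falls relative to the data
    predict_bayes <- function(x_new) {
      ph <- phi(x_new)
      mu <- sum(mN * ph)
      s2 <- 1 / beta_ML + as.numeric(t(ph) %*% SN %*% ph)
      half_width <- qnorm(0.975) * sqrt(s2)
      c(fit = mu, lower = mu - half_width, upper = mu + half_width)
    }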

Results

I’ve plotted both the ML and the Bayesian results below:

You’ll notice that I got what I wanted: the HDRs are wider away from the dense $$x$$ regions.

Of course, this is a bit of a cheat.  I get this because of the way I map my $$x$$ values to the features more than from a proper modelling of the certainty/uncertainty of the values in various $$x$$ regions.  (With a proper kernel approach I would get it without cheating).

The cheat is obvious if you notice, on the bottom left plot, that the HDR width varies dramatically in the range from 60 to 100 where there really are no $$x$$ values to cause this.

Worse still, because of the way I pick the features, the two points to the right are predicted exactly, which of course is just overfitting.  The way the features combine, the densities with means at 100 and 150 are pretty much determined by these two points alone, and the information from the other points does not alleviate the overfitting.

Oh well, cheat or no cheat, it was a fun little exercise.


Last week in the blogs

It is Monday again, so you have a week of work to look forward to, but first, here’s a list of blog posts and news items from last week that I think you should read!

Biology

Genetics

Intellectual Property

Paleontology

Programming

Space

Statistics

The Web
