## Archive for August 7th, 2009

### How do scientists really use computers?

Friday, August 7th, 2009

There's a nice short article in *American Scientist* titled "How do scientists really use computers?"

An interesting read if you, like me, teach computer science to life scientists rather than computer scientists.

The conclusion doesn't surprise me much, though:

> Our results can be interpreted in many ways, but I think two things are clear. The first is that if funding agencies, vendors and computer science researchers really want to help working scientists do more science, they should invest more in conventional small-scale computing. Big-budget supercomputing projects and e-science grids are more likely to capture magazine covers, but improvements to mundane desktop applications, and to the ways scientists use them, will have more real impact.

Even at BiRC, where we do a lot of genome analysis that really does need computer grids, most of our computer use is still on desktop machines.

--


### The problem with confidence intervals

Friday, August 7th, 2009

I've ranted plenty about the problem with p-values and the common misunderstanding that the p-value is the probability of the null model being true.  But what about confidence intervals?

It is a common misunderstanding that if you have a 95% confidence interval, then the parameter you are estimating is within the interval with 95% probability.  This turns out to be only slightly wrong, and whereas the misunderstanding about p-values simply doesn't make sense at all considering how p-values are used, you cannot really run into much trouble with this one.

### Inference uncertainty

Okay, so what is a confidence interval, and why is the above a misunderstanding?

A confidence interval is something we use when estimating parameters; it is supposed to capture the uncertainty there is in the inference.

Let's say I want to know the weight of the water in a small lake close to where I live.  Don't ask why - what I do in my free time is none of your business, and anyway it is just an example, so play along!  So I go down to the lake, get a liter of the water and weigh it. Yeah, it is probably going to be close to 1kg, since that would be the weight of pure water and the lake water presumably is mostly water, but it will not be exactly that because of all kinds of impurities in it.

So the parameter I'm interested in is the weight of the water.  The water is probably reasonably homogeneous so it doesn't matter where I sample the water.

If I just do this once, I will have a single measurement, and that will be my estimate of the parameter.  But homogeneous or not, there might be a bit more mud or something in the sample I take than usual, or a bit less, so there could be some measurement error, and my one measure is not the true value.

You all know how to get around this: you take several samples and use the average as the estimate.

All well and good, but once you start taking several samples to get the parameter, you have already admitted that there is some uncertainty in your measurements, and that leaks into uncertainty in your estimate.  The degree of uncertainty, however, cannot be seen from the final estimate.

This uncertainty depends on the variation in your measures, of course - if they vary a lot you are less certain than if they are all roughly the same - but also on the number of samples: with just the first sample there is no variation in the measures, but that doesn't mean you will trust that one measure more than the average of ten measures.
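
To make this concrete, here is a quick sketch in Python - with made-up measurement numbers - of the standard error of the mean, the usual way of quantifying how both effects enter the uncertainty of the estimate:

```python
import statistics

# Hypothetical weights (kg) of one-litre samples of lake water.
measurements = [1.002, 0.998, 1.005, 1.001, 0.997, 1.003, 0.999, 1.004]

n = len(measurements)
mean = statistics.mean(measurements)   # our estimate of the true weight
sd = statistics.stdev(measurements)    # how much individual measures vary
se = sd / n ** 0.5                     # standard error of the mean:
                                       # grows with sd, shrinks as n grows
```

More variation in the measures gives a larger standard error; more samples give a smaller one - exactly the two effects described above.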

### Confidence intervals

It is this uncertainty that is captured by the confidence interval.

Before we can quantify the uncertainty at all we need to set up a model of the data, of course.  There is no "one size fits all" here, but there are lots of simple models that often work.

For the water weight example a good first attempt would be to assume that the measurements X₁, …, Xₙ are stochastic variables distributed as N(w, σ²), where w is the true weight - the parameter we wish to infer - and σ² is the variance in the measures caused by the measurement uncertainty, a nuisance parameter.

When estimating w we sample n observations and use the average value X̄, so from the model of individual measurements we also get a model of the estimator.  In this particular case X̄ ~ N(w, σ²/n), from the central limit theorem.

In general, a confidence interval for a parameter θ is two statistics L and U - functions of the stochastic variables - satisfying P(L ≤ θ ≤ U) = α, where α is the confidence level, typically 95%.

This might look a little complicated, but it isn't really.  Just like the mean of a set of observations - we could call it X̄ - is a way of summarising an aspect of the data, so are L and U, and just as the mean of n observations is a stochastic variable (until you actually make the observations) so will L and U be.

The trick, of course, is defining them so they satisfy P(L ≤ θ ≤ U) = α.  In general this is not easy to do, but for our example here it is.

We know that our estimate X̄ - the mean of n observations - is distributed as N(w, σ²/n).  That's what we got from the central limit theorem.

So given w and σ² we can easily define an interval that X̄ will fall into with probability 95%, and the obvious choice here would be the symmetric one around w: [w − 1.96·σ/√n, w + 1.96·σ/√n].  We could call this the 95% interval (but not a confidence interval quite yet, since it doesn't refer to any given estimate or observations).
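
Under the pretend assumption that we somehow knew w and σ, the 95% interval is one line of arithmetic.  A sketch with made-up parameter values:

```python
# Hypothetical known parameters: true weight w (kg) and measurement sd sigma.
w, sigma, n = 1.0, 0.003, 10

# Symmetric 95% interval around w for the mean of n measurements;
# 1.96 is the 97.5% quantile of the standard normal distribution.
half_width = 1.96 * sigma / n ** 0.5
interval = (w - half_width, w + half_width)
```

Increasing n shrinks `half_width`, matching the intuition that more data means less uncertainty.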

It has the properties we would expect from a confidence interval, though: it depends on the underlying uncertainty in the model, σ², and it gets smaller as we get more data and reduce our uncertainty.

The problem is, of course, that this is completely useless, since we don't know w and σ² and we need those guys to get this interval.

Well, not quite useless.

Because we know the distribution of X̄ we know how much larger or smaller than w we expect it to be (the 95% interval from above), and since X̄ is symmetrically distributed around w the distance is the same whether X̄ is above or below w.

So we can get the 95% confidence interval for w by taking the width of the 95% interval (for X̄) but centering it at X̄: [X̄ − 1.96·σ/√n, X̄ + 1.96·σ/√n].

For the confidence interval defined this way, w falls outside the confidence interval exactly when X̄ is either smaller than the lower limit of the 95% interval or larger than the upper limit.  In other words, w is outside the 95% confidence interval precisely when X̄ is outside the 95% interval - and X̄ falls inside that interval with 95% probability - so the confidence interval defined this way satisfies the requirement we made for it.
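
Putting the pieces together, a sketch of computing the 95% confidence interval from data alone, plugging in the estimated standard deviation for σ (the simplification admitted in the parenthetical caveat below); the measurement numbers are made up:

```python
import statistics

# Hypothetical weights (kg) of one-litre samples of lake water.
measurements = [1.002, 0.998, 1.005, 1.001, 0.997, 1.003, 0.999, 1.004]

n = len(measurements)
xbar = statistics.mean(measurements)
s = statistics.stdev(measurements)     # estimate of sigma

# Same width as the 95% interval for the mean, but centered at xbar.
half_width = 1.96 * s / n ** 0.5
ci = (xbar - half_width, xbar + half_width)
```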

(Yes, I know that I also need to know σ² for this to work, and there is some added uncertainty from using an estimate rather than the true value.  The width of the interval, for example, will be a function of the estimate of the variance if you do as above, and if you underestimate σ² your interval will be too narrow.  Go read a statistics textbook if this bothers you; for the example here I'm just going to ignore it.)

### What's with the misunderstanding about confidence intervals?

Ok, so that really does sound like if we have a confidence interval, then the true parameter is within the interval with 95% probability, right?

Yes, almost, which is why I wrote that this misunderstanding doesn't matter much and won't get you into any problems (unlike the p-value misunderstanding, which leads to craziness and probably causes cancer and global warming as well).

The distinction is almost philosophical and has to do with what we consider stochastic and what we consider fixed.

Here, we consider the parameter unknown but fixed.  It doesn't vary and there is no stochasticity to it.  So it simply doesn't make sense to talk about the probability of it being in any given interval.  It is either in the interval or not in the interval and that is all there is to it.

The confidence interval, on the other hand, is stochastic.  If we did the experiment 10 times, we would get 10 different confidence intervals, each of which would contain the true parameter with probability 95%.

The confidence interval will contain the true parameter with 95% probability, but the true parameter does not fall within the interval with 95% probability.
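
This repeated-experiment reading is easy to check by simulation.  A sketch with made-up parameter values: draw many data sets from the model, compute a confidence interval for each, and count how often the fixed true parameter falls inside:

```python
import random
import statistics

random.seed(1)
w, sigma, n = 1.0, 0.003, 10   # hypothetical true weight and measurement sd

def confidence_interval(data):
    xbar = statistics.mean(data)
    half = 1.96 * statistics.stdev(data) / len(data) ** 0.5
    return xbar - half, xbar + half

trials = 10_000
hits = 0
for _ in range(trials):
    lo, hi = confidence_interval([random.gauss(w, sigma) for _ in range(n)])
    if lo <= w <= hi:
        hits += 1

coverage = hits / trials   # close to 0.95, slightly below because sigma
                           # is estimated from only ten measurements
```

Each trial gives a different interval, yet the fraction that contain w hovers near 95% - the probability statement is about the intervals, not about w.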

The difference is so subtle that it borders on the ridiculous to even talk about it.

### Credible intervals

In Bayesian statistics it is the other way around.  There you do get fixed intervals and random parameters.

Here, parameters we are uncertain about are not considered fixed-but-unknown; they are assumed to be stochastic. Based on observations and a prior probability distribution you infer a credible interval that is then considered fixed, while the parameter is stochastic (and falls within the interval with 95% probability).

Anyway, if you have a confidence interval, don't say "with 95% probability the parameter lies in the interval [a,b]..." since strictly speaking that is incorrect.  For a credible interval it would be correct.
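
For contrast, a sketch of a credible interval in the simplest conjugate setting - a normal prior on the weight, with the measurement standard deviation treated as known; all the numbers are made up:

```python
# Likelihood summary: xbar ~ N(w, sigma^2 / n), with sigma treated as known.
sigma, n, xbar = 0.003, 10, 1.001
# Prior on the weight: w ~ N(mu0, tau0^2).
mu0, tau0 = 1.0, 0.01

# With a normal prior and a normal likelihood the posterior is normal too;
# precisions (inverse variances) add, and the posterior mean is the
# precision-weighted average of the prior mean and the data average.
post_prec = 1 / tau0**2 + n / sigma**2
post_mean = (mu0 / tau0**2 + n * xbar / sigma**2) / post_prec
post_sd = post_prec ** -0.5

# 95% credible interval: a fixed interval that the stochastic parameter
# falls inside with 95% posterior probability.
credible = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
```

The posterior mean sits between the prior mean and the data average, pulled almost all the way to the data here because ten measurements carry far more precision than this prior.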

In practice, it makes no difference whatsoever, so don't worry too much about it.

--
