Uncertainty and randomness, part I
Sunday, August 9th, 2009When I wrote about confidence intervals the other day, I was actually inspired to think about it from two papers from here:
- Dicing with the unknown, Tony O'Hagan
- The boxer, the wrestler, and the coin flip: A paradox of robust Bayesian inference and belief functions, Andrew Gelman
The topic of those two papers is two kinds of uncertainty: epistemic uncertainty and aleatory uncertainty, where the former is uncertainty due to lack of knowledge while the latter is uncertainty due to randomness.
If we're trying to estimate a parameter as in the confidence interval post, we are dealing with lack of knowledge. The parameter is assumed fixed and not random at all, but we don't know what it is so there is some uncertainty in our estimate. The more data we have available, the better we can estimate the parameter, so it is possible for us to reduce the uncertainty by increasing our knowledge.
If we want to predict if the toss of a fair coin comes up head, we are dealing with randomness. If it is a fair coin, then there is a 50/50 chance of the next flip coming up head, and no matter how many times we repeat the experiment the uncertainty about the outcome is the same.
I'm not going to write much more about this, since you are better off reading the papers anyway, but it got me thinking about uncertainty in flipping (un-)fair coins and that is what this post and the next is about.
The stuff below doesn't really relate to epistemic or aleatory uncertainty, so you don't have to read the papers to read on here. It is just two different models of coin tossing and having some fun with that...
Unknown bias
First let us consider a scenario where we have a coin that we do not necessarily know if is fair or not, but we want to figure out the probability of tossing head. So the model of the system is a simple Bernoulli distribution where we have a single parameter \(p\) and where we toss head with probability \(p\) and tail with probability \(1-p\).
If we toss the coin \(n\) times the likelihood of getting \(s\) heads is binomially distributed and the likelihood is proportional to \(p^s(1-p)^{n-s}\). The maximum likelihood estimate of \(p\) is \(s/n\).
You can play around with that yourself with a bit of R code as below:
p <- 0.39
ns <- seq(1,100)
est <- sapply(ns, function(n) rbinom(1,n,p)/n)
opar <- par(mar=c(5,4,2,4)+0.1)
plot(ns, est, ylim=c(0,1),
ylab='Maximum likelihood estimates (s/n)',
xlab='Number of tosses')
abline(h=p)
mtext(paste(" p = ", p), at=p, side=4, las=1)
par(opar)

The parameter is fixed - in the example above as 0.39 - and the entire uncertainty is in the estimate. That uncertainty decreases with the number of tosses - if you want to you could play around with confidence intervals and see this. I won't do this here but instead use credible intervals since I used confidence intervals in the previous post.
Anyway, as I said, the parameter is fixed but unknown. In a "frequentist" analysis we would really consider the parameter fixed and consider the estimate random. In a "Bayesian" analysis, on the other hand, we would still consider the parameter random. We would just learn more and more of the distribution of it as we toss the coin.
Whether you can consider an unknown but fixed parameter (uncertainty due to lack of knowledge) as a stochastic variable (uncertainty due to randomness) is a philosophical question more than a mathematical one, and you are welcome to join in on the frequentist vs Bayesianist fight any day, I am sure.
I'm going to do a Bayesian analysis here, since I did a frequentist analysis in the confidence interval post, but I'll be back to a frequentist analysis in the next post.
Priors and posteriors
Ok, so to analyse this system in a Bayesian way, we need to model the parameter \(p\) as a stochastic variable. We do this by giving it a prior distribution \(\pi(p)\). This should capture the uncertainty we have about \(p\) before we start tossing the coin.
In a frequentist analysis we have no way of quantifying our intuition about \(p\), such as we expect it to be closer to 0.5 than to 0 or 1.
On the other hand, because we need to provide a prior, we have no way in a pure Bayesian analysis of claiming complete prior ignorance. There is a concept of "uninformative priors" or "reference priors", but claiming ignorance in Bayesian analysis is not as easy as you might think. Anyway, that is way beyond the scope of this blog post, so I'll be on with the analysis.
For mathematical convinience we prefer something called conjugate priors, and for Bernoulli distributions that is the beta distribution. Don't worry about what a conjugate prior is, it is not essential for the analysis here.
The beta distribution is parameterised with two "hyper parameters" \(\alpha\) and \(\beta\) and has density \(\pi(p)\propto p^{\alpha-1}(1-p)^{\beta-1}\) with normalisation constant \(1/B(\alpha,\beta)\) where of course
\[B(\alpha,\beta)=\int_0^1 p^{\alpha-1}(1-p)^{\beta-1}\,\mathrm{d}p.\]
Depending on \(\alpha\) and \(\beta\) the beta distribution can take various shapes. Whenever \(\alpha=\beta\) the density is symmetric and (when \(\alpha=\beta\geq 1\)) with mode at 0.5, so if your prior belief is that coins are likely to be unbiased, then you want equal \(\alpha\) and \(\beta\). The greater the parameters, the more "pointy" the density is around the mode, so the higher \(\alpha\) and \(\beta\) the stronger you make your prior belief about the distribution of \(p\). With \(\alpha=\beta=1\) you get a uniform distribution.
You can try it out with this R code
plot.beta <- function(a,b) {
p <- seq(0,1,length.out=100)
d <- dbeta(p,a,b)
plot(p,d,
main=substitute(paste(alpha==a, " and ", beta==b),
list(a=a,b=b)),
type='l', xlab='', ylab='')
}
opar <- par(mfrow=c(2,2), mar=c(2,2,2,2)+0.1)
plot.beta(1,1)
plot.beta(2,2)
plot.beta(4,3)
plot.beta(2,6)
par(opar)

The prior distribution is capturing the knowledge we have about \(p\) (or think we have about it) before we do any experiments by tossing the coin. Once you've picked one that captures your intuition about how you would expect \(p\) to be, the trick is to use the experiments to change the distribution of \(p\) to now capture both your prior expectation and the knowledge you have gained about \(p\) from doing your experiments.
Let \(\mathrm{lhd}(s,n;p)\propto p^s(1-p)^{n-s}\) denote the likelihood of \(p\) when we observe \(s\) heads out of \(n\) tosses, i.e. \(\mathrm{lhd}(s,n;p)\) is the probability of \(s\) heads out of \(n\) tosses conditional on \(p\). The joint probability of \(s\), \(n\) and \(p\) is then \(\mathrm{lhd}(s,n;p)\pi(p)\) and the posterior probability for \(p\), \(P(p)\) is proportional to this \(P(p)\propto \mathrm{lhd}(s,n;p)\pi(p)\). This is always the case; the posterior is always proportional to the prior times the likelihood, a result you might know as Bayes' rule.
The posterior probability is interesting because it gives you a distribution for \(p\) that depends on the results of your experiment.
In many way this posterior does what the confidence intervals does in frequentist analysis. It captures the parameter estimate and the uncertainty in it. The difference here is that in the Bayesian approach the parameter really is considered a random parameter, distributed as the prior before the experiment and as the posterior after the experiment.
Okay, since \(\mathrm{lhd}(s,n;p)\propto p^s(1-p)^{n-s}\) and \(\pi(p)\propto p^{\alpha-1}(1-p)^{\beta-1}\) the posterior is \(P(p) \propto p^s(1-p)^{n-s}p^{\alpha-1}(1-p)^{\beta-1}=p^{s+\alpha-1}(1-p)^{n-s+\beta-1}\).
So if we start with a \(\mathrm{beta}(\alpha,\beta)\) prior and observe \(s\) heads in \(n\) trials, we end up with a \(\mathrm{beta}(\alpha+s,\beta+n-s)\) posterior. It is this nice relationship that makes the beta distribution a conjugate prior, in case you are wondering.
When coming up with the prior, you can therefore think of \(\alpha\) and \(\beta\) as "pseudo counts" of heads and tails. Results you never actually saw, but they behave just as counts of head and tail.
We can try out the Bayesian analysis like this:
plot.analysis <- function(a,b,p,n) {
# get prior
x <- seq(0,1,length.out=1000)
prior <- dbeta(x,a,b)
# perform "experiment"
s <- rbinom(1,n,p)
# get likelihood
lik <- dbeta(x,s,n-s)
# get posterior
post <- dbeta(x,a+s,b+n-s)
# plot results...
plot(x, post, main='Prior, likelihood and posterior',
type='l', xlab='', ylab='')
lines(x, lik, col='blue')
lines(x, prior, col='red')
abline(v=p, lty='dashed')
}
opar <- par(mfrow=c(2,2), mar=c(2,2,2,2)+0.1)
plot.analysis(2,2,0.39,5)
plot.analysis(2,2,0.39,10)
plot.analysis(2,2,0.39,50)
plot.analysis(2,2,0.39,100)
par(opar)
Here I'm plotting the prior - where I've chosen a beta(2,2) prior - in red, the likelihood in blue and the posterior in black.

When we have only done a few experiments, the prior will dominate and we see the posterior as somewhere in between the likelihood and the prior. With many experiments, the likelihood dominates and the posterior resembles it more and more. This is as it should be, considering the the more data we have from experimenting, the more we should trust the experiments over our previous intuition.
Credible intervals
With a posterior in hand, we have all the information about our parameter that we can get in a Bayesian framework, but the posterior is not always easy summarise (or visualise in high dimensions) so we can also reduce it to an interval in which the parameter is likely to be.
This is like the confidence intervals in frequentist analysis, but this time the parameter really is stochastic so we have a fixed interval, the credible interval, where there is a, say, 95% chance that the parameter falls in.
To get the 95% credible interval, we just get the 0.025 and 0.975 quantiles of the posterior.
qbeta(c(0.025,0.975),a+s,b+n-s)
If the experimental results were really fixed in stone (so the number of heads and number of tails were fixed) and the coin bias random, this would be a fixed interval where the parameter \(p\) would fall within in 95% of the cases.
# data -- assumed fixed forever... n <- 100 s <- 47 # prior a <- 2 ; b <- 2 # credible interval CI <- qbeta(c(0.025,0.975), a+s, b+n-s) # samples of p ps <- rbeta(100, a+s, b+n-s) plot(ps, ylim=c(0,1), main='Credible interval and samples of p', ylab='Samples') abline(h=CI, lty='dashed')

Of course, in this case this is really only something we pretend. It is in fact \(p\) that is fixed and we can re-do the experiment whenever we want, and in that case we would see a fixed \(p\) and a random credible interval, so there really is very little difference between credible intervals and confidence intervals in practice...
# experiments p <- 0.39 n <- 100 s <- ns <- seq(1,100, by=5) heads <- sapply(ns, function(n) rbinom(1,n,p)) # prior a <- 2 ; b <- 2 # point estimate modes <- mapply(function(n,s) (a+s)/(n+a+b), ns, heads) # credible interval lower <- mapply(function(n,s) qbeta(0.025,a+s,b+n-s), ns, heads) upper <- mapply(function(n,s) qbeta(0.975,a+s,b+n-s), ns, heads) plot(ns, modes, ylim=c(0,1), main='Point estimates and credible intervals', xlab='Number of tosses', ylab='Estimate') lines(ns, lower) lines(ns, upper) abline(h=p, lty='dashed')

Ok, enough about this setup.
In a later post I will consider an actual random \(p\) parameter and see how we can estimate its distribution from coin tosses. Stay tuned!
--
221-222=-1