Posts Tagged ‘R’

R flashmob project

Friday, September 4th, 2009

The R Flashmob Project:

You are invited to take part in R Flashmob, the project that makes the world a better place by posting helpful questions and answers about the R statistical language to the programmer’s Q & A site stackoverflow.com

The event takes place on September the 8th at stackoverflow.com, follow the link above to find the time in your time zone.

247-255=-8

Uncertainty and randomness, part II

Monday, August 10th, 2009

Ok, so in my previous post I considered tossing a (biased) coin and trying to figure out the probability of heads over tails.  The probability of tossing heads, p is a property of the coin and fixed, it is just unknown, so there will be some randomness in estimating it due to the randomness in coin tosses we use to estimate.

In the Bayesian analysis I used, I treated the parameter p as stochastic, because that is how that kind of analysis works, but it never really was.  In this post I will consider a random p.

I am going to consider an experiment where I am picking coins out of a bag, tossing them and recording whether they are heads or tails, and then throwing away the coin to pick another one.  I think that the coins are biased, but I don’t know how.  Can I figure that out?

Formalizing this mathematically, I am assuming that I am drawing coins each with a p probability of coming up head, where the ps are beta distributed with (unknown) parameters \alpha and \beta.  I get to toss each coin once to see if it turns up heads or tails, and then I have to throw the coin away, so for coin i I have a Bernoulli distribution with parameter p_i.

a <- 3 ; b <- 4
n <- 100
outcomes <- rbinom(n,1,rbeta(1,a,b))
s <- sum(outcomes)

The task is to figure out the distribution of ps, which means estimating hyper parameters \alpha and \beta.

Any ideas?

Doing the same as last time?

In the setup in my last post, we modelled the situation very similarly.  We drew a p at random and then tossed coins and got a distribution for p that way.  Could that work again?

We could pick a prior for p, say \mathrm{beta}(2,2) as last time, calculate the posterior based on the number of heads out of n tosses and see if that tends toward the real distribution of ps.

plot.analysis <- function(prior.a,prior.b ,real.a,real.b, n) {
  x <- seq(0,1,length.out=1000)

  # real p distribution
  real <- dbeta(x,real.a,real.b)

  # get prior
  prior <- dbeta(x,prior.a,prior.b)

  # perform experiment
  s <- sum(rbinom(n,1,rbeta(1,real.a,real.b)))
  # get posterior
  post <- dbeta(x,prior.a+s,prior.b+n-s)

  # plot results...
  plot(x, post, main='Prior, likelihood and posterior',
       type='l', xlab='', ylab='')
  lines(x, real, col='green')
  lines(x, prior, col='red')
}

opar <- par(mfrow=c(2,2), mar=c(2,2,2,2)+0.1)
plot.analysis(2,2, 3,4, 5)
plot.analysis(2,2, 3,4, 10)
plot.analysis(2,2, 3,4, 50)
plot.analysis(2,2, 3,4, 100)
par(opar)

(Ups, the title for these plots are wrong … there is the real distribution of ps but no likelihood … but I really cannot be bothered to re-do the plot right now…)

Ok, clearly that didn’t work.  Why?

For one thing, even if p is stochastic in this setup, there is only one p, but in the setup we are modelling there is one for each coin toss.

By the very nature of the beta distribution, the more heads and tails we connect, the “pointy’er” it will become, so if the real distribution of ps is relatively flat, it will look less and less like it.

A simple maximum likelihood solution

Since all we have to do to estimate the distribution of ps is to estimate the hyper parameters \alpha and \beta, we could just take the maximum likelihood approach.  We just need to work out the likelihood of our model and then maximize it with respect to the two.

If \pi(p;\alpha,\beta) is the distribution of ps, then the likelihood of ps and heads and tails is

\mathrm{lhd}(\alpha,\beta,p_1,\ldots,p_n) = \prod_i p_i^{h_i}(1-p_i)^{1-h_i} \pi(p_i)

where h_i is 1 if the ith coin toss is head and 0 otherwise.

We don’t want to maximize with respect to all the p_is – and with one coin toss for each we don’t have a lot of information to do that anyway – but we can just integrate them out, and since they are independent that gives us

\mathrm{lhd}(\alpha,\beta) = \prod_i \int p_i^{h_i}(1-p_i)^{1-h_i} \pi(p_i;\alpha,\beta) \,\mathrm{d}p_i

or

\log\mathrm{lhd}(\alpha,\beta) = s \log \int p\pi(p;\alpha,\beta) \,\mathrm{d}p + (n-s)\log \int (1-p)\pi(p;\alpha,\beta) \,\mathrm{d}p

where n is the number of coin tosses and s the number of heads.

With a beta prior for p

\pi(p;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}

where B(\alpha,\beta)=\int p^{\alpha-1}(1-p)^{\beta-1} \,\mathrm{d}p we get

\int p\pi(p;\alpha,\beta) \,\mathrm{d}p = B(\alpha+1,\beta)/B(\alpha,\beta)

and

\int (1-p)\pi(p;\alpha,\beta) \,\mathrm{d}p = B(\alpha,\beta+1)/B(\alpha,\beta)

so the log-likelihood is

\log\mathrm{lhd}(\alpha,\beta) = s\log B(\alpha+1,\beta) + (n-s)\log B(\alpha,\beta+1) - n\log B(\alpha,\beta).

You can probably optimize this with respect to \alpha and \beta analytically, but I’m just going to use a numerical optimisation to try it out:

a <- 3 ; b <- 4
n <- 100
s <- sum(rbinom(n,1,rbeta(1,a,b)))
logL <- function(a,b, n,s) {
  s*lbeta(a+1,b) + (n-s)*lbeta(a,b+1) - n*lbeta(a,b)
}
search <- nlm(function (x) -logL(x[1],x[2],n,s), c(3,4))
mle.a <- search$estimate[1]
mle.b <- search$estimate[2]
opar <- par(mfrow=c(1,2))
xs <- seq(1,10,length.out=500)
ys <- sapply(xs, function(x) logL(x,4,n,s))
plot(xs,ys,type='l',xlab=expression(alpha),main='Likelihood',ylab='logL')
abline(v=mle.a)
mtext(mle.a,at=mle.a)
ys <- sapply(xs, function(x) logL(3,x,n,s))
plot(xs,ys,type='l',xlab=expression(beta),main='Likelihood',ylab='logL')
abline(v=mle.b)
mtext(mle.b,at=mle.b)
par(opar)

and we can plot the true p density to compare it with the inferred:

x <- seq(0,1,length.out=100)
true <- dbeta(x,3,4)
inferred <- dbeta(x,mle.a,mle.b)
plot(x,true,type='l',col='red',
     main='True and inferred distribution',
     xlab='', ylab='',
     ylim=range(true,inferred))
lines(x,inferred, col='blue')

You might get different results here; when I experimented with it I didn’t always get results this close to the true values, but I picked a nice result for the plots here…

222-223=-1

Uncertainty and randomness, part I

Sunday, August 9th, 2009

When I wrote about confidence intervals the other day, I was actually inspired to think about it from two papers from here:

The topic of those two papers is two kinds of uncertainty: epistemic uncertainty and aleatory uncertainty, where the former is uncertainty due to lack of knowledge while the latter is uncertainty due to randomness.

If we’re trying to estimate a parameter as in the confidence interval post, we are dealing with lack of knowledge.  The parameter is assumed fixed and not random at all, but we don’t know what it is so there is some uncertainty in our estimate.  The more data we have available, the better we can estimate the parameter, so it is possible for us to reduce the uncertainty by increasing our knowledge.

If we want to predict if the toss of a fair coin comes up head, we are dealing with randomness.  If it is a fair coin, then there is a 50/50 chance of the next flip coming up head, and no matter how many times we repeat the experiment the uncertainty about the outcome is the same.

I’m not going to write much more about this, since you are better off reading the papers anyway, but it got me thinking about uncertainty in flipping (un-)fair coins and that is what this post and the next is about.

The stuff below doesn’t really relate to epistemic or aleatory uncertainty, so you don’t have to read the papers to read on here.  It is just two different models of coin tossing and having some fun with that…

Unknown bias

First let us consider a scenario where we have a coin that we do not necessarily know if is fair or not, but we want to figure out the probability of tossing head.  So the model of the system is a simple Bernoulli distribution where we have a single parameter p and where we toss head with probability p and tail with probability 1-p.

If we toss the coin n times the likelihood of getting s heads is binomially distributed and the likelihood is proportional to p^s(1-p)^{n-s}.  The maximum likelihood estimate of p is s/n.

You can play around with that yourself with a bit of R code as below:

p <- 0.39
ns <- seq(1,100)
est <- sapply(ns, function(n) rbinom(1,n,p)/n)

opar <- par(mar=c(5,4,2,4)+0.1)
plot(ns, est, ylim=c(0,1),
     ylab='Maximum likelihood estimates (s/n)',
     xlab='Number of tosses')
abline(h=p)
mtext(paste(" p = ", p), at=p, side=4, las=1)
par(opar)

The parameter is fixed – in the example above as 0.39 – and the entire uncertainty is in the estimate.  That uncertainty decreases with the number of tosses – if you want to you could play around with confidence intervals and see this.  I won’t do this here but instead use credible intervals since I used confidence intervals in the previous post.

Anyway, as I said, the parameter is fixed but unknown.  In a “frequentist” analysis we would really consider the parameter fixed and consider the estimate random.  In a “Bayesian” analysis, on the other hand, we would still consider the parameter random.  We would just learn more and more of the distribution of it as we toss the coin.

Whether you can consider an unknown but fixed parameter (uncertainty due to lack of knowledge) as a stochastic variable (uncertainty due to randomness) is a philosophical question more than a mathematical one, and you are welcome to join in on the frequentist vs Bayesianist fight any day, I am sure.

I’m going to do a Bayesian analysis here, since I did a frequentist analysis in the confidence interval post, but I’ll be back to a frequentist analysis in the next post.

Priors and posteriors

Ok, so to analyse this system in a Bayesian way, we need to model the parameter p as a stochastic variable.  We do this by giving it a prior distribution \pi(p).  This should capture the uncertainty we have about p before we start tossing the coin.

In a frequentist analysis we have no way of quantifying our intuition about p, such as we expect it to be closer to 0.5 than to 0 or 1.

On the other hand, because we need to provide a prior, we have no way in a pure Bayesian analysis of claiming complete prior ignorance.  There is a concept of “uninformative priors” or “reference priors”, but claiming ignorance in Bayesian analysis is not as easy as you might think.  Anyway, that is way beyond the scope of this blog post, so I’ll be on with the analysis.

For mathematical convinience we prefer something called conjugate priors, and for Bernoulli distributions that is the beta distribution.  Don’t worry about what a conjugate prior is, it is not essential for the analysis here.

The beta distribution is parameterised with two “hyper parameters” \alpha and \beta and has density \pi(p)\propto p^{\alpha-1}(1-p)^{\beta-1} with normalisation constant 1/B(\alpha,\beta) where of course

B(\alpha,\beta)=\int_0^1 p^{\alpha-1}(1-p)^{\beta-1}\,\mathrm{d}p.

Depending on \alpha and \beta the beta distribution can take various shapes.  Whenever \alpha=\beta the density is symmetric and (when \alpha=\beta\geq 1) with mode at 0.5, so if your prior belief is that coins are likely to be unbiased, then you want equal \alpha and \beta.  The greater the parameters, the more “pointy” the density is around the mode, so the higher \alpha and \beta the stronger you make your prior belief about the distribution of p.  With \alpha=\beta=1 you get a uniform distribution.

You can try it out with this R code

plot.beta <- function(a,b) {
  p <- seq(0,1,length.out=100)
  d <- dbeta(p,a,b)
  plot(p,d,
       main=substitute(paste(alpha==a, " and ", beta==b),
                       list(a=a,b=b)),
       type='l', xlab='', ylab='')
}
opar <- par(mfrow=c(2,2), mar=c(2,2,2,2)+0.1)
plot.beta(1,1)
plot.beta(2,2)
plot.beta(4,3)
plot.beta(2,6)
par(opar)

The prior distribution is capturing the knowledge we have about p (or think we have about it) before we do any experiments by tossing the coin.  Once you’ve picked one that captures your intuition about how you would expect p to be, the trick is to use the experiments to change the distribution of p to now capture both your prior expectation and the knowledge you have gained about p from doing your experiments.

Let \mathrm{lhd}(s,n;p)\propto p^s(1-p)^{n-s} denote the likelihood of p when we observe s heads out of n tosses, i.e. \mathrm{lhd}(s,n;p) is the probability of s heads out of n tosses conditional on p.  The joint probability of s, n and p is then \mathrm{lhd}(s,n;p)\pi(p) and the posterior probability for p, P(p) is proportional to this P(p)\propto \mathrm{lhd}(s,n;p)\pi(p).  This is always the case; the posterior is always proportional to the prior times the likelihood, a result you might know as Bayes’ rule.

The posterior probability is interesting because it gives you a distribution for p that depends on the results of your experiment.

In many way this posterior does what the confidence intervals does in frequentist analysis.  It captures the parameter estimate and the uncertainty in it.  The difference here is that in the Bayesian approach the parameter really is considered a random parameter, distributed as the prior before the experiment and as the posterior after the experiment.

Okay, since \mathrm{lhd}(s,n;p)\propto p^s(1-p)^{n-s} and \pi(p)\propto p^{\alpha-1}(1-p)^{\beta-1} the posterior is P(p) \propto p^s(1-p)^{n-s}p^{\alpha-1}(1-p)^{\beta-1}=p^{s+\alpha-1}(1-p)^{n-s+\beta-1}.

So if we start with a \mathrm{beta}(\alpha,\beta) prior and observe s heads in n trials, we end up with a \mathrm{beta}(\alpha+s,\beta+n-s) posterior.  It is this nice relationship that makes the beta distribution a conjugate prior, in case you are wondering.

When coming up with the prior, you can therefore think of \alpha and \beta as “pseudo counts” of heads and tails.  Results you never actually saw, but they behave just as counts of head and tail.

We can try out the Bayesian analysis like this:

plot.analysis <- function(a,b,p,n) {

  # get prior
  x <- seq(0,1,length.out=1000)
  prior <- dbeta(x,a,b)

  # perform "experiment"
  s <- rbinom(1,n,p)

  # get likelihood
  lik <- dbeta(x,s,n-s)

  # get posterior
  post <- dbeta(x,a+s,b+n-s)

  # plot results...
  plot(x, post, main='Prior, likelihood and posterior',
       type='l', xlab='', ylab='')
  lines(x, lik, col='blue')
  lines(x, prior, col='red')
  abline(v=p, lty='dashed')
}

opar <- par(mfrow=c(2,2), mar=c(2,2,2,2)+0.1)
plot.analysis(2,2,0.39,5)
plot.analysis(2,2,0.39,10)
plot.analysis(2,2,0.39,50)
plot.analysis(2,2,0.39,100)
par(opar)

Here I’m plotting the prior – where I’ve chosen a beta(2,2) prior – in red, the likelihood in blue and the posterior in black.

When we have only done a few experiments, the prior will dominate and we see the posterior as somewhere in between the likelihood and the prior.  With many experiments, the likelihood dominates and the posterior resembles it more and more.  This is as it should be, considering the the more data we have from experimenting, the more we should trust the experiments over our previous intuition.

Credible intervals

With a posterior in hand, we have all the information about our parameter that we can get in a Bayesian framework, but the posterior is not always easy summarise (or visualise in high dimensions) so we can also reduce it to an interval in which the parameter is likely to be.

This is like the confidence intervals in frequentist analysis, but this time the parameter really is stochastic so we have a fixed interval, the credible interval, where there is a, say, 95% chance that the parameter falls in.

To get the 95% credible interval, we just get the 0.025 and 0.975 quantiles of the posterior.

qbeta(c(0.025,0.975),a+s,b+n-s)

If the experimental results were really fixed in stone (so the number of heads and number of tails were fixed) and the coin bias random, this would be a fixed interval where the parameter p would fall within in 95% of the cases.

# data  -- assumed fixed forever...
n <- 100
s <- 47

# prior
a <- 2 ; b <- 2

# credible interval
CI <- qbeta(c(0.025,0.975), a+s, b+n-s)

# samples of p
ps <- rbeta(100, a+s, b+n-s)

plot(ps, ylim=c(0,1), main='Credible interval and samples of p',
     ylab='Samples')
abline(h=CI, lty='dashed')

Of course, in this case this is really only something we pretend.  It is in fact p that is fixed and we can re-do the experiment whenever we want, and in that case we would see a fixed p and a random credible interval, so there really is very little difference between credible intervals and confidence intervals in practice…

# experiments
p <- 0.39
n <- 100
s <-
ns <- seq(1,100, by=5)
heads <- sapply(ns, function(n) rbinom(1,n,p))

# prior
a <- 2 ; b <- 2

# point estimate
modes <- mapply(function(n,s) (a+s)/(n+a+b), ns, heads)

# credible interval
lower <- mapply(function(n,s) qbeta(0.025,a+s,b+n-s), ns, heads)
upper <- mapply(function(n,s) qbeta(0.975,a+s,b+n-s), ns, heads)

plot(ns, modes, ylim=c(0,1),
     main='Point estimates and credible intervals',
     xlab='Number of tosses', ylab='Estimate')
lines(ns, lower)
lines(ns, upper)
abline(h=p, lty='dashed')

Ok, enough about this setup.

In a later post I will consider an actual random p parameter and see how we can estimate its distribution from coin tosses.  Stay tuned!

221-222=-1

True positives and false positives

Thursday, July 2nd, 2009

Following up on my rant on p-values, I want to say something about true and false positives in a classical statistical hypothesis test.

Such a test works as follows: we observe a value – assumed to be from some stochastic distribution – and calculate a p-value.  We then check if that p-value is below some significance level – typically 5% – and if it is we say that it is “significant” (or a positive) and if it is not we say that it is “not significant” (or a negative).

That is all there is to it.

Now, the p-values says something about the probability of false positives, those observations that are really sampled from the null distribution but are judged positives by the test.  With a 5% significance level, we expect that 5% of values truly sampled from the null distribution will be significant.  In other words, if we make a number of observations, and all our observations are from the null distribution, then we should get positives for about 5% of our tests.

Of course we do not really believe that all our observations are from the null distribution.  If we did, we wouldn’t make any tests.  What we actually believe is that we have some mixture of values from either the null distribution, P_0 or some alternative distribution, P_1, so our values are drawn from \pi_0P_0 + \pi_1P_1 where \pi_0 is the probability that the sample is from the null distribution and \pi_1=1-\pi_0 is the probability that it is from the alternative distribution.

An example: association mapping

In association mapping we test genetic markers, to figure out if any marker is associated with some a disease.  We do not believe that all markers are associated with the disease, nor do we believe that none of them are.  In the mixture above, \pi_0 would be the (prior) probability that any given marker is not associated with the disease, while \pi_1 would be the probability that it is associated with the disease.

We will simplify the setup a bit.  We will consider a marker where one allele is found in 40% of the population, say that those without that marker have a fixed disease risk while those with the marker has an increased risk, a genetic relative risk, GRR, such that if the wild types have risk r then the mutants (those with the 40% allele) have the risk GRR\cdot r.  We sample N individuals, categorise them according to allele and according to disease, and then make a \chi^2 test for association.

create.sample <- function(N, mutant.freq, wt.risk, GRR) {
  # Sample N individuals where N*mutant.freq are mutants.  Assign
  # cases and controls status based on the wildtype risk wt.risk and
  # the mutant risk based on the genetic relative risk GRR*wt.risk.
  mutant.status <- runif(N) < mutant.freq
  disease.status <- sapply(mutant.status,
                           function (m)
                           ifelse(m,
                                  runif(1) < GRR*wt.risk,
                                  runif(1) < wt.risk))
  ftable(mutant.status, disease.status)
}

data <- create.sample(2000,0.4,0.05,2)
chisq.test(data)

So here I sample 2000 individuals, say that the wildtype risk is 5% and that the mutants have a GRR of 2, so twice the risk of the disease.

Since this sample has a GRR of 2 – twice the risk for mutants – we are sampling from the alternative distribution.  So if the p-value is significant we have a true positive – a signal where there should be one – while if the p-value is not significant we have a false negative – no signal where there should be one.

If, instead, we sample with a GRR of 1 – the mutant risk is the same as the wild type – a significant value would be a false negative while a non-significant p-value would be a true negative.

With me so far?  Good.

Now we can set up the mixture from above.

sample.pvalue <- function(N,mutant.freq,wt.risk,GRR) {
  chisq.test(create.sample(N,mutant.freq,wt.risk,GRR))$p.value
}

sample.mixture <- function(true.risk, n, N, mutant.freq, wt.risk, GRR) {
  associated <- runif(n) < true.risk
  p.values <- sapply(associated,
                     function(t)
                     ifelse(t,
                            sample.pvalue(N,mutant.freq,wt.risk,GRR),
                            sample.pvalue(N,mutant.freq,wt.risk,1)))
  significant <- p.values < 0.05
  ftable(significant, true.or.false)
}

The real setup for association mapping is somewhat more complex, since there we never sample new individuals for each marker – so the tests are not independent – and the allele frequency varies for each marker and so on.  No matter, this setup is good enough for a blog.

Sampling true and false positives and negatives

Ok, now we can try sampling from the mixture.

First, we can try \pi_1=0.  That is, we can try sampling from a distribution of markers that are guaranteed not to be associated with the disease.

> sample.mixture(0, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE
significant
FALSE                       964
TRUE                         36

We only get “false” signals – since none of the markers are associated with the disease, but we still get some positives.  False positives, of course.  We expect about 5% – so 50 since we sample 1000 markers.  We get 36 instead of 50, since this is a random process, but you get the point.

If, instead, we sample with \pi_1=1 we only see associated markers, but we still test them, so some of them might end up being classified as negatives.  False negatives, of course.

> sample.mixture(1, 1000, 2000, 0.4, 0.05, 2)
            associated TRUE
significant
FALSE                       10
TRUE                       990

The significance level doesn’t tell you anything about the expected number of false negatives under the alternative distribution.  In a classical hypothesis test, the alternative distribution is completely ignored.  Only the null distribution matters.  This is one of the reasons I don’t particularly like this type of tests, but that is a different story.

The point is just, that even with all values “true” we still see some negatives.

With a mixture of associated and not associated markers, say 50%/50%, we expect both associated and not associated markers and both positives and negatives, of course.

> sample.mixture(0.5, 1000, 2000, 0.4, 0.05, 2)

 associated FALSE TRUE
significant
FALSE                       490    5
TRUE                         20  485

We want to find as many true positives as possible, while getting as few false positives as possible.  We can only see the signifcance status, of course, so we cannot tell which of the positives are true or false.

In the case above, we find as significant 485 of the associated markes – and miss five of them – and we get only 20 false positives.  Not bad, but then, 50% of the markers were true associations to begin with.

What if \pi_1=0.1?

> sample.mixture(0.1, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    876    2
TRUE                      38   84

or \pi_1=0.05?

> sample.mixture(0.05, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant
FALSE                    893    2
TRUE                      49   56

We still do okay at picking up the associated markers.  In both cases we only miss two of them (but of course the actual number depends on the randomness in the sample).  We just see fewer of the true associations in the tests.

The ratio of true to false positives changes as well.  We still only expect 5% of the non associated markers to be significant, but there are just more of them now, and if we reduce the frequency of associated markers to 1% we now see more false than true positives:

> sample.mixture(0.01, 1000, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    953    0
TRUE                      43    4

In a real association mapping study, we don’t expect 1% of the markers to be associated with the disease.  If we test a million markers, we don’t expect more than a few tens to a few hundreds to really be true associations, so \pi_1 is pretty low indeed.

With a significance value of 5% the true positives would completely drown in the sea of false positives.

Power

Formally, statistical power is the probability of getting a significant p-value when the sample is from the alternative distribution.  It is the dual to the significance value.

The significance value doesn’t say anything about what happens to observations from the alternative distribution.  The power doesn’t say anything about what happens to values from the null distribution.

It just tells you how likely it is that you detect values from the alternative distribution as significant.

In our example above, the power is pretty good.  We recognize pretty much all of the associated markers as significant.

With a sample size of 2000, a high risk allele frequency and a relative risk of 2, we get a pretty strong signal.

Our only real problem is that we still detect a lot more false associations than true associations.

Boosting the power, which essentially amounts to increasing the sample size since that is the only variable we are in control of, doesn’t help on this at all!

If you increase the sample size, you still get 5% of the false associations as significant.  That is just how the significance value works.

The only way to reduce the number of false positives is to lower the significance value.  If you pick 1% you only get 1% of the false associations as significant.  With 0.1% you only get 0.1%.

This is why we do multiple test correction.

Multiple test correction

Multiple test correction really just means lowering the significance threshold.  There are different ways of doing it, but it pretty much all amounts to figuring out how much to reduce the significance threshold down to a level where you expect few false positives.

Let’s try it out.  We update our code

sample.mixture <- function(true.risk, n, sig.level, N, mutant.freq, wt.risk, GRR) {
  associated <- runif(n) < true.risk
  p.values <- sapply(associated,
                     function(t)
                     ifelse(t,
                            sample.pvalue(N,mutant.freq,wt.risk,GRR),
                            sample.pvalue(N,mutant.freq,wt.risk,1)))
  significant <- p.values < sig.level
  ftable(significant, associated)
}

and sample again.  Say with a significance level of 0.01%:

> sample.mixture(0.01, 1000, 0.0001, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    989    6
TRUE                       0    5

We reduce the number of false positives – which is good – but we also reduce the number of true positives – which is bad.

If we reduce the significance value, we also make it harder for samples from the alternative distribution to get p-values below the threshold.  We reduce the power.

The example above is not actually that bad.  We still find half of the true associations.  But then, a significance level of 0.01% is actually a bit high for a genome wide association study.  If, there, we test 1 million markers, and are willing to get around 1 false positives, we should use a significance value of one in a million.

> sample.mixture(0.01, 1000, 1e-6, 2000, 0.4, 0.05, 2)
            associated FALSE TRUE
significant                     
FALSE                    986   12
TRUE                       0    2

The more tests you make, the lower a significance value you need, if you want to keep the expected number of false positives constant.

Of course, what you really want is not to keep the number of false positives fixed but rather to get a good ratio of false to true positives, but I’ll leave that for another post.

The point here is just that if you want to keep the number of false positives down, you will end up reducing your power as well.

Of course, now you have a reason for increasing the sample size.  That will make the alternative distribution less similar to the null distribution and therefor more likely to get small p-values.  It still won’t do anything to the distribution of p-values from the null distribution, but it will change the distribution of p-values from the alternative distribution.

I wanted to say something about the case where the “actual” null distribution is not really the mathematical null distribution, but this post is getting pretty long already, so I think I’ll leave that for another post, so stay tuned…

183-186=-3

Minimisation

Sunday, June 7th, 2009

Just for something to do on a lazy Sunday afternoon – where the only “excitement” was voting for the EU parliament (and the law concerning the succession to the Danish throne that won’t be relevant for the next two generations) – I played with R a little bit.

I should be reading three Master’s theses (I’m censor on three defenses later this month) but I got to bed too late yesterday and feel too tired and lazy to do a proper job on that…

Anyway, I decided to play with the gradient descent method for function minimisation. In my machine learning class I told my students that they should find a library for minimising functions rather than the simple gradient descent method, ’cause that is what I would normally do, but never being too keen on practising what I preach, I decided to experiment a bit with it.

Just for a function of a single value, to keep it simple, but still.

I picked the function f(x) = -x^3 + 10x^2 + 3x – that’s as good as any and I can analytically find the minima so I can easily check the results.  Well, one minimum is minus infinity, but I am not interested in that.

To get that minima, just set the derivative f'(x) = -3x^2 + 20x + 3 to zero and pick the negative solution (the positive is the local maximum).

> f <- function(x) -x**3 + 10*x**2 + 3*x
> df <- function(x) -3*x**2 + 20*x + 3
> roots <- polyroot(c(3,20,-3))
> roots
[1] -0.1467688+0i  6.8134355+0i

So with that out of the way, we can get started.

The Newton-Raphson method

The gradient descent method is a simplification of the Newton-Raphson method. The idea behind the latter is to do a Taylor expansion of f(x) around some local minimum x', f(x') \approx f(x)+f'(x)(x'-x) + \frac{1}{2}f''(x)(x'-x)^2 where we cut the expansion short after the quadratic term.

Take the derivative, with respect to x’-x, on both sides 0 \approx f'(x) + f''(x)(x'-x) and we get x' \approx x - \frac{f'(x)}{f''(x)}.  The goal now is to replace \approx with =, that is find x=x', and we do that iteratively, starting in some candidate x_0 and then at iteration i set x_i = x_{i-1}-\frac{f'(x_{i-1})}{f''(x_{i-1})}.

Repeat until you reach a fixed point, or at least until x_i is reasonably close to x_{i-1} or f'(x_i) is reasonably close to zero.

In R:

NewtonRaphson <- function(x0, df, ddf, n=100) {
  steps <- matrix(0, nr=n, nc=3)
  steps[1,] <- c(x0, df(x0), ddf(x0))
  x <- x0
  for (i in 2:n) {
    x <- x - df(x)/ddf(x)
    steps[i,] <- c(x, df(x), ddf(x))
  }
  steps
}

where I just run for n iterations and keep track of the derivatives along the way so I can inspect the results later.

This method has the nice property that if you start close to a minimum, the distance to that minimum drops off quadratically as a function of iterations, so it converges pretty quickly.

> ddf <- function(x) -6*x + 20
> nr.steps <- NewtonRaphson(0, df, ddf)
> head(nr.steps)
           [,1]          [,2]     [,3]
[1,]  0.0000000  3.000000e+00 20.00000
[2,] -0.1500000 -6.750000e-02 20.90000
[3,] -0.1467703 -3.129221e-05 20.88062
[4,] -0.1467688 -6.737721e-12 20.88061
[5,] -0.1467688  4.440892e-16 20.88061
[6,] -0.1467688 -8.881784e-16 20.88061

In this case, we are converged after only four steps if we start in x_0=0.

The gradient descent method

The gradient descent method doesn’t use the second derivative, but replaces \frac{1}{f''(x)} with a constant \lambda. Useful if you don’t have an expression for the second derivative (or if it is costly to compute).

The iteration goes like x_i = x_{i-1}-\lambda f'(x_{i-1}) and the R code looks like this:

GradientDescent <- function(x0, df, lambda, n=100) {
  steps <- matrix(0, nr=n, nc=2)
  steps[1,] <- c(x0, df(x0))
  x <- x0
  for (i in 2:n) {
    x <- x - lambda*df(x)
    steps[i,] <- c(x, df(x))
  }
  steps
}

From the Newton-Raphson run above, we saw that f''(x) was pretty close to 20, so we can try running with \lambda=1/20:

> gd.steps <- GradientDescent(0, df, 1/20)
> head(gd.steps)
           [,1]          [,2]
[1,]  0.0000000  3.000000e+00
[2,] -0.1500000 -6.750000e-02
[3,] -0.1466250  3.003328e-03
[4,] -0.1467752 -1.321765e-04
[5,] -0.1467686  5.819939e-06
[6,] -0.1467688 -2.562555e-07

where again we converge pretty quickly.

Of course, if we don’t know f''(x) close to the minimum, or if it varies significantly around the minimum, we might not be so lucky that we can pick a good \lambda.

and if we take too large steps, say \lambda=1/10 we might not converge at all but keep over-shooting in the update.

One possible solution is to use a small \lambda but add a momentum to the iteration step, so it becomes x_i=x_{i-1}+m_i where m_i=-\lambda f'(x) + \mu m_{i-1} and where 0\leq\mu\leq 1.  For \mu=0 this is just the plain old gradient descent method, but for \mu>0 the idea is that if we take a large step, the next one should probably also be large so the momentum increases, and when we start making smaller steps it decreases again.

GradientDescentMomentum <- function(x0, df, lambda, mu, n=50) {
  steps <- matrix(0, nr=n, nc=3)
  steps[1,] <- c(x0, df(x0), 0)
  x <- x0
  m <- 0
  for (i in 2:n) {
    m <- -lambda*df(x) + mu*m
    x <- x + m
    steps[i,] <- c(x, df(x), m)
  }
  steps
}

Of course, now we are left with two rather than one parameter to the method that we need to come up with a guess for.

We still have the problem that we won’t converge if we make a guess for \lambda that is too large, and in my experiments we do not gain a whole lot from the momentum if we pick it too small either.

Plus the \mu term has to be chosen carefully as well

so I’m a bit underwhelmed by this idea.

Ok, enough computer time for me today, I need to rest my RSI, so I’ll stop here…

158-161=-3