Just for something to do on a lazy Sunday afternoon – where the only “excitement” was voting for the EU parliament (and the law concerning the succession to the Danish throne that won’t be relevant for the next two generations) – I played with R a little bit.
I should be reading three Master’s theses (I’m censor on three defenses later this month) but I got to bed too late yesterday and feel too tired and lazy to do a proper job on that…
Anyway, I decided to play with the gradient descent method for function minimisation. In my machine learning class I told my students that they should find a library for minimising functions rather than the simple gradient descent method, ’cause that is what I would normally do, but never being too keen on practising what I preach, I decided to experiment a bit with it.
Just for a function of a single value, to keep it simple, but still.
I picked the function
– that’s as good as any and I can analytically find the minima so I can easily check the results. Well, one minimum is minus infinity, but I am not interested in that.

To get that minima, just set the derivative
to zero and pick the negative solution (the positive is the local maximum).
> f <- function(x) -x**3 + 10*x**2 + 3*x
> df <- function(x) -3*x**2 + 20*x + 3
> roots <- polyroot(c(3,20,-3))
> roots
[1] -0.1467688+0i 6.8134355+0i
So with that out of the way, we can get started.
The Newton-Raphson method
The gradient descent method is a simplification of the Newton-Raphson method. The idea behind the latter is to do a Taylor expansion of
around some local minimum
,
where we cut the expansion short after the quadratic term.
Take the derivative, with respect to x’-x, on both sides
and we get
. The goal now is to replace
with
, that is find
, and we do that iteratively, starting in some candidate
and then at iteration
set
.
Repeat until you reach a fixed point, or at least until
is reasonably close to
or
is reasonably close to zero.
In R:
NewtonRaphson <- function(x0, df, ddf, n=100) {
steps <- matrix(0, nr=n, nc=3)
steps[1,] <- c(x0, df(x0), ddf(x0))
x <- x0
for (i in 2:n) {
x <- x - df(x)/ddf(x)
steps[i,] <- c(x, df(x), ddf(x))
}
steps
}
where I just run for
iterations and keep track of the derivatives along the way so I can inspect the results later.
This method has the nice property that if you start close to a minimum, the distance to that minimum drops off quadratically as a function of iterations, so it converges pretty quickly.
> ddf <- function(x) -6*x + 20
> nr.steps <- NewtonRaphson(0, df, ddf)
> head(nr.steps)
[,1] [,2] [,3]
[1,] 0.0000000 3.000000e+00 20.00000
[2,] -0.1500000 -6.750000e-02 20.90000
[3,] -0.1467703 -3.129221e-05 20.88062
[4,] -0.1467688 -6.737721e-12 20.88061
[5,] -0.1467688 4.440892e-16 20.88061
[6,] -0.1467688 -8.881784e-16 20.88061
In this case, we are converged after only four steps if we start in
.
The gradient descent method
The gradient descent method doesn’t use the second derivative, but replaces
with a constant
. Useful if you don’t have an expression for the second derivative (or if it is costly to compute).
The iteration goes like
and the R code looks like this:
GradientDescent <- function(x0, df, lambda, n=100) {
steps <- matrix(0, nr=n, nc=2)
steps[1,] <- c(x0, df(x0))
x <- x0
for (i in 2:n) {
x <- x - lambda*df(x)
steps[i,] <- c(x, df(x))
}
steps
}
From the Newton-Raphson run above, we saw that
was pretty close to 20, so we can try running with
:
> gd.steps <- GradientDescent(0, df, 1/20)
> head(gd.steps)
[,1] [,2]
[1,] 0.0000000 3.000000e+00
[2,] -0.1500000 -6.750000e-02
[3,] -0.1466250 3.003328e-03
[4,] -0.1467752 -1.321765e-04
[5,] -0.1467686 5.819939e-06
[6,] -0.1467688 -2.562555e-07
where again we converge pretty quickly.
Of course, if we don’t know
close to the minimum, or if it varies significantly around the minimum, we might not be so lucky that we can pick a good
.

and if we take too large steps, say
we might not converge at all but keep over-shooting in the update.

One possible solution is to use a small
but add a momentum to the iteration step, so it becomes
where
and where
. For
this is just the plain old gradient descent method, but for
the idea is that if we take a large step, the next one should probably also be large so the momentum increases, and when we start making smaller steps it decreases again.
GradientDescentMomentum <- function(x0, df, lambda, mu, n=50) {
steps <- matrix(0, nr=n, nc=3)
steps[1,] <- c(x0, df(x0), 0)
x <- x0
m <- 0
for (i in 2:n) {
m <- -lambda*df(x) + mu*m
x <- x + m
steps[i,] <- c(x, df(x), m)
}
steps
}
Of course, now we are left with two rather than one parameter to the method that we need to come up with a guess for.
We still have the problem that we won’t converge if we make a guess for
that is too large, and in my experiments we do not gain a whole lot from the momentum if we pick it too small either.
Plus the
term has to be chosen carefully as well

so I’m a bit underwhelmed by this idea.
Ok, enough computer time for me today, I need to rest my RSI, so I’ll stop here…
–
158-161=-3