Why are algorithms other than gradient descent (e.g., BFGS, Newton's method, conjugate gradient, etc.) seemingly always absent from discussions of fitting neural nets? Are there not batch / stochastic versions of these methods? Is it simply ease of understanding / implementation? I seem to only ever see "gradient descent + tricks", and I wonder what the reason for this is.
Minibatches are good for parallelizing computation, especially on GPUs. And batch methods can be less of a fiddle to tune. So the alternatives you mention are popular in some commercial settings, for cost functions with nasty curvature, or to make it easier to write fast code. I guess most people's experience is that the dead simple SGD methods often work well though, and that's why they keep returning to them.
BFGS is an example of a quasi-Newton method. L-BFGS is the popular "low-memory" variant. Any such method can be used with the two references I gave above. One could also attempt to approximate each of its internal computations to within some tolerance with minibatches. Work in that sort of direction here: https://ei.is.tuebingen.mpg.de/publications/mahhen2015
Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I hinted above, the answer may change with the hardware landscape.
> Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I hinted above, the answer may change with the hardware landscape.
I always assumed that when people talk about gradient descent, they actually mean L-BFGS, i.e. a quasi-Newton approximation of the Hessian. After all, it can be used as a black-box algorithm that only needs to be told the gradient. At least in quantum optimization, the "simple" (non-quasi-Newton) gradient approach almost never works, whereas L-BFGS does just fine. So I'm confused why the situation would be different in machine learning, or why one would choose not to use the "for free" enhancement that L-BFGS provides.
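For what it's worth, that black-box usage is nearly a one-liner in practice. A minimal sketch (not from the thread) using scipy.optimize on a made-up least-squares objective; only the function value and its gradient are supplied:

    import numpy as np
    from scipy.optimize import minimize

    # Toy objective f(x) = ||A x - b||^2; A and b are made up for illustration.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 5))
    b = rng.normal(size=20)

    def f(x):
        r = A @ x - b
        return r @ r

    def grad_f(x):
        return 2 * A.T @ (A @ x - b)

    # L-BFGS as a black box: it is told only f and its gradient.
    result = minimize(f, x0=np.zeros(5), jac=grad_f, method="L-BFGS-B")
    print(result.x, result.fun)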
In machine learning "evaluating the gradient" means sweeping over the whole training set. For simple models, stochastic gradient descent will have found a good fit after one or two passes over the dataset. (For neural nets, tens to hundreds of passes over the dataset may be needed.) It's hard for batch L-BFGS to do much in the same number of gradient and function evaluations. No one uses batch steepest gradient descent, which would be the worst of all worlds.
The discussion above was about making stochastic or mini-batch versions of algorithms like L-BFGS. It's possible, but not entirely straightforward, which is why "dumb" stochastic gradient descent is often used.
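For contrast, the "dumb" baseline being described looks roughly like this. A minimal numpy sketch of mini-batch SGD on a made-up least-squares problem; the learning rate and batch size are illustrative choices, not recommendations:

    import numpy as np

    # Hypothetical data: n examples, d features, squared-error loss.
    rng = np.random.default_rng(0)
    n, d = 10_000, 20
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    w = np.zeros(d)
    lr, batch_size = 0.01, 64

    for epoch in range(2):                    # "one or two passes over the dataset"
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch only.
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad

Each update here costs one small batch, whereas a single batch L-BFGS step needs the gradient over all n examples.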
There were some standard, old techniques I didn't see.

For the set of real numbers R, a positive integer n, and a function f: R^n --> R, we seek x in R^n to minimize f(x). Let z in R^n denote the value of x that minimizes f(x).

Borrowing from D. Knuth's mathematical typesetting system TeX, we let x_i denote component i of x; more generally, we use the underscore to denote a subscript.

Assume that the function f is differentiable and let D_x f(x) be the derivative, that is, the gradient, that is, the vector whose component i is the partial derivative of f with respect to component i of x.
(1) Line search. Suppose we have iterations j = 1, 2, ... where x^j is our jth estimate of z. Then, for a positive real number s, our step size, the usual gradient descent update is

    x^{j + 1} = x^j - s D_x f(x^j)

Well, with this approach, it need not be that

    f(x^{j + 1}) < f(x^j)

So, an improvement is, for each iteration j, to do a line search to find the step size s. If f is convex, then there is a simple, standard approach to adjusting s in this line search.
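A minimal numpy sketch of one common choice for that line search, backtracking with the Armijo sufficient-decrease test; the test function and the constants (shrink factor 0.5, constant 1e-4) are illustrative, not from the comment above:

    import numpy as np

    def gradient_descent_backtracking(f, grad_f, x0, iters=100):
        x = x0
        for _ in range(iters):
            g = grad_f(x)
            s = 1.0
            # Shrink the step size until f decreases enough (Armijo test).
            while f(x - s * g) > f(x) - 1e-4 * s * (g @ g):
                s *= 0.5
            x = x - s * g
        return x

    # Example: a simple convex quadratic whose minimizer is the origin.
    D = np.diag([1.0, 10.0])
    f = lambda x: x @ D @ x
    grad_f = lambda x: 2 * D @ x
    print(gradient_descent_backtracking(f, grad_f, np.array([3.0, 1.0])))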
(2) Conjugate gradients. Gradient descent is vulnerable to a lot of movement in directions that are nearly orthogonal to the best direction. The classic technique of conjugate gradients improves this situation; on a quadratic in n variables it reaches the minimizer in at most n iterations.
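For the quadratic case the classic algorithm fits in a few lines. A minimal sketch of linear conjugate gradients, which minimizes (1/2) x^T A x - b^T x for a symmetric positive definite A (the nonlinear variants used for general f follow the same pattern); A and b here are made up for illustration:

    import numpy as np

    def conjugate_gradient(A, b, x0, tol=1e-10):
        x = x0
        r = b - A @ x                      # residual = negative gradient
        p = r.copy()                       # first search direction
        for _ in range(len(b)):            # at most n iterations in exact arithmetic
            Ap = A @ p
            alpha = (r @ r) / (p @ Ap)     # exact line search along p
            x = x + alpha * p
            r_new = r - alpha * Ap
            if np.linalg.norm(r_new) < tol:
                break
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p           # new direction, conjugate to the old ones
            r = r_new
        return x

    rng = np.random.default_rng(0)
    M = rng.normal(size=(5, 5))
    A = M @ M.T + 5 * np.eye(5)            # symmetric positive definite
    b = rng.normal(size=5)
    print(np.allclose(conjugate_gradient(A, b, np.zeros(5)), np.linalg.solve(A, b)))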
(3) Newton iteration. The simple Newton iteration for, say, computing a square root generalizes to minimizing a function of n variables: with H the matrix of second derivatives (the Hessian) of f at x^j, the update is

    x^{j + 1} = x^j - H^{-1} D_x f(x^j)

but, of course, this requires finding the second derivatives of f.
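A minimal sketch of both the scalar case and the n-variable case; the square root target and the quadratic (A, b) are made up for illustration:

    import numpy as np

    # 1-D Newton for sqrt(a): solve g(x) = x^2 - a = 0 via x <- x - g(x)/g'(x),
    # which simplifies to x <- (x + a/x) / 2.
    a, x = 2.0, 1.0
    for _ in range(6):
        x = 0.5 * (x + a / x)
    print(x)                               # ~1.41421356

    # n-D generalization for minimizing f: solve H d = -D_x f(x), then x <- x + d.
    # For a strictly convex quadratic f(x) = (1/2) x^T A x - b^T x, one Newton
    # step lands exactly on the minimizer A^{-1} b.
    rng = np.random.default_rng(0)
    M = rng.normal(size=(4, 4))
    A = M @ M.T + np.eye(4)                # symmetric positive definite Hessian
    b = rng.normal(size=4)

    x0 = np.zeros(4)
    grad = A @ x0 - b                      # D_x f at x0
    x1 = x0 + np.linalg.solve(A, -grad)
    print(np.allclose(x1, np.linalg.solve(A, b)))   # True: done in one step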
(4) Quasi-Newton. Look up quasi-Newton methods, which estimate the second derivatives needed by the Newton iteration from the gradients observed over the iterations.
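A bare-bones BFGS sketch showing how the inverse-Hessian estimate is built purely from successive gradients; it reuses the backtracking line search from sketch (1) and the same made-up convex quadratic, so it is not a robust general-purpose implementation (real codes use a Wolfe line search):

    import numpy as np

    def bfgs(f, grad_f, x0, iters=50, tol=1e-8):
        x = x0
        g = grad_f(x)
        H = np.eye(len(x0))                 # initial inverse-Hessian estimate
        for _ in range(iters):
            if np.linalg.norm(g) < tol:
                break
            p = -H @ g                      # quasi-Newton search direction
            s_len = 1.0
            while f(x + s_len * p) > f(x) + 1e-4 * s_len * (g @ p):
                s_len *= 0.5                # backtracking line search
            x_new = x + s_len * p
            g_new = grad_f(x_new)
            s, y = x_new - x, g_new - g
            rho = 1.0 / (y @ s)
            I = np.eye(len(x0))
            # BFGS update: refine the inverse-Hessian estimate using only gradients.
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
            x, g = x_new, g_new
        return x

    D = np.diag([1.0, 10.0])
    f = lambda x: x @ D @ x
    grad_f = lambda x: 2 * D @ x
    print(bfgs(f, grad_f, np.array([3.0, 1.0])))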
As someone who is trying to educate himself on concepts in machine learning, I appreciated the 101 overview of gradient descent optimizations. I love how most of these approaches are conceptually straightforward to understand, even if their implementations and rigorous proofs take more work to get through.
I have a theory that submissions like this get to the front page because there are a lot of procrastinators who will always upvote to save the article for later reading, and since they never actually read the articles, they keep saving them. I admit I do this.