Why are algorithms other than gradient descent (e.g., BFGS, Newton's method, conjugate gradient, etc.) seemingly always absent from discussions of fitting neural nets? Are there not batch / stochastic versions of these methods? Is it simply ease of understanding / implementation? I seem to only ever see "gradient descent + tricks", and I wonder what the reason for this is.
Minibatches are good for parallelizing computation, especially on GPUs. And batch methods can be less of a fiddle to tune. So the alternatives you mention are popular in some commercial settings, for cost functions with nasty curvature, or to make it easier to write fast code. I guess most people's experience is that the dead simple SGD methods often work well though, and that's why they keep returning to them.
BFGS is an example of a quasi-Newton method. L-BFGS is the popular "low-memory" variant. Any such method can be used with the two references I gave above. One could also attempt to approximate each of its internal computations to within some tolerance with minibatches. Work in that sort of direction here: https://ei.is.tuebingen.mpg.de/publications/mahhen2015
Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I hinted above, the answer may change with the hardware landscape.
> Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I hinted above, the answer may change with the hardware landscape.
I always assumed that when people talk about gradient descent, they actually mean L-BFGS, i.e. a quasi-Newton approximation of the Hessian. After all, it can be used as a black-box algorithm that only needs to be told the gradient. At least in quantum optimization, the "simple" (non-quasi-Newton) gradient approach almost never works, whereas L-BFGS does just fine. So I'm confused why the situation would be different in machine learning, or why one would choose not to use the "for free" enhancement that L-BFGS provides.
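For what it's worth, that black-box usage is nearly a one-liner in practice. A minimal sketch (not from the thread) using scipy.optimize on a made-up least-squares objective; only the function value and its gradient are supplied:

    import numpy as np
    from scipy.optimize import minimize

    # Toy objective f(x) = ||A x - b||^2; A and b are made up for illustration.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 5))
    b = rng.normal(size=20)

    def f(x):
        r = A @ x - b
        return r @ r

    def grad_f(x):
        return 2 * A.T @ (A @ x - b)

    # L-BFGS as a black box: it is told only f and its gradient.
    result = minimize(f, x0=np.zeros(5), jac=grad_f, method="L-BFGS-B")
    print(result.x, result.fun)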
In machine learning "evaluating the gradient" means sweeping over the whole training set. For simple models, stochastic gradient descent will have found a good fit after one or two passes over the dataset. (For neural nets, tens to hundreds of passes over the dataset may be needed.) It's hard for batch L-BFGS to do much in the same number of gradient and function evaluations. No one uses batch steepest gradient descent, which would be the worst of all worlds.
The discussion above was about making stochastic or mini-batch versions of algorithms like L-BFGS. It's possible, but not entirely straightforward, which is why "dumb" stochastic gradient descent is often used.
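For contrast, the "dumb" baseline being described looks roughly like this. A minimal numpy sketch of mini-batch SGD on a made-up least-squares problem; the learning rate and batch size are illustrative choices, not recommendations:

    import numpy as np

    # Hypothetical data: n examples, d features, squared-error loss.
    rng = np.random.default_rng(0)
    n, d = 10_000, 20
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    w = np.zeros(d)
    lr, batch_size = 0.01, 64

    for epoch in range(2):                    # "one or two passes over the dataset"
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch only.
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad

Each update here costs one small batch, whereas a single batch L-BFGS step needs the gradient over all n examples.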
There were some standard, old techniques I didn't see.

For the set of real numbers R, a positive integer n, and a function f: R^n --> R, we seek x in R^n to minimize f(x). Let z in R^n denote the value of x that minimizes f(x).

Borrowing from D. Knuth's mathematical typesetting system TeX, we let x_i denote component i of x; more generally, we use the underscore to denote a subscript.

Assume that the function f is differentiable and let D_x f(x) be the derivative, that is, the gradient, that is, the vector whose component i is the partial derivative of f with respect to component i of x.
(1) Line search. Suppose we have iterations j = 1, 2, ... where x^j is our jth estimate of z. Then, for a positive real number s, our step size, the usual gradient descent update is

    x^{j + 1} = x^j - s D_x f(x^j)

Well, with this approach, it need not be that

    f(x^{j + 1}) < f(x^j)

So, an improvement is, for each iteration j, to do a line search to find the step size s. If f is convex, then there is a simple, standard approach to adjusting s in this line search.
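A minimal numpy sketch of one common choice for that line search, backtracking with the Armijo sufficient-decrease test; the test function and the constants (shrink factor 0.5, constant 1e-4) are illustrative, not from the comment above:

    import numpy as np

    def gradient_descent_backtracking(f, grad_f, x0, iters=100):
        x = x0
        for _ in range(iters):
            g = grad_f(x)
            s = 1.0
            # Shrink the step size until f decreases enough (Armijo test).
            while f(x - s * g) > f(x) - 1e-4 * s * (g @ g):
                s *= 0.5
            x = x - s * g
        return x

    # Example: a simple convex quadratic whose minimizer is the origin.
    D = np.diag([1.0, 10.0])
    f = lambda x: x @ D @ x
    grad_f = lambda x: 2 * D @ x
    print(gradient_descent_backtracking(f, grad_f, np.array([3.0, 1.0])))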
(2) Conjugate gradients. Gradient descent is vulnerable to a lot of movement in directions that are nearly orthogonal to the best direction. The classic technique of conjugate gradients improves this situation; on a quadratic in n variables it reaches the minimizer in at most n iterations.
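For the quadratic case the classic algorithm fits in a few lines. A minimal sketch of linear conjugate gradients, which minimizes (1/2) x^T A x - b^T x for a symmetric positive definite A (the nonlinear variants used for general f follow the same pattern); A and b here are made up for illustration:

    import numpy as np

    def conjugate_gradient(A, b, x0, tol=1e-10):
        x = x0
        r = b - A @ x                      # residual = negative gradient
        p = r.copy()                       # first search direction
        for _ in range(len(b)):            # at most n iterations in exact arithmetic
            Ap = A @ p
            alpha = (r @ r) / (p @ Ap)     # exact line search along p
            x = x + alpha * p
            r_new = r - alpha * Ap
            if np.linalg.norm(r_new) < tol:
                break
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p           # new direction, conjugate to the old ones
            r = r_new
        return x

    rng = np.random.default_rng(0)
    M = rng.normal(size=(5, 5))
    A = M @ M.T + 5 * np.eye(5)            # symmetric positive definite
    b = rng.normal(size=5)
    print(np.allclose(conjugate_gradient(A, b, np.zeros(5)), np.linalg.solve(A, b)))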
(3) Newton iteration. The simple Newton iteration for, say, computing a square root generalizes to minimizing a function of n variables: with H the matrix of second derivatives (the Hessian) of f at x^j, the update is

    x^{j + 1} = x^j - H^{-1} D_x f(x^j)

but, of course, this requires finding the second derivatives of f.
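A minimal sketch of both the scalar case and the n-variable case; the square root target and the quadratic (A, b) are made up for illustration:

    import numpy as np

    # 1-D Newton for sqrt(a): solve g(x) = x^2 - a = 0 via x <- x - g(x)/g'(x),
    # which simplifies to x <- (x + a/x) / 2.
    a, x = 2.0, 1.0
    for _ in range(6):
        x = 0.5 * (x + a / x)
    print(x)                               # ~1.41421356

    # n-D generalization for minimizing f: solve H d = -D_x f(x), then x <- x + d.
    # For a strictly convex quadratic f(x) = (1/2) x^T A x - b^T x, one Newton
    # step lands exactly on the minimizer A^{-1} b.
    rng = np.random.default_rng(0)
    M = rng.normal(size=(4, 4))
    A = M @ M.T + np.eye(4)                # symmetric positive definite Hessian
    b = rng.normal(size=4)

    x0 = np.zeros(4)
    grad = A @ x0 - b                      # D_x f at x0
    x1 = x0 + np.linalg.solve(A, -grad)
    print(np.allclose(x1, np.linalg.solve(A, b)))   # True: done in one step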
(4) Quasi-Newton. Look up quasi-Newton methods, which estimate the second derivatives needed by the Newton iteration from the gradients observed over the iterations.
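A bare-bones BFGS sketch showing how the inverse-Hessian estimate is built purely from successive gradients; it reuses the backtracking line search from sketch (1) and the same made-up convex quadratic, so it is not a robust general-purpose implementation (real codes use a Wolfe line search):

    import numpy as np

    def bfgs(f, grad_f, x0, iters=50, tol=1e-8):
        x = x0
        g = grad_f(x)
        H = np.eye(len(x0))                 # initial inverse-Hessian estimate
        for _ in range(iters):
            if np.linalg.norm(g) < tol:
                break
            p = -H @ g                      # quasi-Newton search direction
            s_len = 1.0
            while f(x + s_len * p) > f(x) + 1e-4 * s_len * (g @ p):
                s_len *= 0.5                # backtracking line search
            x_new = x + s_len * p
            g_new = grad_f(x_new)
            s, y = x_new - x, g_new - g
            rho = 1.0 / (y @ s)
            I = np.eye(len(x0))
            # BFGS update: refine the inverse-Hessian estimate using only gradients.
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
            x, g = x_new, g_new
        return x

    D = np.diag([1.0, 10.0])
    f = lambda x: x @ D @ x
    grad_f = lambda x: 2 * D @ x
    print(bfgs(f, grad_f, np.array([3.0, 1.0])))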
As someone who is trying to educate himself on concepts in machine learning, I appreciated the 101 overview of gradient descent optimizations. I love how most of these approaches are conceptually straightforward to understand, even if their implementations and rigorous proofs take more work to get through.
I have a theory that submissions like this get to the front page because there are a lot of procrastinators who will always upvote to save the article for later reading, and since they never actually read the articles, they keep saving them. I admit I do this.