Submitted by jarekduda t3_zb7xjb in MachineLearning
I am working on a 2nd order optimizer with a Hessian estimator obtained from online MLE linear regression of gradients, mostly by updating four exponential moving averages: of theta, g, theta*g, and theta^2. Here is a simple 2D Beale function example; after 30 steps it gets ~50x smaller values than momentum: https://github.com/JarekDuda/SGD-OGR-Hessian-estimator/raw/main/OGR%20beale.png
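For concreteness, here is a minimal 1D sketch of this kind of estimator: the regression slope of g on theta recovered from the four averages plays the role of curvature. Hyperparameters like beta and the curvature floor are illustrative choices, not the settings from the paper or repository:

```python
import numpy as np

def ogr_1d(theta, grad_fn, lr=1.0, beta=0.9, steps=30, floor=1e-2):
    # EMAs of (theta, g, theta*g, theta^2), bias-corrected Adam-style
    m = np.zeros(4)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta * m + (1 - beta) * np.array([theta, g, theta * g, theta * theta])
        mt, mg, mtg, mtt = m / (1 - beta ** t)      # bias correction
        var = mtt - mt * mt
        if var > 1e-12:
            H = (mtg - mt * mg) / var               # regression slope of g on theta ~ curvature
            theta -= lr * g / max(abs(H), floor)    # Newton-like step; abs + floor guard bad curvature
        else:
            theta -= lr * g                         # no spread in theta yet: plain gradient step
    return theta

# e.g. minimizing f(x) = (x - 3)^2 with gradient 2*(x - 3):
# ogr_1d(0.0, lambda x: 2.0 * (x - 3.0)) reaches 3 within a few steps
```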
I wanted to propose a discussion about various 2nd order approaches using only gradients - the ones I am aware of: conjugate gradients, quasi-Newton (especially L-BFGS), Gauss-Newton.
Any others? Which one seems the most practical to extend to NN training?
How to scale them to high dimensions? I thought about building the 2nd order model on a continuously updated, locally interesting low-dimensional subspace (e.g. 10-dimensional, e.g. from online PCA of gradients), and in the remaining directions still using e.g. momentum - a rough sketch of this idea is below.
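A hypothetical single update along these lines - Oja's rule for the online PCA of gradients, a per-direction regression slope as curvature inside the subspace, plain momentum outside; all names and hyperparameters are illustrative, not a tested recipe:

```python
import numpy as np

def subspace_step(theta, g, state, lr=1e-2, beta=0.9, oja_lr=1e-3, floor=1e-2, k=10):
    if not state:  # lazily initialize basis, momentum buffer and EMAs
        d = theta.size
        state.update(V=np.linalg.qr(np.random.randn(d, k))[0],  # orthonormal d x k basis
                     mom=np.zeros(d),                           # momentum buffer
                     e=np.zeros((4, k)), t=0)                   # EMAs of (p, q, p*q, p^2)

    # online PCA of gradients: Oja's rule, then re-orthonormalize
    V = state["V"] + oja_lr * np.outer(g, g @ state["V"])
    V, _ = np.linalg.qr(V)
    state["V"] = V

    # 2nd order model inside the subspace: per-direction regression slope of gradient vs parameter
    p, q = V.T @ theta, V.T @ g                  # projected parameters and gradient
    state["t"] += 1
    state["e"] = beta * state["e"] + (1 - beta) * np.stack([p, q, p * q, p * p])
    ep, eq, epq, epp = state["e"] / (1 - beta ** state["t"])    # bias-corrected EMAs
    var = epp - ep * ep
    curv = np.where(var > 1e-12, (epq - ep * eq) / np.maximum(var, 1e-12), 1.0)
    step_sub = q / np.maximum(np.abs(curv), floor)   # Newton-like step per direction

    # remaining (orthogonal) directions: plain momentum on the residual gradient
    g_res = g - V @ q
    state["mom"] = beta * state["mom"] + g_res

    return theta - V @ step_sub - lr * state["mom"]

# usage sketch: state = {}; inside the training loop: theta = subspace_step(theta, grad_fn(theta), state)
```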
How to optimally use such an estimated Hessian - especially, how to handle very small and negative eigenvalues? (e.g. take absolute values, or divide and cut above a threshold - see the sketch below)
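One way to realize the "abs" and "cut" ideas, as a hedged sketch (the floor value is arbitrary): eigendecompose the symmetric Hessian estimate, take absolute values of the eigenvalues so negative-curvature directions repel from saddles instead of attracting to them, and clip tiny eigenvalues from below to bound the per-direction step length:

```python
import numpy as np

def newton_step_abs(g, H, floor=1e-2):
    # symmetrize the Hessian estimate and eigendecompose it
    w, Q = np.linalg.eigh((H + H.T) / 2)
    # "abs" handles negative curvature, "cut" bounds 1/|lambda| from above
    w = np.maximum(np.abs(w), floor)
    # step = Q diag(1/|lambda|) Q^T g
    return Q @ ((Q.T @ g) / w)

# usage: theta -= lr * newton_step_abs(gradient, H_estimate)
```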
Slides gathering various approaches (any interesting ones missing?): https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf
Derivation of this OGR Hessian estimator: https://arxiv.org/pdf/1901.11457
Red-Portal t1_iyprcgo wrote
Isn't what you implemented more or less a variant of BFGS? Stochastic BFGS is well known to not work very well on deep neural networks.