jarekduda OP t1_iypsbet wrote
Reply to comment by Red-Portal in [R] SGD augmented with 2nd order information from seen sequence of gradients? by jarekduda
Indeed BFGS seems the closest to my approach (OGR), but it is relatively costly: it needs many matrix products per step, uses only a few gradients, and weights them all equally.
In contrast, OGR is literally online linear regression of the gradients: each step updates 4 exponential moving averages and then e.g. performs an eigendecomposition (which can be done more cheaply). It uses exponentially weakening weights - focusing on the local situation while still using all previous gradients ... it should also be compatible with slow evolution of the locally interesting subspace.
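A minimal NumPy sketch of how that description can be read (illustrative only - the forgetting rate `beta`, the eigenvalue clipping, and the class name `OGRSketch` are placeholders of mine, not the reference implementation):

```python
import numpy as np

# Sketch: online linear regression of gradients against parameters,
# fitting a local model  g(theta) ~ g0 + H (theta - mean_theta).
class OGRSketch:
    def __init__(self, dim, beta=0.9, eps=1e-6, lr=1.0):
        self.beta = beta                  # exponential forgetting -> local focus
        self.eps = eps                    # eigenvalue regularization
        self.lr = lr
        self.m_t = np.zeros(dim)          # average of theta
        self.m_g = np.zeros(dim)          # average of gradient
        self.m_tt = np.zeros((dim, dim))  # average of theta theta^T
        self.m_gt = np.zeros((dim, dim))  # average of g theta^T

    def step(self, theta, grad):
        b = self.beta
        # 1) update the four exponentially weighted averages
        self.m_t = b * self.m_t + (1 - b) * theta
        self.m_g = b * self.m_g + (1 - b) * grad
        self.m_tt = b * self.m_tt + (1 - b) * np.outer(theta, theta)
        self.m_gt = b * self.m_gt + (1 - b) * np.outer(grad, theta)

        # 2) regression of gradients: H ~ cov(g, theta) cov(theta)^-1
        cov_t = self.m_tt - np.outer(self.m_t, self.m_t)
        cov_gt = self.m_gt - np.outer(self.m_g, self.m_t)
        H = cov_gt @ np.linalg.pinv(cov_t + self.eps * np.eye(len(theta)))
        H = 0.5 * (H + H.T)               # symmetrize the estimate

        # 3) Newton-like step; |eigenvalues| so negative-curvature
        #    (saddle) directions become repulsive instead of attractive
        w, V = np.linalg.eigh(H)
        w = np.maximum(np.abs(w), self.eps)
        return theta - self.lr * (V @ ((V.T @ grad) / w))
```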
jarekduda OP t1_iyq9dj9 wrote
Reply to comment by SufficientStautistic in [R] SGD augmented with 2nd order information from seen sequence of gradients? by jarekduda
It is regularized Gauss-Newton, which seems generally quite suspicious: it approximates the Hessian with a positive-definite matrix ... for an extremely non-convex function.
How does it change the landscape of extrema?
Is it used for NN training? K-FAC uses a somewhat related Fisher-information approximation, which is also positive definite.
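To make the "positive definite" point concrete: for a least-squares loss f(θ) = ½‖r(θ)‖², the exact Hessian is JᵀJ + Σᵢ rᵢ ∇²rᵢ, while Gauss-Newton keeps only JᵀJ, which is positive semi-definite regardless of the true curvature. A tiny NumPy check (the residual function here is just an illustration I picked to make the loss non-convex):

```python
import numpy as np

def r(theta):   # toy residuals, non-convex loss f = 0.5 * ||r||^2
    return np.array([theta[0]**2 - 1.0, theta[0] * theta[1]])

def jac(theta):
    return np.array([[2 * theta[0], 0.0],
                     [theta[1],     theta[0]]])

def grad(theta):          # gradient of f is J^T r
    return jac(theta).T @ r(theta)

theta = np.array([0.1, 0.5])
J = jac(theta)
gn = J.T @ J              # Gauss-Newton matrix, PSD by construction

# exact Hessian via central finite differences of the gradient
eps = 1e-5
H = np.column_stack([(grad(theta + eps * e) - grad(theta - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

print(np.linalg.eigvalsh(gn))               # all >= 0
print(np.linalg.eigvalsh(0.5 * (H + H.T)))  # contains a negative eigenvalue
```

So near saddles the Gauss-Newton model reports positive curvature where the true Hessian has negative curvature - which is exactly the "how does it change the landscape of extrema" question above.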