jnez71

jnez71 t1_ivp6ril wrote

Oh I should add that from a nonconvex optimization perspective, the volume-averaging could provide heuristic benefits akin to GD+momentum type optimizers. (Edited my first comment to reflect this).

Try playing around with your idea in low dimensions on a classical computer to get a feel for it first. Might help you think of new ways to research it.
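For instance, a minimal classical sketch of the volume-averaging idea on a 2D test problem (the test function, ball radius, and step size here are arbitrary placeholders, not anything from the quantum setting):

```python
import numpy as np

def grad_rosenbrock(p):
    # Analytic gradient of the 2D Rosenbrock function, a standard nonconvex test problem.
    # p has shape (..., 2) so we can evaluate a whole cloud of points at once.
    x, y = p[..., 0], p[..., 1]
    gx = -2 * (1 - x) - 400 * x * (y - x**2)
    gy = 200 * (y - x**2)
    return np.stack([gx, gy], axis=-1)

def volume_averaged_gd(p0, lr=1e-4, radius=0.1, n_samples=100, n_steps=20000, seed=0):
    # Classical stand-in for the proposal: at each iterate, Monte-Carlo average the
    # gradient over a small cube ("volume") around the current point, then step.
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(n_steps):
        offsets = rng.uniform(-radius, radius, size=(n_samples, 2))
        avg_grad = grad_rosenbrock(p + offsets).mean(axis=0)
        p -= lr * avg_grad
    return p

print(volume_averaged_gd([-1.5, 2.0]))  # should drift toward the minimum at (1, 1)
```

Comparing its trajectory to plain GD from the same starting point is an easy way to see whether the averaging actually buys you anything.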

1

jnez71 t1_ivp3sju wrote

Hm, there may be a way to exploit that cheapened average gradient computation to still tell you curvature, which can help a lot.

I am reminded of how a covariance matrix is really just composed of means: cov[g,g] = E[gg'] - E[g]E[g'] (where ' is transpose). If g is distributed as the gradients in your volume, I suspect that cov[g,g] is related to the Hessian, and you can get that covariance with basically just averages of g.

More intuitively, I'm thinking "in this volume, how much on average does the gradient differ from the average gradient?" If your quantum computer really makes that volume averaging trivial, then I suspect someone would have already come up with this as some kind of "quantum Newton's method."
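To make that concrete, here's a rough classical sanity check on a toy quadratic (H, x0, and sigma below are made up; for a quadratic, grad f(x) = H x, so gradients sampled isotropically in a small volume have covariance sigma^2 * H H'):

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy quadratic f(x) = 0.5 x' H x, so grad f(x) = H x  (made-up Hessian)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])

x0 = np.array([1.0, -2.0])
sigma = 0.05
points = x0 + sigma * rng.standard_normal((100000, 2))  # a small isotropic "volume"
g = points @ H.T                                         # gradient at every sampled point

# The covariance really is built from nothing but averages of g:
Eg  = g.mean(axis=0)                                  # E[g]
Egg = (g[:, :, None] * g[:, None, :]).mean(axis=0)    # E[gg']
cov = Egg - np.outer(Eg, Eg)                          # cov[g,g]

# For this quadratic with isotropic sampling, cov[g,g] ~= sigma^2 * H H',
# so curvature is recoverable from the same kind of volume averages.
print(cov / sigma**2)
print(H @ H.T)
```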

I think that's all I got for ya. Good luck!

2

jnez71 t1_ivoyij3 wrote

I suppose it is a bit closer to a secant method like BFGS, which approximates the Hessian required for a Newton step. In other words, these methods use a linear combination of adjacent gradient computations to estimate curvature, which enables more effective updates. The combination is not an average though, and also they integrate gradient computations over successive iterations rather than stopping at each iteration to compute a bunch within some volume.
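For reference, the standard BFGS update that turns two successive gradient evaluations into a curvature estimate looks roughly like this (just a sketch of the classical formula, nothing quantum):

```python
import numpy as np

def bfgs_update(B, s, y):
    # B: current Hessian approximation
    # s: step taken, x_new - x_old
    # y: change in gradient, g_new - g_old
    # Two adjacent gradient evaluations (via y) refine the curvature estimate.
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```

The Newton-like step is then computed against B instead of the true Hessian.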

I don't think your proposed volume-averaging has any theoretical utility as a convex optimizer, especially because there are much better, well-known things to do with adjacent gradient computations. The closest common practice to averaging that I can think of is GD+momentum type optimizers, which lowpass-filter the GD dynamic along its trajectory. These provide heuristic benefits in the context of nonconvex optimization.
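By "lowpass-filter the GD dynamic along its trajectory" I mean something like this (heavy-ball / EMA form; hyperparameters arbitrary):

```python
import numpy as np

def gd_momentum(grad, x0, lr=0.01, beta=0.9, n_steps=1000):
    # The velocity v is an exponential moving average of gradients seen along the
    # trajectory -- a lowpass filter over iterations, not an average over a volume
    # around a single iterate.
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        v = beta * v + (1 - beta) * grad(x)
        x -= lr * v
    return x

print(gd_momentum(lambda x: 2 * x, [5.0, -3.0]))  # minimizes ||x||^2, ends near zero
```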

Do you have any links about the volume-averaging or was it just a random thought? Also, make sure not to confuse "averaging gradients from different points in parameter space" with "averaging gradients from different sample losses at the same point in parameter space" (called "batching").
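Roughly, the distinction on a toy least-squares problem (made-up data, just to pin down the terms):

```python
import numpy as np
rng = np.random.default_rng(0)

# Made-up least-squares data: loss_i(theta) = 0.5 * (a_i . theta - b_i)^2
A = rng.standard_normal((50, 3))
b = rng.standard_normal(50)
theta = np.zeros(3)

def sample_grad(i, th):
    return (A[i] @ th - b[i]) * A[i]       # gradient of one sample's loss

def full_grad(th):
    return A.T @ (A @ th - b) / len(b)     # gradient of the mean loss

# "Batching": average over data samples at the SAME point in parameter space
batch = rng.choice(len(b), size=8, replace=False)
batch_grad = np.mean([sample_grad(i, theta) for i in batch], axis=0)

# Volume averaging: average the full loss's gradient at DIFFERENT nearby parameter points
offsets = 0.01 * rng.standard_normal((8, 3))
volume_grad = np.mean([full_grad(theta + d) for d in offsets], axis=0)

print(batch_grad)
print(volume_grad)
```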

4