bluuerp t1_ivov64y wrote
The gradient gives you the optimal improvement direction... if you have 10 points, the gradients at all 10 will point in different directions, so if you take a step after each point you'll zig-zag around a lot. You might even backtrack a bit. If you instead take the average of all 10 and do a single step, you won't be optimal with regard to each point individually, but the path you take will be smoother.
So it depends on your dataset. Usually you want some smoothing, because otherwise you won't converge that easily.
The same is true for your example... the center point might not be a good estimate of its surroundings. It could, however, be close to the average, in which case there isn't that big of a difference.
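To make the zig-zag vs. averaging point concrete, here is a minimal NumPy sketch (my own illustration, not from the thread) comparing one step per data point against a single step along the averaged gradient on a toy least-squares problem. All names and values are made up:

```python
# Per-point steps vs. one averaged step on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 2))                      # 10 data points, 2 features
ys = xs @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=10)

def grad(w, x, y):
    """Gradient of the squared error 0.5*(x@w - y)^2 with respect to w."""
    return (x @ w - y) * x

lr = 0.1

# One step per point: each gradient points somewhere else, so the path zig-zags.
w_zigzag = np.zeros(2)
for x, y in zip(xs, ys):
    w_zigzag -= lr * grad(w_zigzag, x, y)

# One step along the average of all 10 gradients: a single, smoother move.
w_avg = np.zeros(2)
g = np.mean([grad(w_avg, x, y) for x, y in zip(xs, ys)], axis=0)
w_avg -= lr * g

print(w_zigzag, w_avg)
```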
uncooked-cookie t1_ivpnon9 wrote
The gradient doesn’t give you the optimal improvement direction, it gives you a local improvement direction.
make3333 t1_ivqfpxu wrote
first degree optimal direction
Difficult_Ferret2838 t1_ivrnegq wrote
That doesn't mean anything.
make3333 t1_ivroe1x wrote
gradient descent takes the direction of the minimum at the step size according to the taylor series of degree n at that point. in neural nets we do first degree, as if it was a plane. in a lot of other optimization settings they do second order approx to find the optimal direction
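As a rough illustration of the first-order vs. second-order distinction (not from the original comment; the function, names, and step size are made up), here is a plain gradient step next to a Newton step, which uses the Hessian from the second-order Taylor expansion:

```python
# First-order vs. second-order steps on f(x, y) = x^2 + 10*y^2.
import numpy as np

def grad(p):                       # gradient of f
    return np.array([2.0 * p[0], 20.0 * p[1]])

def hessian(p):                    # Hessian of f (constant for a quadratic)
    return np.array([[2.0, 0.0], [0.0, 20.0]])

p = np.array([1.0, 1.0])

# First-order: step along the negative gradient, scaled by a chosen step size.
step_gd = -0.05 * grad(p)

# Second-order (Newton): solve H d = -g, using curvature to set both
# the direction and the scale of the step.
step_newton = np.linalg.solve(hessian(p), -grad(p))

print(step_gd)        # direction depends only on the local gradient
print(step_newton)    # for this quadratic, the Newton step lands exactly on the minimum
```

For the quadratic above the Newton step lands exactly on the minimum, which is why second-order steps can look "optimal" locally, while the gradient step only gives the steepest direction at that point.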
Difficult_Ferret2838 t1_ivrom17 wrote
>gradient descent takes the direction of the minimum at the step size according to the taylor series of degree n at that point.
No. Gradient descent is first order by definition.
>in a lot of other optimization settings they do second order approx to find the optimal direction
It still isn't an "optimal" direction.
kksnicoh t1_ivtla47 wrote
It is optimal in first order :)
Difficult_Ferret2838 t1_ivtprrn wrote
Exactly, that is a meaningless phrase.
bluuerp t1_ivpshwu wrote
Yes I meant the optimal improvement direction for that point.
Spiritual-Reply5896 t1_iw8yhoi wrote
It gives you a local improvement direction, but can we straightforwardly take this metaphor of improvement in 3D and generalize it to thousands of dimensions?
Maybe it's a slightly different question, but do you happen to know where to find research on how mathematical operations that are interpretable in low, geometrically intuitive dimensions generalize to extremely high dimensions? I'm not looking for theory on vector spaces but for the intuitive aspects.
hughperman t1_ivqd886 wrote
Consider, though: in a linear scheme, taking each gradient step separately is equal to the sum of the gradients. Taking the average is equal to the sum of the gradients divided by the number of steps. So you are only adjusting the step by a scale factor of 1/N, nothing more mathemagical.
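A quick numeric check of that claim (assuming the gradients stay fixed between steps, i.e. the "linear scheme"; the numbers are made up):

```python
# N separate steps vs. one averaged step when the gradients don't change.
import numpy as np

grads = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.5, 1.0])]
lr = 0.1
w0 = np.zeros(2)

# Applying one step per gradient in sequence adds up to the sum of the gradients.
w_seq = w0 - lr * sum(grads)

# A single step along the averaged gradient: same direction, scaled by 1/N.
w_avg = w0 - lr * sum(grads) / len(grads)

print(w_seq, w_avg)
```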
CPOOCPOS OP t1_ivowysr wrote
This sounds similar to what fredditor_1 was explaining. I will look into it!
Thanks a lot