bluuerp t1_ivov64y wrote
The gradient gives you the optimal improvement direction... if you have 10 points, the gradients at all 10 will point in different directions, so if you take a step after each point you'll zig-zag around a lot. You might even backtrack a bit. If you instead take the average of all 10 and do a single step, you won't be optimal with regard to each point individually, but the path you take will be smoother.
So it depends on your dataset. Usually you want some smoothing, because otherwise you won't converge that easily.
The same is true for your example... the center point might not be a good estimate of its surroundings. It could, however, be close to the average, in which case there isn't that big of a difference.
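To make the zig-zag vs. averaging point concrete, here is a minimal NumPy sketch (my own illustration, not from the thread) comparing one step per data point against a single step along the averaged gradient on a toy least-squares problem. All names and values are made up:

```python
# Per-point steps vs. one averaged step on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 2))                      # 10 data points, 2 features
ys = xs @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=10)

def grad(w, x, y):
    """Gradient of the squared error 0.5*(x@w - y)^2 with respect to w."""
    return (x @ w - y) * x

lr = 0.1

# One step per point: each gradient points somewhere else, so the path zig-zags.
w_zigzag = np.zeros(2)
for x, y in zip(xs, ys):
    w_zigzag -= lr * grad(w_zigzag, x, y)

# One step along the average of all 10 gradients: a single, smoother move.
w_avg = np.zeros(2)
g = np.mean([grad(w_avg, x, y) for x, y in zip(xs, ys)], axis=0)
w_avg -= lr * g

print(w_zigzag, w_avg)
```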
uncooked-cookie t1_ivpnon9 wrote
The gradient doesn’t give you the optimal improvement direction, it gives you a local improvement direction.
make3333 t1_ivqfpxu wrote
first degree optimal direction
Difficult_Ferret2838 t1_ivrnegq wrote
That doesn't mean anything.
make3333 t1_ivroe1x wrote
gradient descent takes the direction of the minimum at the step size according to the taylor series of degree n at that point. in neural nets we do first degree, as if it was a plane. in a lot of other optimization settings they do second order approx to find the optimal direction
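As a rough illustration of the first-order vs. second-order distinction (not from the original comment; the function, names, and step size are made up), here is a plain gradient step next to a Newton step, which uses the Hessian from the second-order Taylor expansion:

```python
# First-order vs. second-order steps on f(x, y) = x^2 + 10*y^2.
import numpy as np

def grad(p):                       # gradient of f
    return np.array([2.0 * p[0], 20.0 * p[1]])

def hessian(p):                    # Hessian of f (constant for a quadratic)
    return np.array([[2.0, 0.0], [0.0, 20.0]])

p = np.array([1.0, 1.0])

# First-order: step along the negative gradient, scaled by a chosen step size.
step_gd = -0.05 * grad(p)

# Second-order (Newton): solve H d = -g, using curvature to set both
# the direction and the scale of the step.
step_newton = np.linalg.solve(hessian(p), -grad(p))

print(step_gd)        # direction depends only on the local gradient
print(step_newton)    # for this quadratic, the Newton step lands exactly on the minimum
```

For the quadratic above the Newton step lands exactly on the minimum, which is why second-order steps can look "optimal" locally, while the gradient step only gives the steepest direction at that point.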
Difficult_Ferret2838 t1_ivrom17 wrote
>gradient descent takes the direction of the minimum at the step size according to the taylor series of degree n at that point.
No. Gradient descent is first order by definition.
>in a lot of other optimization settings they do second order approx to find the optimal direction
It still isn't an "optimal" direction.
kksnicoh t1_ivtla47 wrote
It is optimal in first order :)
Difficult_Ferret2838 t1_ivtprrn wrote
Exactly, that is a meaningless phrase.
bluuerp t1_ivpshwu wrote
Yes I meant the optimal improvement direction for that point.
Spiritual-Reply5896 t1_iw8yhoi wrote
It gives you a local improvement direction, but can we straightforwardly take this metaphor of improvement in 3D and generalize it to thousands of dimensions?
Maybe it's a slightly different question, but do you happen to know where to find research on how mathematical operations that are interpretable in low, geometrically intuitive dimensions generalize to extremely high dimensions? I'm not looking for theory on vector spaces but for the intuitive aspects.
hughperman t1_ivqd886 wrote
Consider, though: in a linear scheme, taking each gradient step separately is equal to the sum of the gradients. Taking the average is equal to the sum of the gradients divided by the number of steps. So you are only adjusting the step by a scale factor of 1/N, nothing more mathemagical.
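A quick numeric check of that claim (assuming the gradients stay fixed between steps, i.e. the "linear scheme"; the numbers are made up):

```python
# N separate steps vs. one averaged step when the gradients don't change.
import numpy as np

grads = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.5, 1.0])]
lr = 0.1
w0 = np.zeros(2)

# Applying one step per gradient in sequence adds up to the sum of the gradients.
w_seq = w0 - lr * sum(grads)

# A single step along the averaged gradient: same direction, scaled by 1/N.
w_avg = w0 - lr * sum(grads) / len(grads)

print(w_seq, w_avg)
```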
CPOOCPOS OP t1_ivowysr wrote
This sounds similar to what fredditor_1 was explaining. I will look into it!
Thanks a lot