
IndecisivePhysicist t1_j20lxts wrote

Converge to what, though? My whole point was that you don't want to converge to the actual global minimum on the training set; you want one of the many local minima, and you want one that is flat.


IndecisivePhysicist t1_j1yd893 wrote

This is actually good. The "noise" of minibatch SGD acts as a regularizer. You don't actually want to find the global minimum of the training set; you want a generalizable minimum, which usually means a flat minimum, because there will be some distribution shift at test time. The minimum being slightly different for each minibatch helps draw you toward a flat minimum.