IndecisivePhysicist
IndecisivePhysicist t1_j1yd893 wrote
Reply to [D] Has any research been done to counteract the fact that each training datapoint "pulls the model in a different direction", partly undoing learning until shared features emerge? by derpderp3200
This is actually a good thing. The "noise" of minibatch SGD acts as a regularizer. You don't actually want to find the global minimum of the training set; rather, you want a generalizable minimum, which usually means a flat minimum, because there will be some distribution shift at test time. The fact that the minimum is slightly different for each minibatch helps draw you toward a flat one.
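A minimal sketch of the "different direction per minibatch" point, using toy NumPy data (all names and numbers are made up for illustration): each minibatch gradient deviates from the full-batch gradient, but their average still matches it, so SGD descends on average while the per-step noise discourages settling into sharp minima.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1024)   # noisy linear targets (toy data)

w = np.zeros(10)

def grad(Xb, yb, w):
    # Gradient of mean squared error for a linear model on one batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad(X, y, w)
batch_grads = [grad(X[i:i + 64], y[i:i + 64], w) for i in range(0, 1024, 64)]

# Each minibatch gradient points in a slightly different direction...
spread = np.mean([np.linalg.norm(g - full_grad) for g in batch_grads])
print("avg deviation from full-batch gradient:", spread)

# ...but their mean equals the full-batch gradient, so the updates still
# descend on average; the fluctuation is the "noise" acting as a regularizer.
print("mean of batch grads == full grad:",
      np.allclose(np.mean(batch_grads, axis=0), full_grad))
```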
IndecisivePhysicist t1_j20lxts wrote
Reply to comment by derpderp3200 in [D] Has any research been done to counteract the fact that each training datapoint "pulls the model in a different direction", partly undoing learning until shared features emerge? by derpderp3200
Converge to what, though? My whole point was that you don't want to converge to the actual global minimum of the training set; you want one of the many local minima, and specifically one that is flat.
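A minimal 1-D sketch of why flat beats sharp under distribution shift (the two loss functions are invented for illustration, not from the thread): nudging the minimum's location at test time barely hurts a flat basin but badly hurts a sharp one.

```python
def sharp_loss(w, shift=0.0):
    # Narrow, high-curvature basin centered at 1.0 (+ test-time shift).
    return 50.0 * (w - (1.0 + shift)) ** 2

def flat_loss(w, shift=0.0):
    # Wide, low-curvature basin centered at 3.0 (+ test-time shift).
    return 0.5 * (w - (3.0 + shift)) ** 2

w_sharp, w_flat = 1.0, 3.0   # the train-time minima of each basin
shift = 0.2                  # small train -> test distribution shift

print("sharp minimum, loss after shift:", sharp_loss(w_sharp, shift))  # 2.0
print("flat  minimum, loss after shift:", flat_loss(w_flat, shift))    # 0.02
```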