
IndecisivePhysicist t1_j1yd893 wrote

This is good, actually. The "noise" of mini-batch SGD acts as a regularizer. You don't actually want to find the global minimum of the training set; rather, you want a generalizable minimum, which usually means a flat minimum, because there will be some distribution shift at test time. The minimum being slightly different for each mini-batch helps draw you toward a flat minimum.

1
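A minimal sketch of the point above, on a contrived 1-D loss (not anything from a real network): each "mini-batch" sees the landscape with its minimum shifted slightly, so the gradient noise is large in a sharp basin and small in a flat one, and the noisy iterate tends to end up in the flat basin while full-batch gradient descent stays put.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D landscape (made up for illustration): a sharp minimum at x = -1
# and a flat minimum at x = +2, both at the same loss value.
def grad(x, shift=0.0):
    # Each "mini-batch" sees the same landscape with its minimum shifted
    # slightly, so the gradient (and the minimum) differs per batch.
    sharp = 25.0 * (x + 1.0 + shift) ** 2
    flat = 0.5 * (x - 2.0 + shift) ** 2
    if sharp < flat:
        return 50.0 * (x + 1.0 + shift)   # steep basin -> large gradients
    return x - 2.0 + shift                # wide basin -> small gradients

lr, n_steps = 0.02, 5000

# Full-batch gradient descent: no shift, so it stays in the basin it starts in.
x_gd = -1.0
for _ in range(n_steps):
    x_gd -= lr * grad(x_gd)

# "Mini-batch" SGD: the per-batch shift perturbs the sharp basin far more than
# the flat one, so the iterate tends to get kicked out of the sharp minimum
# and settle in the flat one.
x_sgd, tail = -1.0, []
for step in range(n_steps):
    x_sgd -= lr * grad(x_sgd, shift=0.3 * rng.standard_normal())
    if step >= n_steps - 1000:
        tail.append(x_sgd)

print(f"full-batch GD ends at x = {x_gd:+.2f}   (sharp minimum is at -1)")
print(f"noisy SGD settles near x = {np.mean(tail):+.2f}   (flat minimum is at +2)")
```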

derpderp3200 OP t1_j1yh8qe wrote

But is it the most efficient and effective method?

I'd imagine it's possible to converge much faster, and that at some point in training you likely run into a "limit" where the "signal" (learnable features) can no longer overcome the "noise" (the "pull effect").

1

nonotan t1_j1yovo3 wrote

It's probably not the most efficient method. However, methods that converge faster tend, in general, to lead to slightly worse minima (think momentum-based methods vs "plain" SGD). That "intuitively" makes some degree of sense: the additional time spent training isn't completely wasted; some of it effectively helps explore the possibility space, optimizing the model in ways that simple gradient-following might miss entirely.

I would be shocked if there doesn't exist a method that does even better than SGD while also being significantly more efficient. But it's probably not going to be easy to find, and I expect most simple heuristics ("this seems to be helping, do it more" or "this doesn't seem to be helping, do it less") will lead to training time vs accuracy tradeoffs, rather than universal improvements.

3
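To make the comparison above concrete, here is a small runnable sketch of the two update rules on an ill-conditioned quadratic (hyperparameter values are illustrative). Note this only shows the "converges faster" half of the trade-off; the "slightly worse minima" part is an empirical observation about non-convex networks that a convex toy can't exhibit.

```python
import numpy as np

# A convex quadratic loss 0.5 * w^T A w, just to show the mechanics of the
# two update rules being compared.
A = np.diag([1.0, 50.0])          # ill-conditioned: one steep, one shallow direction
grad = lambda w: A @ w

lr, beta, steps = 0.015, 0.9, 200
w_gd = np.array([1.0, 1.0])
w_mom = np.array([1.0, 1.0])
v = np.zeros(2)

for _ in range(steps):
    w_gd = w_gd - lr * grad(w_gd)      # plain gradient descent: follow the current gradient
    v = beta * v + grad(w_mom)         # momentum: exponentially-decaying sum of past gradients
    w_mom = w_mom - lr * v             # consistent directions build up "speed"

print("distance to the optimum after", steps, "steps:")
print("  plain GD :", np.linalg.norm(w_gd))
print("  momentum :", np.linalg.norm(w_mom))
```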

IndecisivePhysicist t1_j20lxts wrote

Converge to what, though? My whole point was that you don't want to converge to the actual global minimum of the training set; you want one of the many local minima, and specifically one that is flat.

1
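As a rough illustration of what "flat" means operationally (a common diagnostic, not something from this thread): perturb the parameters randomly and see how much the loss rises. At a flat minimum the loss barely moves; at a sharp one it blows up.

```python
import numpy as np

def sharpness(loss_fn, w, radius=0.05, n_samples=20, rng=None):
    # Average loss increase over random perturbations of a fixed radius.
    rng = rng or np.random.default_rng(0)
    base = loss_fn(w)
    rises = []
    for _ in range(n_samples):
        delta = rng.standard_normal(w.shape)
        delta *= radius / np.linalg.norm(delta)   # random direction, fixed radius
        rises.append(loss_fn(w + delta) - base)
    return float(np.mean(rises))

# Example: the same bowl-shaped loss at two different curvatures.
flat_bowl  = lambda w: 0.5 * np.sum(w ** 2)           # low curvature
sharp_bowl = lambda w: 0.5 * np.sum((10 * w) ** 2)    # high curvature

w_min = np.zeros(3)
print("loss rise at the flat minimum :", sharpness(flat_bowl, w_min))
print("loss rise at the sharp minimum:", sharpness(sharp_bowl, w_min))
```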