Submitted by 029187 t3_xt0h2k in MachineLearning
Do optimization techniques like Adam exist primarily to overcome the noise created by mini-batch gradient descent, or would they be beneficial even if we were doing full batches every time?
Seems like you could approximate a minibatch with a full batch + noise? Maybe there's a better noising procedure when using full batch gradients.
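Roughly what I mean, as a toy sketch (PyTorch-flavoured, with placeholder names like `model`, `loss_fn`, `full_X`, `full_y`, and isotropic Gaussian noise as a naive assumption rather than anything tuned):

```python
import torch

def noisy_full_batch_step(model, loss_fn, full_X, full_y, lr=1e-2, noise_std=1e-3):
    """One full-batch gradient step with hand-added noise (placeholder setup)."""
    model.zero_grad()
    loss = loss_fn(model(full_X), full_y)   # loss over the entire dataset
    loss.backward()                          # exact full-batch gradient
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                # isotropic Gaussian noise is a crude stand-in; real minibatch
                # noise is anisotropic and depends on the data distribution
                p -= lr * (p.grad + noise_std * torch.randn_like(p.grad))
    return loss.item()
```

Whether any simple noise model like this actually reproduces what minibatch sampling does is exactly the question.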
Capturing the distribution characteristics of high-dimensional data is very hard. In fact, if we could do that well, we might be able to use classical Bayesian techniques for many NN problems, which would be more principled and interpretable. Any noise you add by hand is unlikely to reproduce the kind of stochasticity that sampling real data (via minibatches or similar procedures) gives you. Getting the distribution wrong would likely mean poor generalization.
This is wrong see: https://www.youtube.com/watch?v=kcVWAKf7UAg
The real reason is it's just faster to train on smaller batches (because the steps are quicker).
Yes, that too. My explanation wasn't incorrect, though; it just needed more to be complete, right?
For a while there was a belief that the stochasticity was key for good performance (with one paper from 2016 supporting the hypothesis). Your framing makes it sound like that is still the case - you suggest no other reason for not doing full-batch descent - and I think it's important to point out that it isn't.
The full gradient does not work well for NNs. Plus, Adam maintains a coarse estimate of the curvature, so it behaves more like a second-order method, even if you can find functions where those estimates are poor.
Mini-batches are not there just for memory limitations. They inject noise into the optimization, which helps escape local minima and explore the loss landscape.
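A tiny synthetic illustration of that noise (made-up least-squares data, not from any real training run): a minibatch gradient is an unbiased but noisy estimate of the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                 # synthetic inputs
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)
w = np.zeros(20)

def grad(Xb, yb, w):
    # gradient of the mean squared error (up to a constant factor)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_g = grad(X, y, w)                            # exact full-batch gradient
idx = rng.choice(len(y), size=64, replace=False)  # one minibatch of 64
mini_g = grad(X[idx], y[idx], w)

print(np.linalg.norm(mini_g - full_g))            # nonzero: that gap is the injected noise
```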
what if, as another poster said, we did full batch but also injected noise into it?
Possible, but what is the advantage of that? Even if we did find a way to explicitly noise the data/gradient, we are still better off with mini-batches, since they consume less memory.
If it's only as good, then it has no benefit. But if it ends up being better, then it is useful in situations where we have enough memory.
https://arxiv.org/abs/2103.17182
This paper claims they might have found interesting ways to make it better.
People have tried it, and so far no one has been able to achieve the same effect. It's still somewhat of an open research problem.
ah cool! do you have any links to papers on the topic? i'd love to read them!
I think it was this one: https://arxiv.org/abs/2103.17182
thanks!!
Adam is more likely to outperform steepest descent (full batch GD) in the full batch setting than it is to outperform SGD at batch size 1.
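For context, here is a minimal numpy sketch of the Adam update (following Kingma & Ba, 2015; the variable names are just illustrative). The per-parameter second-moment rescaling is what distinguishes it from plain steepest descent even when the gradient is exact.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g; step count t starts at 1."""
    m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g**2           # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1**t)                # bias corrections
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step size
    return w, m, v
```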
To add to what others have said, you would still likely want mini-batches to better track progress. Even with infinite memory there is a limit to how fast you can process information (even at physical extremes), so these operations would not be instant. Unless there were significant drawbacks to using minibatches, you'd probably take minibatches with seconds or minutes per update over a hanging loop that updates every X hours.
Hard to answer.
A few years ago my answer would have been a resounding "hell no"; back in those days a batch size of 64 was considered large.
Today training configurations of top-performing models are commonly in the ballpark 4096 images per batch, which I never thought I would see.
This kind of shows that batch size does not exist in a vacuum but rather coexists with other hyperparameters. For efficiency, doing everything in one batch would be desirable since everything stays in RAM. However, actually doing so would require coming up with an entirely new set of hyperparameters.
Also, gradient accumulation is a thing, and you could in theory treat an entire training epoch as a single batch without running OOM, but nobody has found that to be effective yet.
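For anyone curious, a rough PyTorch-style sketch of that idea (placeholder names like `model`, `optimizer`, `loss_fn`, `loader`; not from any particular codebase): gradients are summed over all micro-batches and applied in a single optimizer step, emulating one epoch-sized batch without holding it all in memory at once.

```python
def epoch_as_one_step(model, optimizer, loss_fn, loader):
    """Accumulate gradients over the whole loader, then take a single update."""
    optimizer.zero_grad()
    n_batches = len(loader)
    for X, y in loader:
        loss = loss_fn(model(X), y) / n_batches   # scale so the accumulated grad averages over the epoch
        loss.backward()                           # .grad accumulates across backward() calls
    optimizer.step()                              # one parameter update for the entire "epoch batch"
```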
I feel like with that memory available, testing scaling laws is a better research direction than testing full batch
Adding to other answers: even if you had enough memory, it would still be computationally inefficient. There are diminishing returns from increasing batch size in terms of how much the loss improves per step.
ClearlyCylindrical t1_iqna0cr wrote
If it were possible to do full batch all the time, minibatches would likely still be used. The stochasticity created by minibatch gradient descent generally improves a model's generalisation performance.