Submitted by 029187 t3_xt0h2k in MachineLearning
Do optimization techniques like Adam exist primarily to overcome the noise created by mini-batch gradient descent, or would they be beneficial even if we were doing full batches every time?
Seems like you could approximate a minibatch with a full batch + noise? Maybe there's a better noising procedure when using full batch gradients.
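Roughly what I mean, as a toy sketch (PyTorch-flavoured, with placeholder names like `model`, `loss_fn`, `full_X`, `full_y`, and isotropic Gaussian noise as a naive assumption rather than anything tuned):

```python
import torch

def noisy_full_batch_step(model, loss_fn, full_X, full_y, lr=1e-2, noise_std=1e-3):
    """One full-batch gradient step with hand-added noise (placeholder setup)."""
    model.zero_grad()
    loss = loss_fn(model(full_X), full_y)   # loss over the entire dataset
    loss.backward()                          # exact full-batch gradient
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                # isotropic Gaussian noise is a crude stand-in; real minibatch
                # noise is anisotropic and depends on the data distribution
                p -= lr * (p.grad + noise_std * torch.randn_like(p.grad))
    return loss.item()
```

Whether any simple noise model like this actually reproduces what minibatch sampling does is exactly the question.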
Capturing the distribution characteristics of high-dimensional data is very hard. In fact, if we could do that well, we might be able to use classical Bayesian techniques for many NN problems, which would be more principled and interpretable. Any noise you add by hand is unlikely to reproduce the kind of stochasticity that sampling real data (via minibatches or similar procedures) gives you. Getting the distribution wrong would likely mean poor generalization.
This is wrong see: https://www.youtube.com/watch?v=kcVWAKf7UAg
The real reason is it's just faster to train on smaller batches (because the steps are quicker).
Yes, that too. My explanation wasn't incorrect, though; it just needed more to be complete, right?
For a while there was a belief that the stochasticity was key for good performance (with one paper from 2016 supporting the hypothesis). Your framing makes it sound like that is still the case - you suggest no other reason for not doing full-batch descent - and I think it's important to point out that it isn't.
The full gradient does not work well for NNs. Plus, Adam maintains a coarse estimate of the curvature, so it behaves more like a second-order method, even if you can find functions where those estimates are poor.
Mini-batches are not there just for memory limitations. They inject noise into the optimization, which helps escape local minima and explore the loss landscape.
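A tiny synthetic illustration of that noise (made-up least-squares data, not from any real training run): a minibatch gradient is an unbiased but noisy estimate of the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                 # synthetic inputs
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)
w = np.zeros(20)

def grad(Xb, yb, w):
    # gradient of the mean squared error (up to a constant factor)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_g = grad(X, y, w)                            # exact full-batch gradient
idx = rng.choice(len(y), size=64, replace=False)  # one minibatch of 64
mini_g = grad(X[idx], y[idx], w)

print(np.linalg.norm(mini_g - full_g))            # nonzero: that gap is the injected noise
```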
what if, as another poster said, we did full batch but also injected noise into it?
Possible, but what is the advantage of that? Even if we did find a way to explicitly noise the data/gradient, we are still better off with mini-batches, since they consume less memory.
If it's only as good, then it has no benefit. But if it ends up being better, then it is useful in situations where we have enough memory.
https://arxiv.org/abs/2103.17182
This paper claims they might have found interesting ways to make it better.
People have tried it, and so far no one has been able to achieve the same effect. It's still somewhat of an open research problem.
ah cool! do you have any links to papers on the topic? i'd love to read them!
I think it was this one: https://arxiv.org/abs/2103.17182
thanks!!
Adam is more likely to outperform steepest descent (full batch GD) in the full batch setting than it is to outperform SGD at batch size 1.
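For context, here is a minimal numpy sketch of the Adam update (following Kingma & Ba, 2015; the variable names are just illustrative). The per-parameter second-moment rescaling is what distinguishes it from plain steepest descent even when the gradient is exact.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g; step count t starts at 1."""
    m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g**2           # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1**t)                # bias corrections
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step size
    return w, m, v
```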
To add to what others have said, you would still likely want mini-batches to better track progress. Even with infinite memory there is a limit to how fast you can process information (even at physical extremes), so these operations would not be instant. Unless there were significant drawbacks to using minibatches, you'd probably take minibatches with seconds or minutes per update over a hanging loop that updates every X hours.
Hard to answer.
A few years ago my answer would have been a resounding "hell no"; back in those days a batch size of 64 was considered large.
Today training configurations of top-performing models are commonly in the ballpark 4096 images per batch, which I never thought I would see.
This kind of shows that batch size does not exist in a vacuum but rather coexists with other hyperparameters. For efficiency, doing everything in one batch would be desirable since everything stays in RAM. However, actually doing so would require coming up with an entirely new set of hyperparameters.
Also, gradient accumulation is a thing, and you could in theory treat an entire training epoch as a single batch without running OOM, but nobody has found that to be effective yet.
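For anyone curious, a rough PyTorch-style sketch of that idea (placeholder names like `model`, `optimizer`, `loss_fn`, `loader`; not from any particular codebase): gradients are summed over all micro-batches and applied in a single optimizer step, emulating one epoch-sized batch without holding it all in memory at once.

```python
def epoch_as_one_step(model, optimizer, loss_fn, loader):
    """Accumulate gradients over the whole loader, then take a single update."""
    optimizer.zero_grad()
    n_batches = len(loader)
    for X, y in loader:
        loss = loss_fn(model(X), y) / n_batches   # scale so the accumulated grad averages over the epoch
        loss.backward()                           # .grad accumulates across backward() calls
    optimizer.step()                              # one parameter update for the entire "epoch batch"
```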
I feel like with that memory available, testing scaling laws is a better research direction than testing full batch
Adding to other answers: even if you had enough memory, it would still be computationally inefficient. There are diminishing returns from increasing batch size in terms of how much the loss improves per step.
ClearlyCylindrical t1_iqna0cr wrote
If it were possible to do full batch all the time, minibatches would likely still be used. The stochasticity created by minibatch gradient descent generally improves a model's generalisation performance.