Submitted by ButterscotchLost421 t3_yvmuuc in MachineLearning
yanivbl t1_iwgb683 wrote
They had more GPUs, training in parallel. Not sure about CIFAR-10, but I read the number for ADM on ImageNet is ~1000 V100-days (i.e., roughly 1000 days on a single V100).
ButterscotchLost421 OP t1_iwggwv8 wrote
Thank you! What do you mean by ADM? Adam?
When training in parallel, which technique did they use? Computing the gradient of a batch of size `N` on each device and then synchronizing all the devices to get the mean gradient?
yanivbl t1_iwgjnht wrote
No, not Adam. I was referring to the model from the "Diffusion Models Beat GANs" paper (ADM = ablated diffusion model).
I've never trained such a model, just read about it. But yeah, it's most likely what you said (a.k.a. data parallelism).
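To make the idea concrete, here's a minimal sketch of synchronous data parallelism in plain Python (all names here are hypothetical, not from any real framework): each "device" computes the gradient on its own shard of the global batch, and the per-device gradients are then averaged (the all-reduce step) before the weight update. With equal shard sizes, the averaged gradient equals the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    # Gradient of mean squared error for a toy 1-D linear model y = w * x.
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, num_devices):
    # Split the global batch into equal shards, one per "device".
    shard = len(xs) // num_devices
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_devices)
    ]
    # "All-reduce": average the per-device gradients.
    return sum(grads) / num_devices

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

g_parallel = data_parallel_grad(w, xs, ys, num_devices=2)
g_single = grad_mse(w, xs, ys)
# g_parallel and g_single agree, which is why scaling out this way
# is mathematically equivalent to one big batch on a single GPU.
```

In real frameworks this averaging is done by collective communication (e.g. an all-reduce over NCCL in PyTorch's `DistributedDataParallel`), but the arithmetic is exactly this.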