Abstract: Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and RoBERTa-Base model on WikiText-103, respectively. We also provide the code and model checkpoint trajectory to reproduce the results and facilitate research on reusing historical weights for faster convergence.

Comments

IdentifiableParam t1_ir31zjs wrote on October 4, 2022 at 11:32 PM

Weird that this paper didn't seem to cite https://arxiv.org/abs/1409.4842v1 which also used Polyak averaging on models trained on ImageNet.

jeankaddour t1_ir4vnnv wrote on October 5, 2022 at 11:03 AM

Hi, the author here. Thank you for your comment.

While I was aware of GoogLeNet, I didn't read the paper in enough detail to know that they used Polyak averaging too. Thank you for making me aware of it. I'm happy to cite it in the next paper version.

However, the only time they mention averaging is:

Polyak averaging [13] was used to create the final model used at inference time.

My goal with the paper was to study the empirical convergence speed-ups in more detail and be precise about how averaging is used, not claiming to be the first one who applies some sort of averaging to improve the model's final performance (there are plenty of papers that do that already, e.g., the SWA paper mentioned in the related work section).

EDIT: Added the citation to the new version!

bernhard-lehner t1_ir43bht wrote on October 5, 2022 at 4:44 AM

Yeah, thats hardly a novel approach...but I have to admit that I also could spend more time looking if anyone else have had the same idea I'm trying at the moment. We really need "Schmidhuber as a Service" :)

jeankaddour t1_ir4vzq6 wrote on October 5, 2022 at 11:07 AM

Hi, the author here. Thank you for your comment.

My goal with the paper was not to present weight averaging as a novel approach; rather, to study the empirical convergence speed-ups in more detail.

Please have a look at the related work section where I discuss previous works using weight averaging, and feel free to let me know if I missed one that focuses on speedups.

jeankaddour t1_irdv4a9 wrote on October 7, 2022 at 8:57 AM

Thanks again for this pointer; this citation has now been added to the today-announced version.

TheInfelicitousDandy t1_ir53amf wrote on October 5, 2022 at 12:20 PM

Why Roberta on Wiki103? Wiki103 is generally used as a benchmark for training autoregressive language models, not MLM models.

jeankaddour t1_ir5iybp wrote on October 5, 2022 at 2:22 PM

Hi, the author here. Thanks for your interest!

I simply followed the RoBERTa pre-training script provided by fairseq, which happened to use Wiki103. Is there a reason why Wiki103 would be less suited for MLM than for AR models?

TheInfelicitousDandy t1_ir5syrj wrote on October 5, 2022 at 3:30 PM

It's not really about being suitable, but rather it doesn't follow standard evaluation set-ups. In general, MLM models are trained on some version of book corpus + a wiki dump (and each model tends to use their own version which makes comparisons hard). As such, RoBERTa is really meant to be trained on much more data. That training recipe only uses Wiki103 because it is smallish and easily available. 1) By training on a smaller dataset, you risk introducing a bunch of issues like overfitting and not being hyper-parameter tuned, which for an optimization paper, are kind of important. 2) It also means I can't easily compare it to previous works despite knowing the literature well. I'm pretty sure that a PPL of 4 is really high for a bidirectional model (even if its very low when considering an autoregressive model) 3) PPL isn't even valid for a bidirectional model since it doesn't form a valid probability over a sequence.

I guess this caught my eye (and I don't mean to be rude) because it feels like something someone would do when they don't really know the language modelling literature but just want to use it as a test environment for optimization. The easy solution would just be to use the adapative-inputs LM on wiki103 which is in fairseq and a standard model/dataset combination with reproducible training scripts.

jeankaddour t1_ir66cik wrote on October 5, 2022 at 4:57 PM

Thank you very much. This is extremely useful feedback and I appreciate your time spent on writing it! I will look into using the adapative-inputs LM on wiki103 the next time. I believe that bookcorpus + a wiki dump will likely not be in my computational budget, but I might try. Your guess of me being new to the LM literature and only wanting to use it as a testbed for optimization is right :) therefore, again, thanks for sharing your insights!

jeankaddour t1_irdvad7 wrote on October 7, 2022 at 9:00 AM

Thanks again for this feedback. I haven't trained on a different dataset yet, but I replaced all BERT perplexity numbers/plots with the MLM losses in the meantime. The paper has been updated today on Arxiv.