jeankaddour t1_irdv4a9 wrote
Reply to comment by IdentifiableParam in [R] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging by rlresearcher
Thanks again for this pointer; the citation has been added to the version announced today.
jeankaddour t1_ir66cik wrote
Reply to comment by TheInfelicitousDandy in [R] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging by rlresearcher
Thank you very much. This is extremely useful feedback and I appreciate the time you spent writing it! I will look into using the adaptive-inputs LM on Wiki103 next time. I suspect that BookCorpus plus a Wikipedia dump will not fit into my computational budget, but I might try. Your guess that I'm new to the LM literature and mainly want to use it as a testbed for optimization is right :) so, again, thanks for sharing your insights!
jeankaddour t1_ir5iybp wrote
Reply to comment by TheInfelicitousDandy in [R] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging by rlresearcher
Hi, the author here. Thanks for your interest!
I simply followed the RoBERTa pre-training script provided by fairseq, which happened to use Wiki103. Is there a reason why Wiki103 would be less suited for MLM than for AR models?
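For readers unfamiliar with the distinction raised here, below is a minimal PyTorch-style sketch of the two objectives. It is only an illustration, not the fairseq script referenced above; the shapes, the toy vocabulary, and the 15% masking rate are just the usual BERT-style defaults assumed for the demo.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the two pre-training objectives; `logits` stands in for
# a transformer's output so the snippet runs on its own.
vocab_size, seq_len = 1000, 128
tokens = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)

# Masked LM (BERT/RoBERTa): predict a random ~15% of positions, ignore the rest.
mask = torch.rand(1, seq_len) < 0.15
mlm_targets = tokens.masked_fill(~mask, -100)          # -100 = ignore_index
mlm_loss = F.cross_entropy(logits.view(-1, vocab_size),
                           mlm_targets.view(-1), ignore_index=-100)

# Autoregressive LM (GPT-style): predict every next token from its prefix.
ar_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                          tokens[:, 1:].reshape(-1))

print(f"MLM loss: {mlm_loss:.3f}, AR loss: {ar_loss:.3f}")
```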
jeankaddour t1_ir4vzq6 wrote
Reply to comment by bernhard-lehner in [R] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging by rlresearcher
Hi, the author here. Thank you for your comment.
My goal with the paper was not to present weight averaging as a novel approach, but rather to study the empirical convergence speed-ups in more detail.
Please have a look at the related work section, where I discuss previous works that use weight averaging, and feel free to let me know if I missed one that focuses on speed-ups.
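As a rough illustration of the kind of weight averaging discussed here, a uniform average over the latest k checkpoints, here is a minimal PyTorch-style sketch. It is a hypothetical example, not the paper's actual implementation; the `average_latest_checkpoints` helper, the value of k, and the toy model are all made up for the demo.

```python
import copy
from collections import deque

import torch

def average_latest_checkpoints(model, checkpoint_states):
    """Return a copy of `model` whose floating-point weights are the uniform
    average of the given checkpoint state_dicts (e.g., the latest k)."""
    avg_state = copy.deepcopy(checkpoint_states[0])
    for key in avg_state:
        if torch.is_floating_point(avg_state[key]):
            avg_state[key] = torch.stack(
                [s[key].float() for s in checkpoint_states]).mean(dim=0)
    averaged = copy.deepcopy(model)
    averaged.load_state_dict(avg_state)
    return averaged

# Keep a rolling window of the latest k checkpoints during training and
# evaluate the averaged model instead of (or alongside) the most recent one.
k = 5
latest = deque(maxlen=k)
model = torch.nn.Linear(10, 2)  # stand-in for the real network
for step in range(20):
    # ... one optimizer step on `model` would go here ...
    latest.append(copy.deepcopy(model.state_dict()))
eval_model = average_latest_checkpoints(model, list(latest))
```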
jeankaddour t1_ir4vnnv wrote
Reply to comment by IdentifiableParam in [R] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging by rlresearcher
Hi, the author here. Thank you for your comment.
While I was aware of GoogLeNet, I hadn't read the paper in enough detail to notice that they used Polyak averaging too. Thank you for making me aware of it; I'm happy to cite it in the next version of the paper.
However, the only time they mention averaging is: "Polyak averaging [13] was used to create the final model used at inference time."
My goal with the paper was to study the empirical convergence speed-ups in more detail and to be precise about how averaging is used, not to claim to be the first to apply some form of averaging to improve a model's final performance (plenty of papers already do that, e.g., the SWA paper mentioned in the related work section).
EDIT: Added the citation to the new version!
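For context on the Polyak averaging mentioned in the quote, here is a minimal sketch of an exponential-moving-average variant of weight averaging. It is purely illustrative; the decay value and the `update_ema` helper are made up, and GoogLeNet's exact procedure may differ.

```python
import copy
import torch

def update_ema(ema_state, model, decay=0.999):
    """Update an exponential moving average of the model's weights in place."""
    with torch.no_grad():
        for name, param in model.state_dict().items():
            if torch.is_floating_point(param):
                ema_state[name].mul_(decay).add_(param, alpha=1.0 - decay)

model = torch.nn.Linear(10, 2)               # stand-in for the real network
ema_state = copy.deepcopy(model.state_dict())
for step in range(100):
    # ... one optimizer step on `model` would go here ...
    update_ema(ema_state, model)
# At inference time, load `ema_state` into a copy of the model and use that.
```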
jeankaddour t1_irdvad7 wrote
Reply to comment by jeankaddour in [R] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging by rlresearcher
Thanks again for this feedback. I haven't trained on a different dataset yet, but in the meantime I have replaced all BERT perplexity numbers/plots with MLM losses. The paper was updated on arXiv today.
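For readers wondering about the change: perplexity is just the exponential of the average cross-entropy per predicted token, but a masked LM only predicts the masked positions, so the resulting "perplexity" is not directly comparable to autoregressive perplexity; reporting the raw MLM loss sidesteps that. A tiny illustration (the loss value is hypothetical):

```python
import math

mlm_loss_nats = 1.8             # hypothetical average cross-entropy over masked tokens
print(math.exp(mlm_loss_nats))  # ~6.05, the corresponding "perplexity"
```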