Submitted by kaphed t3_124pbq5 in MachineLearning

Looking at some old tables:

https://arxiv.org/pdf/1512.03385.pdf, Table 4

https://arxiv.org/pdf/1905.11946.pdf, Table 2

Why do the ResNet-152 results vary? E.g. the Top-1 error on the ImageNet validation set is 19.38% in the original, but 22.2% in the EfficientNet paper.

Normally I would assume this type of result would simply be copied from the previous publication.


Comments


suflaj t1_je1uvo8 wrote

They probably redid the experiments themselves. ResNets also saw some changes shortly after release, I believe, and the EfficientNet authors could have used different pretrained weights. AFAIK He et al. never released theirs.

Furthermore, the Wolfram and PyTorch pretrained weights are also at around 22% top-1 error, so that is probably the representative error rate. Since PyTorch provides weights that reach roughly 18% top-1 error with some small adjustments to the training procedure, it is possible the original authors got lucky with the hyperparameters, or employed some techniques they didn't describe in the paper.
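
For reference, a minimal sketch of the two ResNet-152 weight sets this refers to (assumes torchvision >= 0.13 and its multi-weight API; the accuracies in the comments are the ones documented by torchvision, not measured here):

```python
# Sketch: comparing torchvision's two ResNet-152 ImageNet weight sets.
# Assumes torchvision >= 0.13 (multi-weight API); downloads the weights on first use.
import torch
from torchvision.models import resnet152, ResNet152_Weights

# Documented top-1 accuracy ~78.3% (~21.7% error), original-style training recipe.
weights_v1 = ResNet152_Weights.IMAGENET1K_V1
# Documented top-1 accuracy ~82.3% (~17.7% error), improved training recipe.
weights_v2 = ResNet152_Weights.IMAGENET1K_V2

model = resnet152(weights=weights_v1).eval()

# Each weight set ships the eval preprocessing it was validated with;
# reusing it matters if you want to reproduce the reported numbers.
preprocess = weights_v1.transforms()

with torch.no_grad():
    fake_image = torch.randint(0, 256, (3, 500, 400), dtype=torch.uint8)  # stand-in image
    logits = model(preprocess(fake_image).unsqueeze(0))

print(logits.shape)  # torch.Size([1, 1000])
```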


U03B1Q t1_je4xekj wrote

https://dl.acm.org/doi/10.1145/3324884.3416545

There was an ASE paper (linked above) that found that even with identical hyperparameters and random seeds, trained networks varied by about 2% in accuracy due to non-determinism in the parallel computing stack. If the EfficientNet authors chose to retrain ResNet-152 instead of copying the old numbers, a discrepancy of this size is in line with that work.
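
A minimal sketch of how that non-determinism is typically constrained in PyTorch (these are general-purpose switches for a CUDA setup, not something the papers describe; even with all of them set, some ops have no deterministic implementation):

```python
# Sketch: seeding alone does not remove all run-to-run variance on GPUs;
# these extra switches ask PyTorch/cuDNN for deterministic kernels where available.
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds the CUDA RNGs in recent PyTorch

    # Disable cuDNN autotuning (it can pick different kernels per run)
    # and request deterministic cuDNN algorithms.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    # Error out on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)

    # Needed by cuBLAS for deterministic matmuls; set before the first CUDA call.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

make_deterministic(42)
```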


MOSFETBJT t1_je3xbo7 wrote

Probably batch size differences?
