Submitted by netw0rkf10w t3_zmpdo0 in MachineLearning
I am looking for the hyper-parameter settings that could produce the highest accuracies for plain ViT (i.e., without modifying the model architecture) on ImageNet-1K, training from scratch. A lot of people in this sub have experience with ViT, so I hope I can get some help here.
For ViT-S, we have a recipe that can achieve 80.0% top-1 accuracy from this paper: Better plain ViT baselines for ImageNet-1k. Unfortunately, they did not experiment with larger architectures (ViT-B or ViT-L). I've sketched my understanding of their recipe below.
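For reference, here is a from-memory reconstruction of that paper's ViT-S/16 recipe (arXiv:2205.01580). Every value below is a recollection rather than a verified number; check the paper and the config it ships with (I believe big_vision/configs/vit_s16_i1k.py) before relying on any of it:

```python
# From-memory sketch of the "Better plain ViT baselines" ViT-S/16 recipe.
# All values are approximations to verify against the paper and
# big_vision/configs/vit_s16_i1k.py -- not an authoritative config.
vit_s16_recipe = {
    "model": {
        "variant": "S/16",
        "pool_type": "gap",      # global average pooling, no [cls] token
        "posemb": "sincos2d",    # fixed 2D sin-cos position embeddings
        "head_init": "zeros",
    },
    "data": {
        "dataset": "imagenet2012",
        "resolution": 224,
        "augmentation": ["inception_crop", "hflip", "randaug(2, 10)"],
        "mixup_alpha": 0.2,
    },
    "training": {
        "optimizer": "adamw",
        "lr": 1e-3,
        "weight_decay": 1e-4,    # decoupled weight decay
        "grad_clip_norm": 1.0,
        "schedule": "cosine",
        "warmup_steps": 10_000,
        "batch_size": 1024,
        # The 80.0% figure quoted above comes from the longer schedule;
        # the paper's headline 90-epoch run lands around 76.5%, IIRC.
        "epochs": 300,
    },
}
```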
For ViT-B, ViT-L and ViT-H, the authors of MAE claimed to achieve 82.3%, 82.6% and 83.1%, respectively, when training from scratch (see their Table 3). However, I was unable to reproduce these results using their code and their reported hyper-parameters. My reading of their from-scratch recipe is sketched below, in case someone can spot where I went wrong.
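This is what I believe the MAE paper's supervised from-scratch baseline looks like (from their appendix), expressed as a PyTorch/timm sketch. Everything here is from memory and should be treated as an assumption to verify against the paper; timm is just a convenient stand-in, not the authors' code:

```python
# Hedged sketch of the MAE paper's from-scratch supervised baseline
# (He et al., arXiv:2111.06377, appendix), written from memory --
# treat every number as an assumption to double-check.
import torch
import timm

model = timm.create_model(
    "vit_base_patch16_224",
    num_classes=1000,
    drop_path_rate=0.1,          # 0.2 for ViT-L/H, if I remember correctly
)

batch_size = 4096
base_lr = 1e-4                   # scaled linearly with batch size
lr = base_lr * batch_size / 256  # -> 1.6e-3 effective learning rate

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=lr,
    betas=(0.9, 0.95),           # beta2 = 0.95 instead of the default 0.999
    weight_decay=0.3,            # unusually large wd, per the paper
)
# Also from memory: cosine decay with ~20 warmup epochs, 300 epochs
# for ViT-B / 200 for ViT-L and ViT-H, RandAugment(9, 0.5), label
# smoothing 0.1, mixup 0.8, cutmix 1.0, and an EMA of the weights.
```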
Any references to strong ViT baselines with reproducible results would be very much appreciated! Thanks.
CatalyzeX_code_bot t1_j0calqn wrote
Found relevant code at https://github.com/google-research/big_vision
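(For the ViT-S recipe above, the relevant config in that repo should be big_vision/configs/vit_s16_i1k.py; if I recall the README correctly, training launches with something like `python -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir <dir>`, but check the repo's README for the exact invocation.)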