Submitted by netw0rkf10w t3_zmpdo0 in MachineLearning
I am looking for the hyper-parameter settings that could produce the highest accuracies for plain ViT (i.e., without modifying the model architecture) on ImageNet-1K, training from scratch. A lot of people in this sub have experience with ViT, so I hope I can get some help here.
For ViT-S, we have a recipe that can achieve 80.0% top-1 accuracy from this paper: Better plain ViT baselines for ImageNet-1k. Unfortunately, they did not experiment with larger architectures (ViT-B or ViT-L). I've sketched my understanding of their recipe below.
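For reference, here is a from-memory reconstruction of that paper's ViT-S/16 recipe (arXiv:2205.01580). Every value below is a recollection rather than a verified number; check the paper and the config it ships with (I believe big_vision/configs/vit_s16_i1k.py) before relying on any of it:

```python
# From-memory sketch of the "Better plain ViT baselines" ViT-S/16 recipe.
# All values are approximations to verify against the paper and
# big_vision/configs/vit_s16_i1k.py -- not an authoritative config.
vit_s16_recipe = {
    "model": {
        "variant": "S/16",
        "pool_type": "gap",      # global average pooling, no [cls] token
        "posemb": "sincos2d",    # fixed 2D sin-cos position embeddings
        "head_init": "zeros",
    },
    "data": {
        "dataset": "imagenet2012",
        "resolution": 224,
        "augmentation": ["inception_crop", "hflip", "randaug(2, 10)"],
        "mixup_alpha": 0.2,
    },
    "training": {
        "optimizer": "adamw",
        "lr": 1e-3,
        "weight_decay": 1e-4,    # decoupled weight decay
        "grad_clip_norm": 1.0,
        "schedule": "cosine",
        "warmup_steps": 10_000,
        "batch_size": 1024,
        # The 80.0% figure quoted above comes from the longer schedule;
        # the paper's headline 90-epoch run lands around 76.5%, IIRC.
        "epochs": 300,
    },
}
```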
For ViT-B, ViT-L and ViT-H, the authors of MAE claimed to achieve 82.3%, 82.6% and 83.1%, respectively, when training from scratch (see their Table 3). However, I was unable to reproduce these results using their code and their reported hyper-parameters. My reading of their from-scratch recipe is sketched below, in case someone can spot where I went wrong.
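This is what I believe the MAE paper's supervised from-scratch baseline looks like (from their appendix), expressed as a PyTorch/timm sketch. Everything here is from memory and should be treated as an assumption to verify against the paper; timm is just a convenient stand-in, not the authors' code:

```python
# Hedged sketch of the MAE paper's from-scratch supervised baseline
# (He et al., arXiv:2111.06377, appendix), written from memory --
# treat every number as an assumption to double-check.
import torch
import timm

model = timm.create_model(
    "vit_base_patch16_224",
    num_classes=1000,
    drop_path_rate=0.1,          # 0.2 for ViT-L/H, if I remember correctly
)

batch_size = 4096
base_lr = 1e-4                   # scaled linearly with batch size
lr = base_lr * batch_size / 256  # -> 1.6e-3 effective learning rate

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=lr,
    betas=(0.9, 0.95),           # beta2 = 0.95 instead of the default 0.999
    weight_decay=0.3,            # unusually large wd, per the paper
)
# Also from memory: cosine decay with ~20 warmup epochs, 300 epochs
# for ViT-B / 200 for ViT-L and ViT-H, RandAugment(9, 0.5), label
# smoothing 0.1, mixup 0.8, cutmix 1.0, and an EMA of the weights.
```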
Any references to strong ViT baselines with reproducible results would be very much appreciated! Thanks.
CatalyzeX_code_bot t1_j0calqn wrote
Found relevant code at https://github.com/google-research/big_vision
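(For the ViT-S recipe above, the relevant config in that repo should be big_vision/configs/vit_s16_i1k.py; if I recall the README correctly, training launches with something like `python -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir <dir>`, but check the repo's README for the exact invocation.)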