Submitted by netw0rkf10w t3_zmpdo0 in MachineLearning

I am looking for the hyper-parameter settings that produce the highest accuracy for a plain ViT (i.e., without modifying the model architecture) trained from scratch on ImageNet-1K. A lot of people in this sub have experience with ViT, so I hope I can get some help here.

For ViT-S, there is a recipe that achieves 80.0% top-1 accuracy, from the paper Better plain ViT baselines for ImageNet-1k. Unfortunately, the authors did not experiment with larger architectures (ViT-B or ViT-L).

For ViT-B, ViT-L and ViT-H, the authors of MAE claimed to achieve 82.3%, 82.6% and 83.1%, respectively (see their Table 3). However, I was unable to reproduce these results using their code and their reported hyper-parameters.

Any references to strong ViT baselines with reproducible results would be very much appreciated! Thanks.
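For anyone comparing recipes, the handful of knobs that typically differ between these baselines (schedule length, batch size, learning rate, regularization) can be collected in a small config object. Below is a minimal sketch; the values are illustrative placeholders in the typical range for supervised ViT training, not the published settings of any paper mentioned above:

```python
from dataclasses import dataclass

@dataclass
class ViTTrainConfig:
    # Illustrative placeholder values -- NOT the exact recipe from any cited paper.
    model: str = "vit_b16"
    epochs: int = 100            # e.g., MAE's supervised baselines use ~100 epochs
    batch_size: int = 1024
    base_lr: float = 1e-3        # usually scaled linearly with batch size
    warmup_epochs: int = 5
    weight_decay: float = 0.05
    label_smoothing: float = 0.1
    mixup_alpha: float = 0.8
    use_randaugment: bool = True

    def scaled_lr(self, ref_batch: int = 256) -> float:
        """Common linear scaling rule: lr = base_lr * batch_size / ref_batch."""
        return self.base_lr * self.batch_size / ref_batch

cfg = ViTTrainConfig()
print(cfg.scaled_lr())
```

When reproducing a paper's numbers, it is worth diffing a config like this against the paper's appendix line by line; small differences in warmup, weight decay, or augmentation strength can easily account for a point of top-1 accuracy.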

Comments

netw0rkf10w OP t1_j0gcgxy wrote

Thanks. DeiT is actually a very nice paper from which one can learn a lot. But the training schedules they used seem rather long to me: 300 to 800 epochs. The authors of MAE managed to reach 82.3% for ViT-B after only 100 epochs, so I'm wondering whether anyone in the literature has been able to match that.
