Submitted by MyActualUserName99 t3_10mmniu in MachineLearning
I'm currently at the point in my PhD where I've developed some very successful CNN components (architecture, activation function, etc.) that outperform the default choices on CIFAR10, CIFAR100, Flowers, Caltech101, and other smaller datasets. Given how strong the results currently are, we want to publish at a top-tier conference, specifically NeurIPS this spring, with the deadline around May 13th. However, we (my advisor and I) agree that to publish at NeurIPS, our developments need to be backed up by ImageNet results.
The problem is that we have never trained on ImageNet before (so no experience) and have a limited computational budget. Our university owns 2 A100 40 GB GPUs that we can use, but they are shared across the entire university, so a 2-day job sits in the queue for about a week (I don't know if we can get the results in time for May). We also don't know if we can get a $2,500 grant in time to use cloud resources instead.
For those who have trained on ImageNet: what are the common pitfalls, and what are the best ways to download and transfer the dataset? If you trained on the cloud, how did you do it? How long did training take, and what did it cost? Did you run each model once or three times? Did you do early stopping using the validation set or the test set?
NOTE: We will only be using TensorFlow...
MadScientist-1214 t1_j6433qc wrote
At my institute, nobody had trained on ImageNet before, so I had to figure it out myself too. If you train architectures like VGG, it does not take long: under 2 days on a single A100, and at most around 5 days on a worse GPU. The most important thing is to read the data from an SSD; that alone cut training time by around 2 days for me. A good learning rate scheduler is really important. Most researchers ignore the test set and use only the validation set. Also important: use mixed precision. If you need to run a lot of experiments, it is really worth tuning your training speed.
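Since the OP plans to use TensorFlow, here is a minimal sketch of the three tips above (mixed precision, a learning rate schedule, and a fast input pipeline fed from local SSD). The cosine schedule, batch size, epoch count, TFRecord feature keys, and file paths are illustrative assumptions, not details the commenter specified:

```python
import tensorflow as tf

# Mixed precision: float16 compute with float32 master weights.
# Keras' model.fit applies loss scaling automatically under this policy.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Cosine-decay learning rate schedule over 90 epochs -- a common ImageNet
# baseline; the initial LR, batch size, and epoch count are placeholders.
BATCH_SIZE = 256
STEPS_PER_EPOCH = 1_281_167 // BATCH_SIZE  # ImageNet-1k training set size
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=90 * STEPS_PER_EPOCH,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)


def parse_example(serialized):
    """Decode one record, assuming the standard ImageNet TFRecord keys."""
    features = tf.io.parse_single_example(
        serialized,
        {
            "image/encoded": tf.io.FixedLenFeature([], tf.string),
            "image/class/label": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    image = tf.io.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, features["image/class/label"]


def make_dataset(file_pattern):
    """Input pipeline: shards on local SSD, parallel decode, prefetch."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = files.interleave(
        tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE
    )
    ds = ds.shuffle(10_000)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(BATCH_SIZE, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)


# Example usage (paths and model are hypothetical):
# train_ds = make_dataset("/local_ssd/imagenet/train-*")
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
# model.fit(train_ds, epochs=90)
```

The prefetch and parallel interleave/map calls are what keep the GPU busy when the data sits on fast storage; if the shards live on a slow network filesystem, copying them to the node's local SSD first is usually the single biggest speedup.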