Submitted by MyActualUserName99 t3_10mmniu in MachineLearning
I'm currently at the point in my PhD where I've developed some very successful CNN components (architecture, activation function, etc.) that outperform the default choices on CIFAR10, CIFAR100, Flowers, Caltech101, and other smaller datasets. Given how strong the results currently are, we want to publish at a top-tier conference, specifically NeurIPS this spring, with the deadline around May 13th. However, we (my advisor and I) agree that to publish at NeurIPS, our developments need to be backed up by ImageNet results.
The problem is that we have never trained on ImageNet before (so no experience) and have a limited computational budget. Our university owns 2 A100 40 GB GPUs that we can use, but they are shared across the entire university, so a 2-day job sits in the queue for about a week (I don't know if we can get the results in time for May). We also don't know if we can get a $2,500 grant in time to use cloud resources instead.
For those who have trained on ImageNet: what are the common pitfalls, and what are the best ways to download and transfer the dataset? If you trained on the cloud, how did you do it? How long did training take, and what did it cost? Did you run each model once or three times? Did you do early stopping using the validation set or the test set?
NOTE: We will only be using TensorFlow...
MadScientist-1214 t1_j6433qc wrote
At my institute, nobody had trained on ImageNet before, so I had to figure it out myself too. If you train architectures like VGG, it does not take long: under 2 days on a single A100, and at most around 5 days on a worse GPU. The most important thing is to read the data from an SSD; that alone cut training time by around 2 days for me. A good learning rate scheduler is really important. Most researchers ignore the test set and use only the validation set. Also important: use mixed precision. If you need to run a lot of experiments, it is really worth tuning your training speed.
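Since the OP plans to use TensorFlow, here is a minimal sketch of the three tips above (mixed precision, a learning rate schedule, and a fast input pipeline fed from local SSD). The cosine schedule, batch size, epoch count, TFRecord feature keys, and file paths are illustrative assumptions, not details the commenter specified:

```python
import tensorflow as tf

# Mixed precision: float16 compute with float32 master weights.
# Keras' model.fit applies loss scaling automatically under this policy.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Cosine-decay learning rate schedule over 90 epochs -- a common ImageNet
# baseline; the initial LR, batch size, and epoch count are placeholders.
BATCH_SIZE = 256
STEPS_PER_EPOCH = 1_281_167 // BATCH_SIZE  # ImageNet-1k training set size
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=90 * STEPS_PER_EPOCH,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)


def parse_example(serialized):
    """Decode one record, assuming the standard ImageNet TFRecord keys."""
    features = tf.io.parse_single_example(
        serialized,
        {
            "image/encoded": tf.io.FixedLenFeature([], tf.string),
            "image/class/label": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    image = tf.io.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, features["image/class/label"]


def make_dataset(file_pattern):
    """Input pipeline: shards on local SSD, parallel decode, prefetch."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = files.interleave(
        tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE
    )
    ds = ds.shuffle(10_000)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(BATCH_SIZE, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)


# Example usage (paths and model are hypothetical):
# train_ds = make_dataset("/local_ssd/imagenet/train-*")
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
# model.fit(train_ds, epochs=90)
```

The prefetch and parallel interleave/map calls are what keep the GPU busy when the data sits on fast storage; if the shards live on a slow network filesystem, copying them to the node's local SSD first is usually the single biggest speedup.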