Submitted by MazenAmria t3_zhvwvl in deeplearning
Hello everyone, I'm working on my Bachelor's graduation project, which is a study of Microsoft's Swin Transformer architecture and how it performs when compressed using knowledge distillation.
However, I'm having difficulties training the model: it has to be trained on ImageNet for around 300 epochs, and I want to make several modifications and evaluate each of them. I have a GTX 1070, which is a decent GPU for deep learning tasks, but in this case it's not enough to run even a single experiment within the given time.
As an alternative approach, I thought of running the same experiments on the MNIST dataset and comparing the results against training the same student model without any distillation, so I can isolate the effect of the distillation. But I have some concerns about MNIST itself: since much simpler models already perform well on MNIST, the results of using a Swin Transformer there might be useless or impractical.
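For context, this is the kind of distillation objective I have in mind — a minimal sketch of the standard soft-target loss (Hinton et al., 2015), assuming PyTorch; the temperature `T` and mixing weight `alpha` here are placeholder hyperparameters, not values from my experiments:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```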
I would be happy to hear some advice and opinions.
UPDATE: I've also considered using ImageNet mini (a ~4 GB subset of ImageNet), but the accuracy improves very slowly.
sqweeeeeeeeeeeeeeeps t1_izob3yb wrote
MNIST and ImageNet span a huge range of difficulty. Try something in between, preferably multiple datasets, e.g. CIFAR-10 and CIFAR-100. I would expect it to perform more similarly to the full Swin model on CIFAR-10, because the data is less complex.
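Swapping in CIFAR-10 is cheap with torchvision — a minimal sketch, assuming you resize the 32x32 images up to the 224x224 input Swin expects; the normalization stats and batch size below are the commonly used values, not anything specific to your setup:

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(224),  # upsample 32x32 CIFAR images to Swin's input size
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # CIFAR-10 channel means
                         (0.2470, 0.2435, 0.2616)),  # CIFAR-10 channel stds
])

train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
```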