Submitted by MazenAmria t3_zhvwvl in deeplearning
Hello everyone, I'm working on my Bachelor's graduation project: a study of Microsoft's Swin Transformer architecture and how it performs when compressed using Knowledge Distillation.
However, I'm having difficulty training the model: it needs around 300 epochs on ImageNet, and I want to make several modifications and evaluate each of them. I have a GTX 1070, which is a decent GPU for deep learning, but in this case it isn't enough to finish even a single experiment within the time I have.
As an alternative, I thought of running the same experiments on MNIST and comparing the results against training the same student model without any distillation, so I can isolate the effect of the distillation. But I have some concerns about MNIST itself: since much simpler models already perform well on it, the results of using a Swin Transformer there might be meaningless or impractical.
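For reference, here is a minimal sketch of the standard soft-target distillation loss (Hinton et al., 2015) that this kind of experiment would compare against plain training. The temperature `T` and weight `alpha` are placeholder values, not anything from a specific Swin distillation recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a soft KL term against the teacher with the usual hard-label CE."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients, as in the original paper
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

In the training loop the teacher runs in eval mode under `torch.no_grad()`, so only the student receives gradients; the "no distillation" baseline is just the same student trained with `hard` alone.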
I would be happy to hear some advice and opinions.
UPDATE: I've also considered using ImageNet-mini (a subset of ImageNet that is only ~4 GB), but the accuracy improves very slowly.
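If it helps anyone, here's a rough sketch of carving a fixed per-class subset out of a full ImageNet directory tree with torchvision, as an alternative to a prepackaged mini dataset. The path and the 100-images-per-class figure are just examples:

```python
import random
from torch.utils.data import Subset
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# ImageFolder expects the usual one-directory-per-class layout.
full = datasets.ImageFolder("/data/imagenet/train", transform=tfm)

# Group sample indices by class, then keep up to 100 per class.
random.seed(0)
per_class = {}
for idx, (_, label) in enumerate(full.samples):
    per_class.setdefault(label, []).append(idx)
keep = [i for idxs in per_class.values()
        for i in random.sample(idxs, min(100, len(idxs)))]
train_subset = Subset(full, keep)
```

Keeping all 1000 classes but fewer images per class preserves the task's difficulty better than dropping classes, though convergence will still be slower than on the full set.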
suflaj t1_izoh23q wrote
As someone who tried fine-tuning Swin as part of my graduate thesis, I'll warn you not to expect good results from the Tiny version. No matter what detector I used, it performed worse than the ancient RetinaNet for some reason... Regression was near perfect, albeit with many duplicate detections, but classification was complete garbage, topping out at 0.45 mAP (whereas RetinaNet can get around 0.8 no problem).
So, take at least the Small version.
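Swapping Tiny for Small is a one-line change if you build the models with timm, for example. The model names below assume timm's registry; verify them on your version with `timm.list_models("swin*")`:

```python
import timm

# Small as the teacher, Tiny as the student to be distilled.
teacher = timm.create_model("swin_small_patch4_window7_224", pretrained=True)
student = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False)
```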