Submitted by MazenAmria t3_zhvwvl in deeplearning
Hello everyone, I'm working on my Bachelor's graduation project, which is a study of Microsoft's Swin Transformer architecture and how it performs when compressed using knowledge distillation.
However, I'm having difficulties training the model: it has to be trained on ImageNet for around 300 epochs, and I want to make several modifications and evaluate each of them. I have a GTX 1070, which is a decent GPU for deep learning tasks, but in this case it's not enough to run even a single experiment within the given time.
As an alternative approach, I thought of running the same experiments on the MNIST dataset and comparing the results against training the same student model without any distillation, so I can isolate the effect of the distillation. But I have some concerns about MNIST itself: since much simpler models already perform well on MNIST, the results of using a Swin Transformer there might be useless or impractical.
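For context, this is the kind of distillation objective I have in mind — a minimal sketch of the standard soft-target loss (Hinton et al., 2015), assuming PyTorch; the temperature `T` and mixing weight `alpha` here are placeholder hyperparameters, not values from my experiments:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```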
I would be happy to hear some advice and opinions.
UPDATE: I've also considered using ImageNet mini (a ~4 GB subset of ImageNet), but the accuracy improves very slowly.
sqweeeeeeeeeeeeeeeps t1_izob3yb wrote
MNIST and ImageNet span a huge range of difficulty. Try something in between, preferably multiple datasets, e.g. CIFAR-10 and CIFAR-100. I would expect it to perform more similarly to the full Swin model on CIFAR-10, because the data is less complex.
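Swapping in CIFAR-10 is cheap with torchvision — a minimal sketch, assuming you resize the 32x32 images up to the 224x224 input Swin expects; the normalization stats and batch size below are the commonly used values, not anything specific to your setup:

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(224),  # upsample 32x32 CIFAR images to Swin's input size
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # CIFAR-10 channel means
                         (0.2470, 0.2435, 0.2616)),  # CIFAR-10 channel stds
])

train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
```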