MazenAmria OP t1_izrgnco wrote
Reply to comment by pr0d_ in Advices for Deep Learning Research on SWIN Transformer and Knowledge Distillation by MazenAmria
I remember reading it; I'll read it again and discuss it. Thanks.
MazenAmria OP t1_izrgk9j wrote
Reply to comment by suflaj in Advices for Deep Learning Research on SWIN Transformer and Knowledge Distillation by MazenAmria
I'm already using a pretrained model as the teacher. But the distillation itself costs nearly as much as training a model from scratch. I'm not insisting, but I feel like I'm doing something wrong and needed some advice (note that I've only had theoretical experience in this area of research; this is the first time I'm doing it practically).
Thanks for your comments.
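For what it's worth, the cost point is easy to see at the level of a single training step: every batch needs a full teacher forward pass on top of the student's forward and backward passes. A minimal sketch of a standard soft-label (Hinton-style) distillation step, assuming hypothetical `teacher` and `student` classifiers that map images to logits and an `optimizer` over the student's parameters only:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, labels, optimizer,
                      temperature=4.0, alpha=0.5):
    teacher.eval()
    with torch.no_grad():                 # no autograd graph for the frozen teacher
        teacher_logits = teacher(images)  # still a full forward pass per batch

    student_logits = student(images)

    # KL divergence between softened distributions, scaled by T^2 as usual
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # ordinary cross-entropy on the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

So even with a pretrained teacher, each epoch runs roughly one extra forward pass through the (larger) teacher per batch, which is why the wall-clock cost ends up close to ordinary training.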
MazenAmria OP t1_izpii1s wrote
Reply to comment by sqweeeeeeeeeeeeeeeps in Advices for Deep Learning Research on SWIN Transformer and Knowledge Distillation by MazenAmria
To examine whether SWIN itself is overparameterized or not.
MazenAmria OP t1_izonquh wrote
Reply to comment by suflaj in Advices for Deep Learning Research on SWIN Transformer and Knowledge Distillation by MazenAmria
That's sad; I'm starting to believe that this research idea is impractical or, maybe more accurately, overly ambitious.
MazenAmria OP t1_izon556 wrote
Reply to comment by sqweeeeeeeeeeeeeeeps in Advices for Deep Learning Research on SWIN Transformer and Knowledge Distillation by MazenAmria
> I would expect it to perform more similarly to the full SWIN model on CIFAR-10 because less data complexity.
And that's the problem. If I got, say, 98% accuracy on CIFAR-10 with SWIN-Tiny and then got the same 98% with a smaller model, I wouldn't be proving anything: plenty of simple models already reach 98% on CIFAR-10, so what improvement would I have shown over SWIN-Tiny? Doing the same thing on ImageNet would be a different story.
MazenAmria OP t1_izt68w9 wrote
Reply to comment by suflaj in Advices for Deep Learning Research on SWIN Transformer and Knowledge Distillation by MazenAmria
I'm using `with torch.no_grad():` when calculating the output of the teacher model.
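Roughly this pattern, assuming `teacher` is the pretrained SWIN model and `images` is the current batch (both names are placeholders); `torch.no_grad()` together with `teacher.eval()` skips building the teacher's autograd graph, though the teacher forward pass itself is still paid on every batch:

```python
import torch

teacher.eval()                  # fix dropout / batch-norm behaviour for inference
with torch.no_grad():           # don't track gradients for the frozen teacher
    teacher_logits = teacher(images)
# teacher_logits are then used as soft targets in the student's distillation loss
```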