MazenAmria OP t1_izrgk9j wrote on December 11, 2022 at 7:50 AM

Reply to comment by suflaj in Advices for Deep Learning Research on SWIN Transformer and Knowledge Distillation by MazenAmria

I'm already using a pretrained model as the teacher model. But the distillation part itself costs nearly as much as training a model. I'm not insisting, but I feel like I'm doing something wrong and need some advice (note that I've only had theoretical experience in these areas of research; this is the first time I'm doing it in practice).

Thanks for your comments.

suflaj t1_izruvvi wrote on December 11, 2022 at 11:15 AM

That makes no sense. Are you sure you're not doing backprop on the teacher model? It should be a lot less resource intensive.

Furthermore, check how you're distilling the model, i.e. which layers and which weights. Generally, for transformer architectures, you distill the first (embedding) layer, the attention and hidden layers, and the final (prediction) layer. Distilling only the prediction layer works poorly.
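
A rough sketch of what a loss over all of those levels could look like. Everything named here is an assumption for illustration: the dict layout of the model outputs, the `proj` layer that maps the student width to the teacher width, MSE for the intermediate terms, equal weighting of the terms, and matching head counts for the attention maps. It is not a SWIN-specific recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_level_distill_loss(s_out, t_out, proj, temperature=2.0):
    """Sum embedding, attention, hidden-state and prediction losses.

    s_out / t_out: dicts with "embed", "hidden" (list), "attn" (list), "logits".
    proj: linear map from student width to teacher width (hypothetical).
    """
    embed_loss = F.mse_loss(proj(s_out["embed"]), t_out["embed"])
    hidden_loss = sum(
        F.mse_loss(proj(s), t) for s, t in zip(s_out["hidden"], t_out["hidden"])
    )
    # Assumes student and teacher attention maps have the same shape
    attn_loss = sum(F.mse_loss(s, t) for s, t in zip(s_out["attn"], t_out["attn"]))
    # KL divergence between temperature-softened predictions, scaled by T^2
    pred_loss = F.kl_div(
        F.log_softmax(s_out["logits"] / temperature, dim=-1),
        F.softmax(t_out["logits"] / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return embed_loss + hidden_loss + attn_loss + pred_loss

# Toy tensors standing in for real model outputs
torch.manual_seed(0)
B, N, H, Ds, Dt, C = 2, 5, 2, 8, 16, 10  # batch, tokens, heads, widths, classes

def fake_outputs(width):
    return {
        "embed": torch.randn(B, N, width),
        "hidden": [torch.randn(B, N, width) for _ in range(2)],
        "attn": [torch.softmax(torch.randn(B, H, N, N), dim=-1) for _ in range(2)],
        "logits": torch.randn(B, C),
    }

proj = nn.Linear(Ds, Dt)
loss = multi_level_distill_loss(fake_outputs(Ds), fake_outputs(Dt), proj)
```

In a real run the teacher outputs would come from a forward pass without gradients, and `proj` would be trained together with the student.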

MazenAmria OP t1_izt68w9 wrote on December 11, 2022 at 5:51 PM

I'm using with torch.no_grad(): when calculating the output of the teacher model.
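
A minimal sketch of what a frozen-teacher step of this kind could look like. The toy models, the `distill_step` helper, the temperature, and the plain KL loss on the logits are assumptions for illustration, not the actual training code discussed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, x, temperature=2.0):
    """One distillation step with a frozen teacher (hypothetical helper)."""
    teacher.eval()  # disable dropout and similar layers in the teacher
    with torch.no_grad():  # no autograd graph is kept for the teacher
        teacher_logits = teacher(x)

    student_logits = student(x)
    # KL divergence between temperature-softened distributions, scaled by T^2
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()  # gradients flow only through the student
    optimizer.step()
    return loss.item()

# Toy stand-ins for the real teacher and student
torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 10))
for p in teacher.parameters():
    p.requires_grad_(False)  # extra guard: teacher weights never get gradients

# Only the student's parameters are handed to the optimizer
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
loss_value = distill_step(teacher, student, optimizer, torch.randn(4, 16))
```

Two things worth checking against a sketch like this: that the optimizer holds only student parameters, and that nothing computed from the teacher outside the `no_grad` block requires gradients.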

suflaj t1_iztjolh wrote on December 11, 2022 at 7:16 PM

Then it's strange. Unless you're using a similarly sized student model, there is no reason why a no_grad teacher plus a student should be about as resource intensive as training the teacher with backprop.

As a rule of thumb, you should use several times less memory. How much less are you using for the same batch size in your case?
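
One way to get that number, sketched under the assumption that you train on a CUDA device: reset PyTorch's peak-memory counter, run a single step, and read the peak back. The `peak_memory_mb` helper and the toy model are hypothetical; on a CPU-only machine the helper just returns None.

```python
import torch
import torch.nn as nn

def peak_memory_mb(step_fn, device):
    """Run step_fn once; return peak allocated CUDA memory in MiB, or None on CPU."""
    if device.type != "cuda":
        return None
    torch.cuda.reset_peak_memory_stats(device)
    step_fn()
    torch.cuda.synchronize(device)  # make sure all kernels have finished
    return torch.cuda.max_memory_allocated(device) / 2**20

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(32, 32).to(device)  # toy stand-in for a real model
x = torch.randn(8, 32, device=device)

def forward_no_grad():
    with torch.no_grad():
        model(x)

def forward_backward():
    model(x).sum().backward()

mem_no_grad = peak_memory_mb(forward_no_grad, device)
mem_backprop = peak_memory_mb(forward_backward, device)
```

Swapping the two toy functions for a full distillation step and a plain teacher training step, at the same batch size, gives the comparison asked about above.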