Submitted by MazenAmria t3_zhvwvl in deeplearning
suflaj t1_iztjolh wrote
Reply to comment by MazenAmria in Advices for Deep Learning Research on SWIN Transformer and Knowledge Distillation by MazenAmria
Then that's strange. Unless your student model is similarly sized, there's no reason a no_grad teacher plus a student should be as resource-intensive as a teacher trained with backprop.
As a rule of thumb, you should use several times less memory. How much less are you using for the same batch size in your case?
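For reference, here's a minimal sketch of what a distillation step with a no_grad teacher looks like; the tiny linear models and dimensions are placeholders, not the actual SWIN setup being discussed. Because the teacher's forward pass runs under `torch.no_grad()`, PyTorch doesn't store its activations for backprop, which is where the memory saving comes from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder models just to illustrate the pattern;
# in practice these would be the SWIN teacher and a smaller student.
teacher = nn.Linear(16, 4)
student = nn.Linear(16, 4)

x = torch.randn(8, 16)

# Teacher forward under no_grad: activations aren't saved,
# so the teacher costs roughly a forward pass only.
with torch.no_grad():
    teacher_logits = teacher(x)

student_logits = student(x)

# Soft-target distillation loss (temperature scaling omitted for brevity).
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
loss.backward()

# Only the student accumulates gradients; the teacher gets none.
assert all(p.grad is None for p in teacher.parameters())
assert all(p.grad is not None for p in student.parameters())
```

If the teacher is instead run with gradients enabled (or its outputs aren't detached), its activations stay in memory and the footprint approaches full teacher training, which would explain the numbers not adding up.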