Viewing a single comment thread. View all comments

suflaj t1_iztjolh wrote

Then it's strange. Unless you're using a similarly sized student model, there is no reason why a no_grad teacher and a student are similarly resource intensive as a teacher with backprop.

As a rule of the thumb, you should expend several times less memory. How much less are you expending for the same batch size in your case?

1