NichtMarlon t1_j3ikrkw wrote
Reply to [D] Have you ever used Knowledge Distillation in practice? by fredlafrite
Yes, it's very useful for text classification tasks. Large transformers get the highest accuracy, but we can't deploy them because they're too slow. So we distill knowledge from the bigger transformers into smaller transformers or CNNs. If you have a decent amount of unlabeled data to pseudo-label with the teacher, the student model loses barely any accuracy. A minimal sketch of what that looks like is below.
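For anyone curious, here's a rough sketch of the distillation step described above (assuming PyTorch and a standard Hinton-style soft-label KL loss; the teacher/student models, batch keys, and temperature value are placeholders, not anything specific from my setup):

```python
import torch
import torch.nn.functional as F

T = 2.0  # softmax temperature; higher values give softer teacher distributions

def distill_step(teacher, student, optimizer, batch):
    """One training step: teacher pseudo-labels the (unlabeled) batch,
    student is trained to match the teacher's softened distribution."""
    teacher.eval()
    with torch.no_grad():
        # Soft pseudo-labels from the large teacher model (logits assumed).
        teacher_logits = teacher(batch["input_ids"])

    student.train()
    student_logits = student(batch["input_ids"])

    # KL divergence between temperature-softened distributions,
    # scaled by T^2 as in the original distillation formulation.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

If you do have some labeled data as well, you can mix this soft-label loss with the usual cross-entropy on the hard labels, but on purely pseudo-labeled data the KL term alone works fine in my experience.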