NichtMarlon

NichtMarlon t1_j3ikrkw wrote

Yes, it's very useful for text classification tasks. Big transformers get the highest accuracy, but we can't deploy them because they are too slow. So we distil knowledge from bigger transformers into smaller transformers or CNNs. If you have a decent amount of unlabeled data to pseudo-label with the teacher, there is barely any loss in accuracy for the student model.
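To make the pseudo-labeling step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `teacher_logits` is a hypothetical keyword scorer standing in for a big transformer, the student is a tiny bag-of-words softmax classifier rather than a small transformer or CNN, and the temperature and learning rate are arbitrary. The only part that mirrors the setup described above is the flow: the teacher labels unlabeled text with soft probabilities, and the student is trained to match them.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax; temperature > 1 softens the distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_logits(text):
    # Hypothetical stand-in for a large, slow teacher model.
    words = text.split()
    pos = sum(w in {"good", "great", "love"} for w in words)
    neg = sum(w in {"bad", "awful", "hate"} for w in words)
    return [2.0 * (neg - pos), 2.0 * (pos - neg)]  # [negative, positive]

def pseudo_label(texts, temperature=2.0):
    # The teacher turns unlabeled texts into soft targets.
    return [(t, softmax(teacher_logits(t), temperature)) for t in texts]

def featurize(text, vocab):
    # Bag-of-words counts over a fixed vocabulary.
    words = text.split()
    return [float(words.count(w)) for w in vocab]

def scores(text, weights, vocab):
    x = featurize(text, vocab)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def train_student(soft_data, vocab, n_classes=2, epochs=100, lr=0.5):
    # Linear softmax student trained with SGD on cross-entropy against
    # the teacher's soft targets; the gradient w.r.t. each logit is
    # (student probability - teacher probability).
    weights = [[0.0] * len(vocab) for _ in range(n_classes)]
    for _ in range(epochs):
        for text, target in soft_data:
            x = featurize(text, vocab)
            probs = softmax(scores(text, weights, vocab))
            for c in range(n_classes):
                grad = probs[c] - target[c]
                for j, xj in enumerate(x):
                    weights[c][j] -= lr * grad * xj
    return weights

def student_predict(text, weights, vocab):
    s = scores(text, weights, vocab)
    return s.index(max(s))

# Toy unlabeled pool (made up for this sketch).
unlabeled = ["good movie", "great plot", "love it",
             "bad movie", "awful plot", "hate it"]
vocab = sorted({w for t in unlabeled for w in t.split()})
soft_data = pseudo_label(unlabeled)
weights = train_student(soft_data, vocab)
```

In a real setup the teacher call would be a batched forward pass of the large model over the unlabeled pool, done once offline, and the student would be whatever architecture is fast enough to deploy.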

4