Submitted by fredlafrite t3_106no9h in MachineLearning
There's been a ton of academic work exploring knowledge distillation, sparsity in networks, and many other compression techniques, often with vast numbers of citations. I was wondering what the status of those is in real-world ML. Have any of you used them in a concrete situation? What did you find to work best for you?
suflaj t1_j3igfzr wrote
Yes, it's the only way to get high-throughput, high-performance models ATM.
With KD and TensorRT you can get close to 100x throughput (compared to eager TF/PyTorch on the full model) with a ~1% performance hit on some models and tasks.
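For reference, here's a minimal sketch of the Hinton-style soft-target loss that "KD" usually refers to: the student is trained on a temperature-softened KL term against the teacher's logits, blended with the ordinary hard-label loss. The toy teacher/student models, temperature `T`, and mixing weight `alpha` below are illustrative assumptions, not the commenter's actual setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD: temperature-softened KL (scaled by T^2) + hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)  # T^2 keeps soft-target gradients on the same scale as CE
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: distill a larger "teacher" MLP into a smaller "student" MLP.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(64, 32)          # dummy batch
y = torch.randint(0, 10, (64,))  # dummy labels

with torch.no_grad():            # teacher stays frozen
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, y)
loss.backward()
opt.step()
```

Distillation alone only shrinks the model; the eager-vs-compiled comparison above implies the student is then deployed through TensorRT rather than served eagerly, typically by exporting it (e.g. via `torch.onnx.export`) and building a TensorRT engine from the ONNX graph.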