Submitted by fredlafrite t3_106no9h in MachineLearning
suflaj t1_j3vg5tm wrote
Reply to comment by xenotecc in [D] Have you ever used Knowledge Distillation in practice? by fredlafrite
I think it's a matter of trial and error. The best ratios I've seen were 1:25, but these concerned transformer networks, which are much sparser than ResNets.
There are some tricks, but it depends on the model. E.g., for transformers it's not enough to just imitate the last layer; matching intermediate representations helps as well. I suspect it's the same for ResNets, given they're deep residual networks just like transformers.
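As a rough sketch of what matching more than the last layer can look like (the loss weights, temperature, and the assumption that the selected student and teacher layers already have the same shape are illustrative, not from the thread):

```python
# Illustrative distillation loss: hard-label CE + soft-label KL on logits
# + MSE on a few intermediate hidden states. All names and weights here
# are assumptions for the example.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      labels, temperature=2.0, alpha=0.5, beta=0.1):
    # Standard cross-entropy against ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL between temperature-scaled distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Intermediate-layer term: imitate selected hidden states
    # (e.g. a few transformer blocks or ResNet stages), not just the output.
    hidden = sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden))
    hidden = hidden / max(len(student_hidden), 1)

    return ce + alpha * kl + beta * hidden


# Toy usage with random tensors standing in for real model outputs.
if __name__ == "__main__":
    batch, classes, width = 4, 10, 32
    s_logits, t_logits = torch.randn(batch, classes), torch.randn(batch, classes)
    s_hid = [torch.randn(batch, width) for _ in range(3)]
    t_hid = [torch.randn(batch, width) for _ in range(3)]
    labels = torch.randint(0, classes, (batch,))
    print(distillation_loss(s_logits, t_logits, s_hid, t_hid, labels))
```

In practice the student's hidden states are usually projected to the teacher's width first, and which layers to pair up is itself part of the trial and error.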
xenotecc t1_j3vp75c wrote
Thank you for the reply!