Submitted by Chemont t3_109z8om in MachineLearning
I recently came across "Confident Adaptive Language Modeling", which allows Transformers to exit early during inference and skip some model layers if a token is easy to predict. Is there any research on basically doing the opposite and allowing Transformers to spend more compute on tokens that are very hard to predict?
amrit_za t1_j418a4l wrote
It sounds like what you're considering the "opposite" is just a reframing of the original task, i.e. if a token is difficult to predict, then more layers (and therefore more compute) would be used; if it's easy, fewer layers. Am I missing something from what you're asking?
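To make that concrete, here's a rough sketch of what depth-adaptive decoding looks like in PyTorch: run the layers one at a time and exit as soon as an intermediate prediction head is confident enough, so easy tokens stop early and hard tokens go through the full stack. The layer sizes, fixed softmax threshold, and shared head are made up for illustration; CALM's actual exit criterion uses calibrated confidence measures, not a hard cutoff like this.

```python
import torch
import torch.nn as nn

class EarlyExitTransformer(nn.Module):
    """Toy depth-adaptive decoder: exit once an intermediate head is confident."""

    def __init__(self, d_model=256, n_layers=12, vocab_size=32000, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab_size)  # shared prediction head for all exits
        self.threshold = threshold                  # illustrative fixed confidence cutoff

    def forward(self, x):
        # x: (batch, seq, d_model) hidden states for the current prefix.
        # Causal masking and KV caching are omitted for brevity.
        for depth, layer in enumerate(self.layers, start=1):
            x = layer(x)
            probs = self.head(x[:, -1]).softmax(-1)        # next-token distribution
            if probs.max(-1).values.min() >= self.threshold:
                return probs.argmax(-1), depth             # easy token: exit early
        return probs.argmax(-1), depth                     # hard token: used all layers

model = EarlyExitTransformer()
tokens, depth_used = model(torch.randn(1, 16, 256))
print(depth_used)  # fewer layers when the intermediate head is already confident
```

Same mechanism either way you frame it: the "extra" compute on hard tokens is just the layers that easy tokens get to skip.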