Submitted by currentscurrents t3_10ly7rw in MachineLearning
Dec 2022 paper from Microsoft Research: https://arxiv.org/abs/2212.10559v2
>Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL remains an open problem. To better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning.
currentscurrents OP t1_j608oz5 wrote
TL;DR:
In-context learning (ICL) is the ability of language models to "learn from examples": given a few demonstration input-label pairs in the prompt, the model performs a new task without any parameter updates. These researchers are studying the mechanism behind ICL.
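For concreteness, here's what an ICL prompt looks like (a made-up sentiment example, not from the paper):

```python
# A toy few-shot prompt: the model must infer the task (sentiment
# labeling) from the two demonstration pairs alone, with no weight
# updates. Format and labels here are illustrative.
prompt = (
    "Review: The movie was fantastic. -> positive\n"
    "Review: I wasted two hours of my life. -> negative\n"
    "Review: A charming, well-acted film. ->"
)
# Fed to a pretrained LM, the expected continuation is " positive".
```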
They show that attention layers let transformers implement a gradient-descent-like optimization process at inference time: attending to the demonstration tokens acts as an implicit weight update. This implicit process produces results very similar to explicit finetuning, and the mechanism itself was learned during pretraining through ordinary gradient descent.
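The core observation is an algebraic duality: (unnormalized) linear attention over the demonstrations is exactly a weight update built from outer products, the same form as an accumulated gradient step. A minimal numpy sketch (random matrices stand in for learned projections; real transformers use softmax attention, which the paper approximates as linear):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_demo = 8, 5

# Random projections standing in for learned key/value maps.
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

X_demo = rng.normal(size=(d, n_demo))  # demonstration tokens as columns
q = rng.normal(size=d)                 # the query token

# View 1: (unnormalized) linear attention from q over the demonstrations.
K, V = W_k @ X_demo, W_v @ X_demo
attn_out = V @ (K.T @ q)               # sum_i v_i * (k_i . q)

# View 2: the same numbers as an implicit weight update applied to q.
# Delta_W is a sum of outer products v_i k_i^T -- the same algebraic
# form as a gradient update accumulated from per-example outer products.
Delta_W = V @ K.T
gd_out = Delta_W @ q

print(np.allclose(attn_out, gd_out))   # True: the two views coincide
```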
Based on this finding, they apply momentum, a technique known to improve optimizers, to transformer attention layers. This produces a small but consistent improvement on all tested tasks. They suggest there are further gains to be had by explicitly biasing transformers toward meta-optimization.
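Roughly, the idea is to carry an exponential moving average of past value vectors into the current attention output, the way SGD momentum reuses past updates. A loose sketch of that idea (my reading, not the paper's exact formulation; `beta` and `eta` are illustrative hyperparameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def momentum_attention(Q, K, V, beta=0.9, eta=0.1):
    """Causal attention plus an EMA over past value vectors.

    The EMA term plays the role momentum plays in SGD: it folds past
    'updates' (here, value vectors) into the current one.
    """
    T, d = Q.shape
    out = np.zeros_like(V)
    ema = np.zeros(d)
    for t in range(T):
        # Standard causal attention over positions <= t.
        scores = softmax(Q[t] @ K[: t + 1].T / np.sqrt(d))
        attn = scores @ V[: t + 1]
        out[t] = attn + eta * ema   # momentum term added to the output
        ema = beta * ema + V[t]     # accumulate value vectors, EMA-style
    return out

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
print(momentum_attention(Q, K, V).shape)  # (6, 4)
```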
This reminds me of meta-learning architectures that intentionally build gradient descent into inference (https://arxiv.org/abs/1909.04630) - the difference here is that LLMs learned the technique on their own during training. The implication is pretty impressive: at sufficient scale, meta-learning just emerges by itself because it's a good solution to the problem.
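For contrast, here's what the explicit version of that looks like: gradient descent run on demonstration data inside the forward pass, the thing those architectures build in by hand (a toy least-squares example; all names are illustrative):

```python
import numpy as np

def adapt_then_predict(w, X_support, y_support, x_query, lr=0.1, steps=5):
    """Explicit gradient descent inside the forward pass: adapt linear
    weights on the support (demonstration) set, then predict the query.
    This is the hand-built analogue of what the paper argues attention
    does implicitly."""
    for _ in range(steps):
        grad = X_support.T @ (X_support @ w - y_support) / len(y_support)
        w = w - lr * grad          # inner-loop update at "inference time"
    return x_query @ w

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)
X, x_q = rng.normal(size=(10, 3)), rng.normal(size=3)
y = X @ w_true
print(adapt_then_predict(np.zeros(3), X, y, x_q))  # approaches x_q @ w_true
```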
Other researchers are looking into ICL as well; here's another recent paper on the topic: https://arxiv.org/abs/2211.15661