ElectronicCress3132 t1_j629tix wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
> implement a gradient descent optimization process at inference time
Could you expand on what this means? At inference time, I thought all the weights were frozen, so how could the attention layers somehow be performing gradient descent?
Edit: I read the paper in detail and understood it (the math is walked through in Section 3). Basically, the query sentence X gets embedded and multiplied by the key, query, and value matrices in the attention layer, so it contributes its own attention terms. If you also give the model some demonstration examples X' to learn from, attention naturally produces terms for both X and X'. It turns out the contribution from X' is mathematically equivalent to the weight update you'd get from taking a gradient descent step.
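To make that concrete, here's a rough numpy sketch of the dual form the paper uses (softmax dropped, since the derivation relaxes standard attention to linear attention). The variable names and dimensions here are made up for illustration; the point is just that attention over the concatenated context [X; X'] splits exactly into a frozen "zero-shot" term plus an outer-product update ΔW contributed by the demonstrations, which is the same form a gradient-descent update on a linear layer would take:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8        # embedding dimension (arbitrary for this sketch)
n_query = 3  # tokens in the query sentence X
n_demo = 5   # tokens in the demonstration examples X'

# Random projections standing in for the trained (frozen) attention matrices.
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

X = rng.standard_normal((d, n_query))       # query-sentence embeddings
X_prime = rng.standard_normal((d, n_demo))  # demonstration embeddings
q = W_Q @ X[:, -1]                          # attention query for the last token

# Simplified (linear, un-normalized) attention over the full context [X ; X'].
full_context = np.concatenate([X, X_prime], axis=1)
attn_out = (W_V @ full_context) @ (W_K @ full_context).T @ q

# Dual form: split the same computation into a zero-shot term and an update term.
W_zsl = (W_V @ X) @ (W_K @ X).T                    # what the frozen model alone contributes
delta_W_icl = (W_V @ X_prime) @ (W_K @ X_prime).T  # sum of outer products from the demos,
                                                   # i.e. a gradient-descent-style weight update
dual_out = (W_zsl + delta_W_icl) @ q

print(np.allclose(attn_out, dual_out))  # True: attending to the demos == applying ΔW to q
```

So nothing about the frozen parameters changes at inference time; the demonstrations just add an implicit ΔW on top of them through the attention computation itself.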
Acceptable-Cress-374 t1_j62qh5g wrote
Thank you for putting it into words, I was having trouble understanding this myself.