Dec 2022 paper from Microsoft research: https://arxiv.org/abs/2212.10559v2

>Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL still remains an open problem. In order to better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning.

Comments

You must log in or register to comment.

currentscurrents OP t1_j608oz5 wrote on January 26, 2023 at 8:12 PM

#1,512,995

TL;DR:

In-context learning (ICL) is the ability of language models to "learn from example" to perform new tasks just based on prompting. These researchers are studying the mechanism behind ICL.
They show that the attention layers allow transformers to implement a gradient descent optimization process at inference time. This mechanism produces very similar results to explicit optimization through fine-tuning, but was itself learned by optimization through gradient descent.
Based on this finding they apply momentum, a technique known to improve optimizers, to transformer attention layers. This produces a small-but-consistent improvement in performance on all tested tasks. They suggest that there are more improvements to be made by explicitly biasing transformers towards meta-optimization.

This reminds me of some meta-learning architectures that try to intentionally include gradient descent as part of inference (https://arxiv.org/abs/1909.04630) - the difference here is that LLMs somehow learned this technique during training. The implication is pretty impressive: at enough scale, meta-learning just emerges by itself because it's a good solution to the problem.

Other researchers are looking into ICL as well, here's another recent paper on the topic: https://arxiv.org/abs/2211.15661

[deleted] t1_j61h1lt wrote on January 27, 2023 at 1:06 AM

#1,518,742

Replying to currentscurrents (#1,512,995)

[deleted]

lucidraisin t1_j61h7lf wrote on January 27, 2023 at 1:07 AM

#1,518,767

Replying to currentscurrents (#1,512,995)

and one more paper along same lines! https://arxiv.org/abs/2212.07677

currentscurrents OP t1_j61ndkl wrote on January 27, 2023 at 1:53 AM

#1,519,554

Replying to lucidraisin (#1,518,767)

Thanks for the link!

I think it's interesting that they spent so much time in the 90s trying to make meta-learning work, and now it appears emergently just from throwing scale at the problem.

curiousshortguy t1_j61silr wrote on January 27, 2023 at 2:32 AM

#1,520,206

Replying to currentscurrents (#1,512,995)

This is cool, thanks for sharing

ApprehensiveNature69 t1_j61wedt wrote on January 27, 2023 at 3:01 AM

#1,520,732

This is really awesome!

master3243 t1_j61wtpt wrote on January 27, 2023 at 3:05 AM

#1,520,791

This is great work in collaboration with Microsoft Research. I'll have to read more than just the abstract and quickly skimming over it.

My only slight annoyance is the word "Secretly" in the title, I just feel a better word would be "implicitly" that would also be less "clickbait'-y

rjromero t1_j61ytag wrote on January 27, 2023 at 3:20 AM

#1,521,057

This is incredible research. Finally a lead on how we might get to "true" one shot / few shot learning.

VisceralExperience t1_j61znkf wrote on January 27, 2023 at 3:26 AM

#1,521,177

The amount of blatant anthropomorphism that comes from AI researchers is so disgusting. Laymen knowledge about the state of the field is already twisted enough from reality, and the researchers are 100% to blame. Seriously, I'd like to see papers getting rejected for this delusional framing of results.

DigThatData t1_j61zv3l wrote on January 27, 2023 at 3:28 AM

#1,521,196

Replying to currentscurrents (#1,519,554)

Compute Is All You Need

robdogcronin t1_j61zvce wrote on January 27, 2023 at 3:28 AM

#1,521,197

Replying to currentscurrents (#1,519,554)

That's the bitter lesson

currentscurrents OP t1_j620shg wrote on January 27, 2023 at 3:35 AM

#1,521,317

Replying to VisceralExperience (#1,521,177)

What? "Meta-optimization" is not a very anthropomorphic term, and certainly not something laymen would understand. Their approach is technical in nature and describes the limitations of current models in explicit detail.

currentscurrents OP t1_j623hb4 wrote on January 27, 2023 at 3:57 AM

#1,521,648

Replying to robdogcronin (#1,521,197)

Yeah, but I want AI now. Not in 40 years when computers are 1000x better.

Also I'm not sure computers will be 1000x better in 40 years, Moore's law isn't what it used to be.

VisceralExperience t1_j623jjy wrote on January 27, 2023 at 3:58 AM

#1,521,653

Replying to currentscurrents (#1,521,317)

"secretly" is what I was referring to

endless_sea_of_stars t1_j627a9m wrote on January 27, 2023 at 4:30 AM

#1,522,074

Replying to DigThatData (#1,521,196)

Just rent out an AWS region for a month and you'll be good to go. Hold a couple bake sales to defray the cost.

currentscurrents OP t1_j627rd0 wrote on January 27, 2023 at 4:34 AM

#1,522,151

Replying to master3243 (#1,520,791)

Meh, transformers have been around for like 5 years and nobody figured this out until now.

I think this mostly speaks to how hard it is to figure out what neural networks are doing. Complexity is irrelevant to the training process (or any other optimization process), so the algorithms they implement are arbitrarily complex.

(or in practice, as arbitrarily complex as the model size and dataset size allow)

ElectronicCress3132 t1_j629tix wrote on January 27, 2023 at 4:53 AM

#1,522,353

Replying to currentscurrents (#1,512,995)

> implement a gradient descent optimization process at inference time

Could you expand on what this means? At inference time, I thought all weights were frozen, so how could the attention layers be somehow performing gradient descent?

Edit: I read the paper in detail and understood it (walk through the math in Section 3). Basically, the sentence itself X has some weights that go through the attention layer (recall how attention works: it embeds the sentence, then multiplies it by key, value, query matrices). If you give it some examples, X', to learn from, well, of course there are going to be weights for both X, and X'. Turns out those weights for X' end up being equivalent to stepping in gradient descent.

master3243 t1_j62aoln wrote on January 27, 2023 at 5:01 AM

#1,522,441

Replying to currentscurrents (#1,522,151)

You're right they've been around for 5 years (and the idea for attention even before that) but almost every major conference still has new papers coming out giving more insight into transformers (and sometimes algorithms/methods older than it)

I just don't want to see titles flooded with terms like "secretly" or "hidden" or "mysterious", I feel it replaces scientific terms with less scientific but more eye-catchy ones.

Again I totally understand why they would choose this phrasing, and I probably would too, but in a blog post title not a research paper title.

But once again, the actual work seems great and that's all that matters really.

currentscurrents OP t1_j62auto wrote on January 27, 2023 at 5:02 AM

#1,522,453

Replying to rjromero (#1,521,057)

Yes, but I don't want to create too much optimism; meta-learning was also a promising lead when Schmidhuber wrote his PhD thesis.

Honestly, I'm not sure much has changed since then other than we got more compute power. Transformers are reportedly equivalent to 1990s meta-learning networks except that they run better on GPUs, and GPUs have gotten powerful enough to run them at very large scale.

lookatmetype t1_j62j0t3 wrote on January 27, 2023 at 6:27 AM

#1,523,320

Replying to currentscurrents (#1,522,453)

is there anything he hasn't done?

Acceptable-Cress-374 t1_j62qh5g wrote on January 27, 2023 at 7:58 AM

#1,524,009

Replying to ElectronicCress3132 (#1,522,353)

Thank you for putting it into words, I was having trouble understanding this myself.

Acceptable-Cress-374 t1_j62ql8t wrote on January 27, 2023 at 8:00 AM

#1,524,021

Replying to lookatmetype (#1,523,320)

Stable diffusion with proper hands? :)

cthorrez t1_j63uc5a wrote on January 27, 2023 at 3:01 PM

#1,528,814

I have an issue with the experiments.

> For ICL, we fix the number of demonstration examples to 32 and tune the random seed for each task to find a set of demonstration examples that achieves the best validation performance. For finetuning, we use the same demonstration examples for ICL as the training examples and use SGD as the optimizer

They go through a set of random seeds to pick the "best" possible samples for in context learning, and then use the same set of examples for fine tuning. I think this biases the results in favor of in context learning.

A more fair way to do this would be to use a truly random set of examples, or to use use the same approach and tune the seed to find the "best" set of examples for finetuning as well.

EarthquakeBass t1_j64jhk3 wrote on January 27, 2023 at 5:41 PM

#1,532,283

Replying to currentscurrents (#1,521,648)

https://en.m.wikipedia.org/wiki/Huang%27s_law

A bit of marketing flair for sure, but I think at the crossroads of hardware improvements, ensembling, clever optimizations etc. we will keep improving models at a pretty darn fast pace. GPT-3 alone dramatically has improved the productivity of engineers, I’m sure of it.

currentscurrents OP t1_j658kmf wrote on January 27, 2023 at 8:18 PM

#1,535,946

Replying to cthorrez (#1,528,814)

Interesting. That probably explains why ICL outperformed finetuning by so much in their experiments.

Complex_Candidate_28 t1_j675z5i wrote on January 28, 2023 at 4:52 AM

#1,545,767

Replying to cthorrez (#1,528,814)

The purpose of the experiments is not to compare the performance between them. The goal is to compare the mechanisms behind them. So it doesn't affect the conclusion itself. The point is to use the same set of examples for analysis.

cthorrez t1_j67aa39 wrote on January 28, 2023 at 5:36 AM

#1,546,379

Replying to Complex_Candidate_28 (#1,545,767)

If the goal is the mechanism rather than the performance why tune the seed for performance in the first place? The examples used doesn't change the mechanism.

Complex_Candidate_28 t1_j67aytx wrote on January 28, 2023 at 5:43 AM

#1,546,462

Replying to cthorrez (#1,546,379)

Because for small-size LMs, ICL is unstable, i.e., it sometimes degrades to classifying all examples into one category. The protocol tries to ensure analyzing ICL when it works well. (For much larger-size LMs, the performance variance would be much smaller, where this step can be ignored.)

cthorrez t1_j67csjx wrote on January 28, 2023 at 6:04 AM

#1,546,737

Replying to Complex_Candidate_28 (#1,546,462)

That's an interesting topic that I think deserves further investigation. On the surface it sounds like the size of the LM impacts the mechanism by which the LM is able to "secretly perform gradient descent".

Is finetuning similarly unstable for small sized LMs?

Complex_Candidate_28 t1_j67cx4a wrote on January 28, 2023 at 6:05 AM

#1,546,766

Replying to cthorrez (#1,546,737)

Yes, the size also affects finetuning but much less sensitive.

throwaway2676 t1_j68vbfq wrote on January 28, 2023 at 4:08 PM

#1,555,205

Replying to currentscurrents (#1,521,648)

> Not in 40 years when computers are 1000x better.

It won't take anywhere near that long. We've barely scratched the surface of ASICs and analog matrix multiplication, which is where the real fun is going to begin.

throwaway2676 t1_j6d99fw wrote on January 29, 2023 at 3:06 PM

#1,581,322

Replying to currentscurrents (#1,512,995)

So shouldn't this mean we can train transformers using forward passes alone? It seems that it wouldn't be too difficult to derive an algorithm that updates the attention weights based on these results, but I don't believe the authors mention the possibility.