cthorrez t1_j63uc5a wrote on January 27, 2023 at 3:01 PM

I have an issue with the experiments.

> For ICL, we fix the number of demonstration examples to 32 and tune the random seed for each task to find a set of demonstration examples that achieves the best validation performance. For finetuning, we use the same demonstration examples for ICL as the training examples and use SGD as the optimizer

They go through a set of random seeds to pick the "best" possible samples for in-context learning, and then use the same set of examples for finetuning. I think this biases the results in favor of in-context learning.

A fairer way to do this would be to use a truly random set of examples, or to use the same approach and tune the seed to find the "best" set of examples for finetuning as well.
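To make the difference concrete, here is a rough sketch of the protocol as quoted next to the two alternatives. Everything in it is an assumption for illustration: `icl_score` and `ft_score` are toy stand-ins for "validation accuracy of each method given these 32 examples", and `sample_demos` / `best_seed` are hypothetical helpers, not anything from the paper's code.

```python
import random

def sample_demos(pool, seed, k=32):
    """Draw k demonstration examples from the pool, determined by the seed."""
    return random.Random(seed).sample(pool, k)

def best_seed(score_fn, pool, seeds, k=32):
    """Return the seed whose sampled examples score highest under score_fn."""
    return max(seeds, key=lambda s: score_fn(sample_demos(pool, s, k)))

# Toy scoring functions (assumptions). In a real run these would evaluate
# ICL or finetuning on a validation set using the given examples.
def icl_score(demos):
    return sum(demos) % 7

def ft_score(demos):
    return sum(demos) % 5

pool = list(range(1000))
seeds = range(20)

# Protocol as quoted: tune the seed for ICL, reuse those examples for finetuning.
seed_icl = best_seed(icl_score, pool, seeds)
quoted = (icl_score(sample_demos(pool, seed_icl)),
          ft_score(sample_demos(pool, seed_icl)))

# Alternative 1: a seed fixed in advance, tuned for neither method.
fixed_seed = 0
untuned = (icl_score(sample_demos(pool, fixed_seed)),
           ft_score(sample_demos(pool, fixed_seed)))

# Alternative 2: tune the seed separately for each method.
seed_ft = best_seed(ft_score, pool, seeds)
both_tuned = (icl_score(sample_demos(pool, seed_icl)),
              ft_score(sample_demos(pool, seed_ft)))
```

With the quoted protocol only ICL gets the benefit of the search over seeds, so the finetuning score in `both_tuned` can never be lower than the one in `quoted`.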

currentscurrents OP t1_j658kmf wrote on January 27, 2023 at 8:18 PM

Interesting. That probably explains why ICL outperformed finetuning by so much in their experiments.

Complex_Candidate_28 t1_j675z5i wrote on January 28, 2023 at 4:52 AM

The purpose of the experiments is not to compare the performance between them. The goal is to compare the mechanisms behind them. So it doesn't affect the conclusion itself. The point is to use the same set of examples for analysis.

cthorrez t1_j67aa39 wrote on January 28, 2023 at 5:36 AM

If the goal is the mechanism rather than the performance, why tune the seed for performance in the first place? The examples used don't change the mechanism.

Complex_Candidate_28 t1_j67aytx wrote on January 28, 2023 at 5:43 AM

Because for small-size LMs, ICL is unstable, i.e., it sometimes degrades to classifying all examples into one category. The protocol tries to ensure analyzing ICL when it works well. (For much larger-size LMs, the performance variance would be much smaller, where this step can be ignored.)
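As a rough illustration of that failure mode, a check like the following could flag a run where the predictions have collapsed into a single category. The function name `is_collapsed` and the `threshold` parameter are hypothetical, and this is only a sketch, not the filter actually used in the experiments.

```python
from collections import Counter

def is_collapsed(predictions, threshold=1.0):
    """True if the most frequent predicted label makes up at least
    `threshold` of all predictions (1.0 means every prediction is the same)."""
    if not predictions:
        return False
    top_count = Counter(predictions).most_common(1)[0][1]
    return top_count / len(predictions) >= threshold
```

For example, `is_collapsed(["pos"] * 10)` is true, while a mixed list such as `["pos", "neg"]` is not flagged at the default threshold.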

cthorrez t1_j67csjx wrote on January 28, 2023 at 6:04 AM

That's an interesting topic that I think deserves further investigation. On the surface it sounds like the size of the LM impacts the mechanism by which the LM is able to "secretly perform gradient descent".

Is finetuning similarly unstable for small LMs?

Complex_Candidate_28 t1_j67cx4a wrote on January 28, 2023 at 6:05 AM

Yes, the size also affects finetuning, but finetuning is much less sensitive to it.