Submitted by WigglyHypersurface t3_10jka1r in MachineLearning
suflaj t1_j5mnzq3 wrote
Reply to comment by WigglyHypersurface in [D] Embedding bags for LLMs by WigglyHypersurface
Why would this matter?
If such examples are present in the training set and adequately represented, the model will learn whatever it needs to learn from those words.
If they are not in the training set, you should not expect the model to understand them the same way you do.
I realize this defeats the point of generalization, but LLMs mimic generalization through exposure rather than by learning the underlying principles. These models do not analyze text the way we humans do, yet they have been shown to outperform the average human despite that.
Ultimately, to do what you are doing, you would need a tokenizer with all the syntactic knowledge for the given subset of the input language embedded within it. Wasn't AlexNet, a decade ago, enough to convince you to relegate these kinds of tasks to the DL model, which will beat a human provided it has the capacity and the data?
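For reference, here is a minimal sketch of the kind of embedding-bag input layer being discussed, using torch.nn.EmbeddingBag to pool subword-piece embeddings into one vector per word (fastText-style). The vocabulary size, dimension, and piece ids are made up for illustration, not taken from the OP's setup.

```python
import torch
import torch.nn as nn

# Hypothetical numbers, chosen only for illustration.
vocab_size, dim = 30_000, 256
bag = nn.EmbeddingBag(vocab_size, dim, mode="sum")  # sums piece embeddings per bag

# Two "words", each decomposed into subword-piece ids (made-up ids).
piece_ids = torch.tensor([17, 342, 9, 4051, 77])  # flat list of piece ids
offsets = torch.tensor([0, 3])                    # word 0 = pieces 0..2, word 1 = pieces 3..4

word_vectors = bag(piece_ids, offsets)            # shape: (2, 256)
print(word_vectors.shape)
```

Whatever structure such a bag encodes up front, the point above still stands: the downstream model has to learn the mapping from data, so the benefit depends on what is actually in the training set.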