Submitted by WigglyHypersurface t3_10jka1r in MachineLearning

One common place where LLM performance suffers is on words split apart by the model's tokenizer. I'm surprised I can't find anyone who has proposed swapping the embedding layer for an embedding-bag layer, where the bagged embedding for a token is the sum of the embeddings of its character n-grams, as in fastText word embeddings (there, this helps the model learn faster on smaller corpora and yields better representations for rare words). Has anyone come across someone who has tried this?
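Roughly what I have in mind, as a minimal PyTorch sketch (the class name, bucket count, n-gram range, and use of Python's built-in hash are placeholders, not an existing implementation):

```python
import torch
import torch.nn as nn

class CharNgramTokenEmbedding(nn.Module):
    """Token embedding as a sum of character n-gram embeddings (fastText-style)."""

    def __init__(self, num_buckets=100_000, dim=128, ngram_range=(3, 6)):
        super().__init__()
        # mode="sum" makes this a bag of n-grams: one summed vector per token
        self.bag = nn.EmbeddingBag(num_buckets, dim, mode="sum")
        self.num_buckets = num_buckets
        self.ngram_range = ngram_range

    def _ngram_ids(self, token: str):
        lo, hi = self.ngram_range
        padded = f"<{token}>"  # boundary markers, as in fastText
        grams = [padded] + [
            padded[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(padded) - n + 1)
        ]
        # hash each n-gram into a fixed number of buckets
        # (Python's built-in hash is just a stand-in here)
        return [hash(g) % self.num_buckets for g in grams]

    def forward(self, tokens):
        # tokens: list of token strings; returns a (len(tokens), dim) tensor
        ids, offsets = [], []
        for tok in tokens:
            offsets.append(len(ids))
            ids.extend(self._ngram_ids(tok))
        return self.bag(
            torch.tensor(ids, dtype=torch.long),
            torch.tensor(offsets, dtype=torch.long),
        )

emb = CharNgramTokenEmbedding()
vecs = emb(["refactor", "ed", "calm", "ed"])  # (4, 128); both "ed" rows share n-grams
```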

18

Comments


WigglyHypersurface OP t1_j5l3vlk wrote

I have; the whole point of my post is that this limits information sharing across tokens, depending on the split.

So, for example, if the tokenizer splits the -ed off the end of a rare verb like "refactored" but not off a common verb like "calmed", it splits the representation of that verbal morphology in two, when really those -ed endings serve the same function.
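You can see how a particular vocabulary handles these by inspecting the splits directly; a quick sketch with the HuggingFace tokenizer API (the actual splits depend on the tokenizer, "gpt2" is just an example vocabulary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for word in ["calmed", "refactored"]:
    # print the subword pieces the BPE vocabulary produces for each verb
    print(word, "->", tok.tokenize(word))
```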

5

WigglyHypersurface OP t1_j5l49mq wrote

Thanks, these are helpful. It seems like "embedding bag" is a term used in ML libraries but not always in papers.

Edit: from a quick look, neither of these is actually just an embedding bag; they are different approaches to incorporating subword information.

1

terath t1_j5l8t4k wrote

Oh, I see what you mean. I remember there were some character-level language models, but they fell out of favour relative to subword models, as I think the accuracy difference wasn't enough to justify the extra compute required at the character level.

Reviewing the fastText approach, they still end up hashing the character n-grams rather than training an embedding for each one. This could introduce the same sorts of inconsistencies that you're observing. That said, the final fastText embeddings are already the sum of the character n-gram embeddings, so I'm not clear on how your approach differs from just using the final fastText embeddings.
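To make the hashing trick concrete, something like this (the hash function and bucket count are illustrative and may not match fastText's exact choices); distinct n-grams share a row whenever their buckets collide:

```python
NUM_BUCKETS = 2_000_000

def bucket(ngram: str) -> int:
    # FNV-1a-style 32-bit hash, similar in spirit to what fastText does
    h = 2166136261
    for b in ngram.encode("utf-8"):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h % NUM_BUCKETS

# two unrelated n-grams may or may not land in the same bucket
print(bucket("<re"), bucket("ed>"))
```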

3

WigglyHypersurface OP t1_j5ldsn7 wrote

The reason I'm curious is that fastText embeddings tend to work better on small corpora. I'm wondering whether, if you took one of the small-data-efficient LLMs that you can train yourself on a few A100s (like ELECTRA) and changed the embeddings to a bag of character n-grams, you'd see further gains on small training sets.
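Concretely, the wiring I'm imagining would look something like this, reusing the bag module sketched in the post (the adapter class is made up; set_input_embeddings is the standard HuggingFace hook):

```python
import torch.nn as nn
from transformers import AutoTokenizer, ElectraConfig, ElectraModel

tok = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
config = ElectraConfig(vocab_size=tok.vocab_size)
model = ElectraModel(config)  # randomly initialised, to be trained from scratch

class IdToCharNgramBag(nn.Module):
    """Adapter: maps token ids back to strings, then sums char-n-gram embeddings."""

    def __init__(self, tokenizer, dim):
        super().__init__()
        self.id_to_token = {i: t for t, i in tokenizer.get_vocab().items()}
        self.bag = CharNgramTokenEmbedding(dim=dim)  # from the sketch in the post

    def forward(self, input_ids):
        # input_ids: (batch, seq_len) -> (batch, seq_len, dim)
        flat = [self.id_to_token[int(i)] for i in input_ids.reshape(-1)]
        return self.bag(flat).reshape(*input_ids.shape, -1)

# replace the word-embedding lookup; position/type embeddings stay unchanged
model.set_input_embeddings(IdToCharNgramBag(tok, config.embedding_size))
```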

1

suflaj t1_j5mnzq3 wrote

Why would this matter?

If such examples are present in the training set and adequately represented, then the model will learn whatever it needs to learn from those words.

If they are not in the training set, you should not expect the model to understand them the same way you do.

I realize this defeats the point of generalization, but LLMs learn to mimic generalization through exposure, not by actually learning to understand the underlying principles. These models do not analyze text like we humans do, but they have been shown to outperform the average human despite that.

Ultimately, to do what you're proposing, you would need a tokenizer with all the syntactic knowledge for a given subset of the input language embedded within it. Wasn't AlexNet, a decade ago, enough to convince you to always relegate these kinds of tasks to the DL model, which will always beat a human provided it has the capacity and the data?

0