Submitted by WigglyHypersurface t3_10jka1r in MachineLearning
One common place where LLM performance suffers is on words that the model's tokenizer splits apart. I'm surprised I can't find anyone who has proposed swapping the embedding layer for an embedding-bag layer, where the bagged embedding for a token is the sum of the embeddings of its character n-grams, as in fastText word embeddings (this helps the model learn faster on smaller corpora and yields better representations for rare words). Has anyone seen this tried?
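To make the idea concrete, here's a minimal sketch (not from the post) of what such a layer could look like in PyTorch, using `nn.EmbeddingBag` with sum pooling over hashed character n-grams; the bucket count, n-gram range, and hashing scheme are illustrative assumptions rather than anything fastText or the poster prescribes.

```python
import torch
import torch.nn as nn

NGRAM_BUCKETS = 100_000  # hashed n-gram vocabulary size (assumption)
EMB_DIM = 256            # embedding dimension (assumption)

def char_ngrams(token: str, n_min: int = 3, n_max: int = 5):
    """Character n-grams of a token, with boundary markers as in fastText."""
    s = f"<{token}>"
    return [s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]

def ngram_ids(token: str):
    """Hash each n-gram into a fixed number of buckets (Python's hash() is
    illustrative only; a stable hash would be used in practice)."""
    return [hash(g) % NGRAM_BUCKETS for g in char_ngrams(token)]

# EmbeddingBag sums the n-gram vectors into a single vector per token,
# so rare or split tokens still get representations built from shared pieces.
bag = nn.EmbeddingBag(NGRAM_BUCKETS, EMB_DIM, mode="sum")

tokens = ["hypersurface", "tokenizer"]
ids, offsets = [], []
for t in tokens:
    offsets.append(len(ids))
    ids.extend(ngram_ids(t))

token_vecs = bag(torch.tensor(ids), torch.tensor(offsets))
print(token_vecs.shape)  # torch.Size([2, 256]) — one summed vector per token
```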
terath t1_j5kz6tz wrote
Have you not heard of byte pair encoding? There are plenty of subword tokenizers and many language models are built on them.
Here is a quick article on them: https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0
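For illustration, and assuming the Hugging Face `transformers` library, you can see an off-the-shelf BPE tokenizer splitting an unseen word into the subword pieces it learned during training:

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE; the exact pieces depend on the learned merges.
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("hypersurface"))  # prints the subword pieces the word is split into
```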