
Additional-Escape498 t1_j9w2ix6 wrote

LLM tokenization uses wordpieces (subword units), not whole words or individual characters. This has been standard since the original "Attention Is All You Need" paper that introduced the transformer architecture in 2017. Vocabulary size is typically between 32k and 50k depending on the implementation; GPT-2 uses roughly 50k (50,257 tokens). The vocabulary includes each individual ASCII character plus commonly used combinations of characters. Documentation: https://huggingface.co/docs/transformers/tokenizer_summary

https://huggingface.co/course/chapter6/6?fw=pt
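A quick way to see this in practice, assuming you have the Hugging Face transformers package installed (a minimal sketch, not part of the docs above):

```python
from transformers import AutoTokenizer

# GPT-2's byte-level BPE tokenizer; vocabulary size is 50,257
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.vocab_size)

# Frequent words tend to map to a single token, while rarer words are
# broken into smaller subword pieces that are in the vocabulary.
print(tok.tokenize("the"))
print(tok.tokenize("antidisestablishmentarianism"))
```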

4

FpRhGf t1_j9xtnne wrote

Thanks! Well, it's better than I thought. It still doesn't fix the limitations of the outputs I listed, but at least it's more flexible than I presumed.

2

Additional-Escape498 t1_j9yorbo wrote

You’re definitely right that it can’t do those things, but I don’t think it’s because of the tokenization. The wordpieces do include individual characters, so a model could in principle do that with the wordpiece tokenization it uses. The real issue is that the things you’re asking for (like writing a story in Pig Latin) require reasoning, and LLMs are essentially mapping inputs onto a manifold. LLMs can’t really do much reasoning or logic and can’t do basic arithmetic. I wrote an article about the limitations of transformers if you’re interested: https://taboo.substack.com/p/geometric-intuition-for-why-chatgpt
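For what it's worth, here's a small sketch (again assuming the Hugging Face transformers package) showing that single characters really are in the vocabulary, so the limitation isn't at the tokenizer level:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Every single ASCII letter is itself a token in GPT-2's byte-level BPE vocab,
# so character-level output is possible in principle.
print([tok.tokenize(c) for c in "abcxyz"])  # each yields a one-element list

# A made-up Pig Latin word isn't in the vocab as a whole; it falls back
# to smaller subword pieces that are.
print(tok.tokenize("ellohay"))
```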

1