
Additional-Escape498 t1_j9w2ix6 wrote

LLM tokenization uses wordpieces (subword units), not whole words or individual characters. This has been standard since the original "Attention Is All You Need" paper that introduced the transformer architecture in 2017. Vocabulary size is typically between 32k and 50k depending on the implementation; GPT-2 uses roughly 50k (50,257 tokens). The vocabulary includes each individual ASCII character plus commonly used combinations of characters. Documentation: https://huggingface.co/docs/transformers/tokenizer_summary

https://huggingface.co/course/chapter6/6?fw=pt
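A quick way to see this in practice, assuming you have the Hugging Face transformers package installed (a minimal sketch, not part of the docs above):

```python
from transformers import AutoTokenizer

# GPT-2's byte-level BPE tokenizer; vocabulary size is 50,257
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.vocab_size)

# Frequent words tend to map to a single token, while rarer words are
# broken into smaller subword pieces that are in the vocabulary.
print(tok.tokenize("the"))
print(tok.tokenize("antidisestablishmentarianism"))
```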

4

FpRhGf t1_j9xtnne wrote

Thanks! Well, it's better than I thought. It still doesn't fix the limitations of the outputs I listed, but at least it's more flexible than I presumed.

2

Additional-Escape498 t1_j9yorbo wrote

You’re definitely right that it can’t do those things, but I don’t think it’s because of the tokenization. The wordpieces do include individual characters, so a model could in principle do that with the wordpiece tokenization it uses. The real issue is that the things you’re asking for (like writing a story in Pig Latin) require reasoning, and LLMs are essentially mapping inputs onto a manifold. LLMs can’t really do much reasoning or logic and can’t do basic arithmetic. I wrote an article about the limitations of transformers if you’re interested: https://taboo.substack.com/p/geometric-intuition-for-why-chatgpt
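For what it's worth, here's a small sketch (again assuming the Hugging Face transformers package) showing that single characters really are in the vocabulary, so the limitation isn't at the tokenizer level:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Every single ASCII letter is itself a token in GPT-2's byte-level BPE vocab,
# so character-level output is possible in principle.
print([tok.tokenize(c) for c in "abcxyz"])  # each yields a one-element list

# A made-up Pig Latin word isn't in the vocab as a whole; it falls back
# to smaller subword pieces that are.
print(tok.tokenize("ellohay"))
```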

1