Additional-Escape498 t1_j9w2ix6 wrote
Reply to comment by FpRhGf in What are the big flaws with LLMs right now? by fangfried
LLM tokenization uses subword units (wordpieces/BPE), not whole words or individual characters. This has been standard since the original “Attention Is All You Need” paper that introduced the transformer architecture in 2017. Vocabulary size is typically between 32k and 50k depending on the implementation; GPT-2 uses about 50k. These vocabularies include each individual ASCII character plus commonly used combinations of characters. Documentation: https://huggingface.co/docs/transformers/tokenizer_summary
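You can see this directly with the Hugging Face library from the docs above. A minimal sketch, assuming `transformers` is installed and the pretrained `gpt2` tokenizer can be downloaded:

```python
# Sketch: inspect GPT-2's byte-level BPE tokenizer via Hugging Face transformers.
# Assumes the package is installed and the "gpt2" tokenizer files are available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.vocab_size)                # 50257 entries for GPT-2
print(tokenizer.tokenize("tokenization"))  # splits into a few subword pieces,
                                           # e.g. ['token', 'ization'], rather
                                           # than whole words or single characters
```

Common words come back as a single token, while rarer strings get broken into smaller pieces, down to individual characters if nothing longer matches.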
FpRhGf t1_j9xtnne wrote
Thanks! Well, it's better than I thought. It still doesn't fix the limitations on the outputs I listed, but at least it's more flexible than I presumed.
Additional-Escape498 t1_j9yorbo wrote
You’re definitely right that it can’t do those things, but I don’t think it’s because of the tokenization. The wordpiece vocabulary does contain individual characters, so a model could in principle do that with the tokenization it uses. The issue is that the things you’re asking for (like writing a story in Pig Latin) require reasoning, and LLMs are just mapping inputs to a manifold. LLMs can’t really do much reasoning or logic and can’t do basic arithmetic. I wrote an article about the limitations of transformers if you’re interested: https://taboo.substack.com/p/geometric-intuition-for-why-chatgpt
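To back up the point that the vocabulary really does contain individual characters, here's a small check (a sketch under the same assumptions as above, not anything from the linked article):

```python
# Sketch: single ASCII characters appear as standalone entries in GPT-2's
# byte-level BPE vocabulary, so character-level output is representable even
# though common words map to multi-character tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()  # maps token string -> id

# Every printable ASCII letter/digit should have its own vocabulary entry.
print(all(ch in vocab for ch in "abcXYZ0123456789"))  # expected: True
```

So the limitation is in what the model does with those tokens, not in whether it can represent them.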