Submitted by madmax_br5 t3_10mbct5 in MachineLearning
float16 t1_j62agci wrote
Isn't this just the result of using certain tokenizers? Using Chinese as an example, no reasonable tokenizer developed with Chinese in mind would give you 17 tokens. You'd have maybe 6 to 8:
- 你好
- ,
- 我
- 是
- 个
- 高个子
...depending on whether it thinks 你好 and 高个子 should be split.
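For illustration, here is a minimal comparison sketch (assuming the `tiktoken` and `transformers` packages are installed; `bert-base-chinese` stands in for any vocabulary built with Chinese in mind) that counts tokens for the sentence above under each scheme:

```python
# Minimal sketch: compare token counts for the same Chinese sentence under
# OpenAI's GPT-3-era byte-level BPE versus a Chinese-aware vocabulary.
import tiktoken
from transformers import AutoTokenizer

text = "你好，我是个高个子"  # "Hello, I'm a tall person"

# Byte-level BPE: each Chinese character tends to split into several byte tokens.
gpt_bpe = tiktoken.get_encoding("r50k_base")
print("r50k_base token count:", len(gpt_bpe.encode(text)))

# A Chinese-aware vocabulary segments at roughly one token per character or word.
zh_tok = AutoTokenizer.from_pretrained("bert-base-chinese")
print("bert-base-chinese token count:", len(zh_tok.tokenize(text)))
```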
madmax_br5 OP t1_j62b2jq wrote
Yes, this is my point - the tokenizer OpenAI uses is optimized for European languages, since it is an alphabetic tokenizer designed around consonants and vowels. I'm wondering why they don't move away from BPE altogether and just increase the vocabulary size to give each symbol in each logographic language its own token. This problem must eventually be solved for multilingual models to have similar cost and capabilities across languages.
So the real question is: what is the best tokenization approach for a truly multilingual model, and why?
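For concreteness, a hedged sketch of the "just increase the vocabulary size" idea, using the Hugging Face `tokenizers` library (not anything OpenAI has published; the corpus file names are placeholders): train a BPE model with a vocabulary large enough that every character seen in the corpus, including CJK symbols, keeps its own entry.

```python
# Hypothetical sketch with the Hugging Face `tokenizers` library; file names are placeholders.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

files = ["english.txt", "chinese.txt", "japanese.txt"]  # placeholder multilingual corpus

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# A vocabulary several times larger than GPT-3's ~50k leaves room for every CJK
# character (and frequent multi-character words like 高个子) to get its own token.
trainer = trainers.BpeTrainer(vocab_size=250_000, special_tokens=["[UNK]"])
tokenizer.train(files, trainer)
tokenizer.save("multilingual-bpe.json")
```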
visarga t1_j67q45m wrote
The solution is to put more text from the other languages into the training corpus and re-train the tokenizer; it will adapt to the larger corpus by assigning more of the vocabulary to those languages.
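As a rough illustration of that suggestion (assuming the `transformers` library; the two-document corpus below is a tiny stand-in for a real multilingual dataset), one could retrain an existing tokenizer's vocabulary on a corpus with more non-English text:

```python
# Illustrative only: a real run would stream a large multilingual dataset
# weighted toward the under-served languages, not a two-item list.
from transformers import AutoTokenizer

corpus = [
    "你好，我是个高个子",
    "The quick brown fox jumps over the lazy dog.",
    # ... many more documents ...
]

old_tok = AutoTokenizer.from_pretrained("gpt2")  # English-heavy byte-level BPE
new_tok = old_tok.train_new_from_iterator(corpus, vocab_size=52_000)

sentence = "你好，我是个高个子"
print("old:", len(old_tok.tokenize(sentence)))  # many byte-level pieces
print("new:", len(new_tok.tokenize(sentence)))  # fewer, once Chinese merges are learned
```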