Submitted by madmax_br5 t3_10mbct5 in MachineLearning
madmax_br5 OP t1_j62b2jq wrote
Reply to comment by float16 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Yes, this is my point - the tokenizer OpenAI uses is optimized for European languages, since it is an alphabetic tokenizer built around consonants and vowels. I'm wondering why they don't move away from BPE altogether and simply increase the vocabulary size so that each symbol in a logographic language gets its own token. This problem will eventually have to be solved for multilingual models to offer similar cost and capabilities across languages.
So the real question is: what is the best tokenization approach for a truly multilingual model, and why?
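A quick way to see the imbalance being described is to count tokens for roughly equivalent sentences in different scripts. This is a minimal sketch, assuming the tiktoken package is installed; the sample sentences are just illustrative, not from the thread:

```python
# Sketch: compare token counts for roughly equivalent sentences
# under the GPT-2/GPT-3 BPE vocabulary (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("gpt2")

samples = {
    "English": "The weather is nice today.",
    "Chinese": "今天天气很好。",
    "Hindi": "आज मौसम अच्छा है।",
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    # Similar meaning, very different token counts per character.
    print(f"{lang}: {len(tokens)} tokens for {len(text)} characters")
```

Non-Latin scripts tend to fall back to byte-level pieces under this vocabulary, which is what drives the cost disparity.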
visarga t1_j67q45m wrote
The solution is to put more text in the other languages and re-train the tokeniser; it will adapt to the larger corpus by assigning more tokens to those languages.
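A minimal sketch of what that retraining could look like with the Hugging Face tokenizers library; the file names and vocabulary size are hypothetical placeholders, not from the thread:

```python
# Sketch: retrain a byte-level BPE tokenizer on a corpus rebalanced toward
# non-European languages, so frequent CJK/Devanagari strings earn their own merges.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=100_000,                  # placeholder: larger vocab leaves room for more scripts
    special_tokens=["<|endoftext|>"],
)

# Placeholder corpus files with more weight on the target languages.
tokenizer.train(files=["en.txt", "zh.txt", "hi.txt", "ar.txt"], trainer=trainer)
tokenizer.save("multilingual-bpe.json")
```

Because BPE merges are frequency-driven, increasing a language's share of the training corpus directly increases how many multi-character tokens that language receives.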