Luminite2 t1_j62kcmp wrote
Reply to comment by gradientpenalty in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Your tl;dr is a bit circular. English has the highest compression ratio because the tokenizer was trained to optimize compression on mostly English data. One could train a BPE-based tokenizer that compresses some other language really well but works poorly on English if that made sense for the intended application.
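To make the point concrete, here's a toy BPE sketch (the helper names `train_bpe` and `tokenize` are made up for illustration, not any real library's API): the learned merges, and therefore the compression ratio, are purely a function of whatever corpus you train on. Words common in the training data collapse into single tokens; anything else stays split near the character level.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a whitespace-split corpus."""
    # Represent each word as a tuple of symbols, with frequency counts.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

def tokenize(text, merges):
    """Tokenize text by replaying the learned merges in order."""
    tokens = []
    for w in text.split():
        word = list(w)
        for a, b in merges:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            word = out
        tokens.extend(word)
    return tokens

# Train on a tiny "English-heavy" corpus: "the" compresses to 1 token,
# while an out-of-distribution string stays at the character level.
merges = train_bpe("the the the the cat", num_merges=2)
print(tokenize("the", merges))  # → ['the']
print(tokenize("zqx", merges))  # → ['z', 'q', 'x']
```

Swap the training corpus for text in another language and the situation inverts: that language gets the short token sequences and English gets fragmented, which is exactly the "trained to optimize compression on mostly English data" point.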