Submitted by madmax_br5 t3_10mbct5 in MachineLearning
madmax_br5 OP t1_j625fr2 wrote
Reply to comment by ww3ace in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
The token counts in my example were copied directly from OpenAI's tokenizer, so even if it is not Unicode-based, it still represents logographs very inefficiently.
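For a rough sense of where the gap could come from, here is a minimal sketch. It assumes the tokenizer starts from UTF-8 bytes, as GPT-2 style byte-level BPE does. The helper name `utf8_bytes_per_char` and the sample strings are my own illustration, not anything taken from OpenAI's tool, and byte counts are only a proxy for token counts.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text` (hypothetical helper)."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Illustrative sample strings (assumptions, not from the original example).
english = "hello"
chinese = "你好"

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(chinese))  # 3.0
```

A logograph starts out as three bytes where a Latin letter is one, so unless the learned merges cover that character, it can end up costing several tokens on its own.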