Submitted by madmax_br5 t3_10mbct5 in MachineLearning
Edit: as has been explained in the comments, unicode is not the issue so much as the byte-pair encoding scheme, which artificially limits the vocabulary size of the model and leads to less common language using more tokens. I'd like to discuss the impacts of increasing the vocabulary size on transformer model computational requirements.
Many languages, like Chinese, Japanese, Korean, Hindi, and Telugu, use scripts that Unicode represents very differently from the Latin alphabet. Unfortunately, these languages are severely "punished" in GPT3 because they are expensive to tokenize. A CJK character is a single code point but takes three bytes in UTF-8, and Indic scripts like Devanagari and Telugu build each syllable out of several code points (consonants, vowel signs, virama), so a single visible character can expand into many bytes before the tokenizer ever sees it. This makes it far more expensive to prompt and generate in these languages, which is unintentionally rather racist and eurocentric.
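To make the code point vs. byte distinction concrete, here is a minimal sketch in Python (standard library only; the sample words are my own and just for illustration):

```python
# Compare how the same short word expands into Unicode code points and UTF-8 bytes.
samples = {
    "English": "Hello",
    "Chinese": "你好",
    "Hindi": "हैलो",   # the vowel signs are separate combining code points
    "Telugu": "హలో",
}

for lang, text in samples.items():
    code_points = [f"U+{ord(ch):04X}" for ch in text]
    utf8 = text.encode("utf-8")
    print(f"{lang:8s}: {len(code_points)} code points, {len(utf8)} UTF-8 bytes -> {code_points}")
```

A byte-level BPE tokenizer sees the UTF-8 bytes, so before any merges happen a Telugu word already starts out several times longer than its English counterpart, and the merges learned from mostly-English training data don't help it much.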
For example, let's take the following sentence and count the tokens used in several languages (a sketch for reproducing these counts follows the list):
Hello, I am a tall man: 7 tokens
(Chinese) 你好,我是个高个子: 17 tokens
(Japanese) こんにちは、私は背の高い男です: 21 tokens
(Hindi) हैलो, मैं एक लंबा आदमी हूँ: 41 tokens
(Korean) 안녕하세요 저는 키가 큰 남자입니다: 45 tokens
(Telugu) హలో, నేను పొడవాటి మనిషిని: 68 tokens!!!
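If you want to check these numbers yourself, here is a hedged sketch using OpenAI's tiktoken library. I'm assuming the GPT-3-era r50k_base encoding; exact counts will differ with other encodings (cl100k_base, used by newer models, handles non-Latin scripts noticeably better):

```python
# pip install tiktoken
import tiktoken

# r50k_base is the encoding used by the original GPT-3 (davinci) models;
# swap in "p50k_base" or "cl100k_base" to compare other encodings.
enc = tiktoken.get_encoding("r50k_base")

sentences = {
    "English": "Hello, I am a tall man",
    "Chinese": "你好,我是个高个子",
    "Japanese": "こんにちは、私は背の高い男です",
    "Hindi": "हैलो, मैं एक लंबा आदमी हूँ",
    "Korean": "안녕하세요 저는 키가 큰 남자입니다",
    "Telugu": "హలో, నేను పొడవాటి మనిషిని",
}

for lang, text in sentences.items():
    print(f"{lang:9s}: {len(enc.encode(text))} tokens")
```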
Yes, it's about ten times as expensive to use GPT3 for Telugu. That isn't good, especially if we want to ensure equal access to this technology globally; more than 80 million people speak the language. Beyond cost, the effective context length for these languages is much shorter in practice, making practical applications lag years behind what's possible in European languages. GPT3's context window is a few thousand tokens; divide that by the roughly 10x blowup and you're left with only a few hundred usable tokens. Imagine if you only had 400 tokens of total context to work with. That's what GPT3 with Telugu is like today.
However, this seems straightforward to fix. Unicode is merely a portability standard; it need not be the input representation for NLP models. Why not just preconvert from Unicode into a different representation with a larger vocabulary (say, 18-bit, enough for every assigned code point) and use one token per symbol, skipping the whole grapheme thing? It would seem to add negligible processing to the embedding and decoding steps, which are a small portion of overall compute compared to the attention mechanism, which IIRC represents about 95% of the compute.
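For what it's worth, here is a rough sketch of what "one token per code point" would look like, plus back-of-the-envelope arithmetic on what a code-point-sized vocabulary does to the embedding and unembedding matrices. The numbers assume GPT3's hidden size of 12288; this is my own illustration, not how any existing model actually works:

```python
# One token per Unicode code point: the "tokenizer" is just ord()/chr().
def encode(text: str) -> list[int]:
    return [ord(ch) for ch in text]

def decode(token_ids: list[int]) -> str:
    return "".join(chr(t) for t in token_ids)

print(encode("హలో"))          # 3 tokens, one per code point
print(decode(encode("హలో")))  # round-trips losslessly

# The catch: vocabulary size drives the embedding / output-projection parameter count.
d_model = 12288                # GPT3 175B hidden size
bpe_vocab = 50_257             # GPT3's BPE vocabulary
codepoint_vocab = 0x110000     # all possible Unicode code points (~1.11M)

print(f"BPE embeddings:        {bpe_vocab * d_model / 1e9:.2f}B params")
print(f"Code-point embeddings: {codepoint_vocab * d_model / 1e9:.2f}B params")
```

The embedding and softmax matrices alone would grow from roughly 0.6B to roughly 13.7B parameters, and every decoding step has to compute a softmax over the full vocabulary. English and other alphabetic languages would also get much longer sequences at one token per character. Those trade-offs, rather than Unicode itself, are the usual argument for capping the vocabulary with BPE or a similar subword scheme.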
Is there some reason why increasing the token vocabulary size and moving away from unicode within the embedding stage would be problematic?
ww3ace t1_j624na0 wrote
I don’t think any modern SOTA language model uses Unicode for tokenization.