Submitted by madmax_br5 t3_10mbct5 in MachineLearning

Edit: as has been explained in the comments, unicode is not the issue so much as the byte-pair encoding scheme, which artificially limits the vocabulary size of the model and leads to less common language using more tokens. I'd like to discuss the impacts of increasing the vocabulary size on transformer model computational requirements.

Many languages, like Chinese, Japanese Kanji, Korean, Telugu, etc use complex logograms to represent words and concepts. Unfortunately, these languages are severely "punished" in GPT3 because they are expensive to tokenize due to the way unicode represents them. Instead of unicode representing them as a single code point, logograms are typically represented as a sum of multiple graphemes, meaning that multiple unicode code points underlie their description. This makes it far more expensive to prompt and generate in these languages, which is kind of unintentionally quite racist and eurocentric.

For example, let's take the following sentence and count the tokens used in multiple langauges:

Hello, I am a tall man: 7 tokens

(Chinese) 你好,我是个高个子: 17 tokens

(Japanese) こんにちは、私は背の高い男です: 21 tokens

(Hindi) हैलो, मैं एक लंबा आदमी हूँ: 41 tokens

(Korean) 안녕하세요 저는 키가 큰 남자입니다: 45 tokens

(Telugu) హలో, నేను పొడవాటి మనిషిని: 68 tokens!!!

Yes, it's about ten times as expensive to use GPT3 for Telugu. That isn't good, especially if we want to ensure equal access to this technology globally. More than 80 million people speak this language! This also means that besides the cost, the context-length for these languages is much shorter in practice, making practical applications lag years behind what's possible on european languages. Imagine if you only had 400 tokens total context to work with. That's what GPT3 with Telugu is like today.

However, this seems straightforward to fix. Unicode is merely a portability standard, it need not be the input mechanism for NLP models. Why not just preconvert from unicode into a different representation with a larger vocabulary (such as 18-bit) and use one code point per symbol, skipping the whole grapheme thing? It would seem to add negligible processing to the embedding and decoding step, which is a very small portion of overall compute compared to the attention mechanisms, which IIRC represent about 95% of the compute.

Is there some reason why increasing the token vocabulary size and moving away from unicode within the embedding stage would be problematic?

6

Comments

You must log in or register to comment.

ww3ace t1_j624na0 wrote

I don’t think any modern SOTA language model uses Unicode for tokenization.

1

CKtalon t1_j625s3n wrote

The tokenization just saw a predominantly English corpus, so it naturally tokenised most common English words and left words from other languages in different sub word form.

They could increase the vocabulary size to something like 250000 from the current 30+k, but that would require retraining

8

gradientpenalty t1_j6278gc wrote

Its not a problem of unicode but the tokenizer method they are using BPE. I don't forsee any solution in the future cause there aren't many high paying customer

TLDR; english use the least token because it provides the highest compression ratio in bytes to token size.

9

madmax_br5 OP t1_j629re3 wrote

Right, but BPE is designed to compress alphabetic languages (multiple letters per word), whereas logographic languages are already compressed (one or more words per symbol, but more net symbols). I suppose I don't get the reason behind obsessing over efficiency at this step and why it is necessary. What is the relationship between vocabulary size and model computational requirements? If the model input is ultimately an embedding of a fixed number of dimensions, does the token vocabulary size really make much practical difference?

−3

float16 t1_j62agci wrote

Isn't this just the result of using certain tokenizers? Using Chinese as an example, no reasonable tokenizer developed with Chinese in mind would give you 17 tokens. You'd have maybe 6 to 8:

  1. 你好
  2. 高个子

...depending on whether it thinks 你好 and 高个子 should be split.

2

PassingTumbleweed t1_j62anc3 wrote

You could solve the problem you describe at the tokenization level without moving away from Unicode, which is more about how text is encoded for storage and transmission purposes.

For example let's say you still represent your text as Unicode at rest, but you have a tokenizer that budgets its vocab space s.t. the average number of tokens per sentence is the same across languages (or whatever your fairness criteria is)

2

madmax_br5 OP t1_j62anqr wrote

What would be the practical impacts of a larger vocabulary? There seems to ultimately be no way around this if you want a truly multilingual model; your vocabulary needs to be at least as large as the full set of symbols in all the languages in the corpus. But it would seem that the computational costs of this would be limited to the very beginning and very end of the model, which seems computationally insignificant compared to the attention layers that operate in vector space. In fact, doesn't a larger input vocabulary result in fewer net tokens to vectorize in the first place? If the vector space of the embedding has a fixed dimensionality (which I believe it does in the case of GPT3), then isn't each token the same mathematical size once embedded?

1

madmax_br5 OP t1_j62b2jq wrote

Yes, this is my point - the tokenizer OpenAI uses is optimized for european languages as it is an alphabetic tokenizer designed for consonants and vowels. I'm wondering why they don't move away from BPE all together and just increase the vocabulary size to give each symbol in each logographic language its own token. This problem must eventually be solved for multilingual models to have similar cost and capabilities across languages.

So the real question is what is the best tokenization approach to use for a truly multilingual model, and why?

0

madmax_br5 OP t1_j62bm6c wrote

Why not just increase the vocabulary size so that each symbol in a logographic language has one token? Logograms are already a tokenization scheme, in effect! Would love to understand more about the practical effects of a larger vocabulary on model compute requirements.

1

PassingTumbleweed t1_j62bzdk wrote

You can totally do that. There are tricks to reduce memory usage, too, such as the embedding factorization used in ALBERT.

The best part is, none of these options are precluded by Unicode. Unicode in fact has nothing to do with it!

2

CKtalon t1_j62c6t5 wrote

GPT can already model multiple languages with 30k vocabulary, just at the cost of high token count per (non-English) word. So increasing to 200k, will ease most of the burden. It won’t completely make other languages be at parity with English definitely since there’s ultimately a hard limit to that language’s corpus.

1

Luminite2 t1_j62kcmp wrote

Your tl;dr is a bit circular. English has the highest compression ratio because the tokenizer was trained to optimize compression on mostly English data. One could train a BPE-based tokenizer that compresses some other language really well but works poorly on English if that made sense for the intended application.

6

crt09 t1_j631rr5 wrote

Here's 'Language modelling with pixels'! https://openreview.net/pdf?id=FkSp8VW8RjH It gets close to BERT in English peformance. It does better in other languages but that's probably only cause BERT wasn't trained much on them afaik. but still! its apparently much more viable thought

3

suflaj t1_j63bf1q wrote

Well for starters, it would probably have worse performance due to so many redundant features, and it would be much slower.

Remember that the embedding layer carries loads of overhead, as we're talking V * d matrices. So for a corpus of 250k and embedding vector of 768, ex., we're talking about 192M parameters just for the embedding layer. Maybe you can save some space by having a sparse embedder, but find me a free implementation of sparse layers that work as well as dense ones. Other than that, the 192M parameters are, before compression techniques, equivalent to 768M. And that's just in memory, and the gradient, unless sparsified, will be 768M PER BATCH.

This is without mentioning that you would likely need to increase the embedding dim to account for the 8 times times bigger vocabulary.

2

HateRedditCantQuitit t1_j647xm6 wrote

I'm not sure how long you've been around, but before BPE came along, large vocabularies were actually quite a pain in the ass. You can find lots of literature around it before maybe 2016 (can't remember exact dates to look and I'm feeling lazy).

IIRC, a big issue was the final prediction layer. Say you're predicting a sequence 4k tokens long. Then you have 4k times vocab-size predictions. With a 50k token vocab, that's 200M predictions in memory (roughly 1 gig with floats). Lets say we want to equally compress 20x more languages, so we get 1M tokens (speaking super duper roughly), which means nearly 20GB just to represent the logits. If we wanted to handle a 40k long sequence, it's the difference between 20GB and 200GB of logits.

That said, BPE just takes in sequences of more-simple tokens. If you want to feed it unicode, go ahead. If you want to feed it something else, that will work too. It seems like you're mostly frustrated that LLM investments are focused on english right now, which is valid. Tech investments in general have a strong silicon valley bias, and a zillion people want to recreate that elsewhere. But that's a very hard economic question.

1