Submitted by [deleted] t3_xz6j0s in MachineLearning
akore654 t1_irpd2ai wrote
Reply to comment by [deleted] in [D] The role of the quantization step in VQ-VAEs by [deleted]
The 1024 is the number of latent vectors in the codebook. So the 16x16 grid would look something like [[5, 24, 16, 850, 1002, ...], ...], i.e. a 16x16 grid where each entry is one of the 1024 discrete codes, in any combination.
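Here's a minimal sketch of that quantization step in PyTorch, just to make the shapes concrete (the sizes like a 64-dim codebook vector are assumptions for illustration, not from any particular paper):

```python
import torch

# Hypothetical sizes: a codebook of 1024 vectors of dim 64,
# and an encoder output of shape (batch, 16, 16, 64).
num_codes, code_dim = 1024, 64
codebook = torch.randn(num_codes, code_dim)           # learned embedding table
z_e = torch.randn(1, 16, 16, code_dim)                # continuous encoder output

# Quantization: replace each spatial vector with its nearest codebook entry.
flat = z_e.reshape(-1, code_dim)                      # (256, 64)
dists = torch.cdist(flat, codebook)                   # (256, 1024) pairwise distances
indices = dists.argmin(dim=1)                         # (256,) discrete codes in [0, 1024)
z_q = codebook[indices].reshape(1, 16, 16, code_dim)  # quantized latents fed to the decoder

grid = indices.reshape(16, 16)                        # the 16x16 grid of code indices
print(grid[0, :5])                                    # e.g. tensor([  5,  24,  16, 850, 1002])
```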
Exactly, the codes are conditioned on each other. It's the same setup used to train GPT-3 and other autoregressive LLMs; in their case the discrete codes are the tokenized word sequences. For images, you just flatten the grid and predict the next discrete code.
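A rough sketch of that autoregressive prior (a GRU stands in for the Transformer here, and all names/sizes are illustrative assumptions, not anyone's actual implementation):

```python
import torch
import torch.nn as nn

# Flatten the 16x16 grid of code indices into a length-256 sequence and
# train the model to predict the next code, exactly like language modeling.
vocab_size, seq_len, hidden = 1024, 256, 512

embed = nn.Embedding(vocab_size, hidden)
rnn = nn.GRU(hidden, hidden, batch_first=True)       # stand-in for a Transformer decoder
head = nn.Linear(hidden, vocab_size)

codes = torch.randint(0, vocab_size, (8, seq_len))   # a batch of flattened code grids

inputs, targets = codes[:, :-1], codes[:, 1:]        # shift by one position
hidden_states, _ = rnn(embed(inputs))
logits = head(hidden_states)                         # (8, 255, 1024)

# Same objective as GPT-style training: cross-entropy on the next discrete code.
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```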
I guess that's the main intuition of this method: unify generative language modeling and image modeling as sequences of discrete codes, so that we can model both with the same methods.
[deleted] OP t1_irpyqxr wrote
[deleted]
akore654 t1_irsttia wrote
To use the language analogy: suppose you had a sequence of 100 words. Each of those words comes from a vocabulary of a certain size (~50,000 for English), so for each of the 100 positions in the sequence you can choose any of those 50,000 words.
You can see how the number of unique combinations explodes. It's the same thing for the 16x16 grid with a vocabulary of 1024 possible discrete vectors.
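If it helps to see the actual numbers, here's the back-of-the-envelope count for both cases (the figures are just the 50,000-word / 1024-code examples above):

```python
import math

# Size of each combinatorial space, expressed as base-10 digits.
text_combinations = 100 * math.log10(50_000)    # 100-word sequence, 50k-word vocabulary
image_combinations = 256 * math.log10(1024)     # 16x16 grid, 1024-code codebook

print(f"50,000^100 ~ 10^{text_combinations:.0f}")   # ~10^470
print(f"1024^256   ~ 10^{image_combinations:.0f}")  # ~10^771
```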
I'm not entirely sure what motivates it; I just know it's a fairly successful method for text generation. Hope that helps.