
chaosmosis t1_j4mjxh9 wrote

Are the 77 token embedding vectors just concatenated together as ClipText's output? Is there any structure to their ordering as processed by the Image Information Creator? Assuming a trained model, would permuting the vectors' order before passing them forward to the next subcomponent break anything?
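Here's roughly the permutation experiment I have in mind, as a minimal sketch with Hugging Face `transformers` (the checkpoint name, prompt, and the random shuffle are my own illustrative assumptions, not anything from the post):

```python
# Sketch of the permutation probe, assuming the text encoder checkpoint
# commonly paired with Stable Diffusion (openai/clip-vit-large-patch14).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

tokens = tokenizer(
    "a photograph of an astronaut riding a horse",  # placeholder prompt
    padding="max_length", max_length=77, return_tensors="pt",
)
with torch.no_grad():
    # The 77 per-token vectors handed to the next subcomponent, shape (1, 77, 768).
    emb = text_encoder(**tokens).last_hidden_state

# Shuffle the 77 positions before passing them forward.
perm = torch.randperm(emb.shape[1])
emb_permuted = emb[:, perm, :]
# emb_permuted would then be fed to the Image Information Creator in place of emb.
```

(Note the shuffle has to happen after encoding, as above; CLIP applies its own positional embeddings internally, so permuting the raw tokens instead would change the vectors themselves, not just their order.)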

General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them it was too complicated to work. I'm not sure which of my intuitions I should update now that I've seen this work despite them.

1

juniperking t1_j4mma6c wrote

>General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them it was too complicated to work. I'm not sure which of my intuitions I should update now that I've seen this work despite them.

Probably the most important thing that makes model configurations like this work is that they're very large and generalizable. A lot of prior research focuses on fine-tuning for a specific task or dataset, but the fact that CLIP (for example) learns generalized text + image embeddings across multiple domains is what makes the downstream training work.
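To make "generalized text + image embeddings" concrete, here's a minimal sketch with an off-the-shelf CLIP checkpoint (the checkpoint name, image path, and captions are placeholders I picked, not from the post):

```python
# CLIP scoring captions against an image via a single shared embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any image file
inputs = processor(
    text=["a photo of a cat", "a diagram of a neural network"],
    images=image, return_tensors="pt", padding=True,
)
with torch.no_grad():
    out = model(**inputs)

# Text and image land in one shared space, so relevance is just a similarity
# score; this shared space is what downstream models condition on.
print(out.logits_per_image.softmax(dim=-1))
```

The fact that the same embedding space works across domains, rather than one narrow task, is what lets a separately trained generator consume it without everything falling apart.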

3