
chaosmosis t1_j4mjxh9 wrote

Are the 77 token embedding vectors just concatenated together as ClipText's output? Is there any structure to their ordering as processed by the Image Information Creator? Assuming a trained model, would permuting the vectors' order before passing them forward to the next subcomponent break anything?
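Here's roughly the permutation experiment I have in mind, as a minimal sketch with Hugging Face `transformers` (the checkpoint name, prompt, and the random shuffle are my own illustrative assumptions, not anything from the post):

```python
# Sketch of the permutation probe, assuming the text encoder checkpoint
# commonly paired with Stable Diffusion (openai/clip-vit-large-patch14).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

tokens = tokenizer(
    "a photograph of an astronaut riding a horse",  # placeholder prompt
    padding="max_length", max_length=77, return_tensors="pt",
)
with torch.no_grad():
    # The 77 per-token vectors handed to the next subcomponent, shape (1, 77, 768).
    emb = text_encoder(**tokens).last_hidden_state

# Shuffle the 77 positions before passing them forward.
perm = torch.randperm(emb.shape[1])
emb_permuted = emb[:, perm, :]
# emb_permuted would then be fed to the Image Information Creator in place of emb.
```

(Note the shuffle has to happen after encoding, as above; CLIP applies its own positional embeddings internally, so permuting the raw tokens instead would change the vectors themselves, not just their order.)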

General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them it was too complicated to work. I'm not sure which of my intuitions I should update now that I've seen this work despite them.

1

juniperking t1_j4mma6c wrote

>General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them it was too complicated to work. I'm not sure which of my intuitions I should update now that I've seen this work despite them.

Probably the most important thing that makes model configurations like this work is that they're very large and generalizable. A lot of prior research focuses on fine-tuning for a specific task or dataset, but the fact that CLIP (for example) learns generalized text + image embeddings across multiple domains is what makes the downstream training work.
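To make "generalized text + image embeddings" concrete, here's a minimal sketch with an off-the-shelf CLIP checkpoint (the checkpoint name, image path, and captions are placeholders I picked, not from the post):

```python
# CLIP scoring captions against an image via a single shared embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any image file
inputs = processor(
    text=["a photo of a cat", "a diagram of a neural network"],
    images=image, return_tensors="pt", padding=True,
)
with torch.no_grad():
    out = model(**inputs)

# Text and image land in one shared space, so relevance is just a similarity
# score; this shared space is what downstream models condition on.
print(out.logits_per_image.softmax(dim=-1))
```

The fact that the same embedding space works across domains, rather than one narrow task, is what lets a separately trained generator consume it without everything falling apart.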

3