PassingTumbleweed t1_j7lt1o5 wrote

Any LM with multimodal input? PaLI?

2

_Arsenie_Boca_ OP t1_j7miglb wrote

Thanks, good pointer. I am particularly interested in the different mechanisms by which external embeddings can be integrated into LMs. In PaLI and SimVLM, for example, the external embeddings (here, image encodings) are simply treated as additional token embeddings. Other approaches use modified attention mechanisms to potentially make better use of the information. Are you aware of any work that directly compares multiple integration mechanisms?
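To make that concrete, here is a rough PyTorch sketch of the "embeddings as tokens" route as I understand it. The names (PrefixFusion, image_encoder, proj, etc.) are my own placeholders rather than PaLI's actual modules, and I'm assuming a HuggingFace-style LM that exposes get_input_embeddings() and accepts inputs_embeds:

```python
import torch
import torch.nn as nn

class PrefixFusion(nn.Module):
    """Treat projected image encodings as extra token embeddings (a prefix)."""

    def __init__(self, image_encoder, lm, image_dim, lm_dim):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a ViT returning (B, N_img, image_dim)
        self.proj = nn.Linear(image_dim, lm_dim)  # map image features into the LM's embedding space
        self.lm = lm                              # any decoder or encoder-decoder LM

    def forward(self, images, input_ids):
        img_feats = self.image_encoder(images)                   # (B, N_img, image_dim)
        img_tokens = self.proj(img_feats)                        # (B, N_img, lm_dim)
        txt_tokens = self.lm.get_input_embeddings()(input_ids)   # (B, N_txt, lm_dim)
        # Concatenate along the sequence dimension; the LM's self-attention
        # then attends over image and text "tokens" uniformly.
        fused = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.lm(inputs_embeds=fused)
```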

1

PassingTumbleweed t1_j7mlwls wrote

I'm not aware of any comparison. Maybe it doesn't matter that much?

PaLI feeds embeddings from the Vision Transformer into the LM after a linear projection layer. It allows backpropagation through the ViT's weights, so the image encoding can be learned for the task. The ability to tune the embeddings in an end-to-end fashion might be an important consideration.
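Roughly, the end-to-end vs. frozen-encoder choice comes down to which parameters receive gradients. A sketch, continuing the hypothetical PrefixFusion module from the comment above (vit and lm stand in for whatever encoder/LM you use, and the dimensions are arbitrary):

```python
# Continuing the PrefixFusion sketch above (vit / lm are placeholders).
model = PrefixFusion(vit, lm, image_dim=768, lm_dim=1024)

# End-to-end: the ViT is part of model.parameters(), so the LM loss
# backpropagates through the projection *and* the image encoder.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Frozen alternative: stop gradients at the vision tower, so only the
# projection (and the LM) adapt to the task.
for p in model.image_encoder.parameters():
    p.requires_grad = False
```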

3

_Arsenie_Boca_ OP t1_j7ommq8 wrote

Yes, seamless joint training is definitely one of the perks. I'll keep looking to see whether I can find anything on the effectiveness of different injection/fusion mechanisms.

1