Submitted by _Arsenie_Boca_ t3_10vwm8k in MachineLearning

I am looking for papers that inject information into LMs directly using embeddings (without formatting the information as text). I find it notoriously hard to search for these papers because they could come from many different domains, so I thought asking here might be a good way to reach people across those domains.

Some examples I already found are from the domain of knowledge-graph-augmented LMs: ERNIE (https://arxiv.org/abs/1904.09223) and K-BERT (https://arxiv.org/abs/1909.07606).

Prefix Tuning / Prompt Tuning are also somewhat similar to the idea, but they don't depend on any external information.

Can you think of other papers that inject additional information into LMs via embeddings?

8

Comments


_Arsenie_Boca_ OP t1_j7miglb wrote

Thanks, good pointer. I am particularly interested in the different mechanisms by which the embeddings can be integrated into LMs. E.g., in PaLI and SimVLM, the external embeddings (here, image encodings) are simply treated as token embeddings, while others use modified attention mechanisms to potentially make better use of the information. Are you aware of any work that directly compares multiple integration mechanisms?
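For concreteness, a minimal PyTorch sketch of that "treat external embeddings as token embeddings" integration: the external vectors are projected into the LM's embedding space and simply prepended to the token sequence. All module names, dimensions, and the encoder-only architecture are illustrative placeholders, not details taken from PaLI or SimVLM.

```python
import torch
import torch.nn as nn

class PrefixInjectionLM(nn.Module):
    """Toy LM that injects external embeddings as extra 'tokens' (illustrative only)."""

    def __init__(self, vocab_size=32000, d_model=512, d_external=768, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Map external embeddings (e.g. image encodings) into the LM's embedding space.
        self.proj = nn.Linear(d_external, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, external_emb):
        # token_ids: (batch, seq_len); external_emb: (batch, n_ext, d_external)
        ext = self.proj(external_emb)               # (batch, n_ext, d_model)
        tok = self.tok_emb(token_ids)               # (batch, seq_len, d_model)
        # The injected embeddings are simply prepended, i.e. treated as extra token embeddings.
        h = self.encoder(torch.cat([ext, tok], dim=1))
        return self.lm_head(h[:, ext.size(1):])     # logits for the text positions only
```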

1

PassingTumbleweed t1_j7mlwls wrote

I'm not aware of any comparison. Maybe it doesn't matter that much?

PaLI feeds embeddings from the Vision Transformer into the LM after a linear projection layer. This allows backpropagation through the ViT's weights, so the image encoding can be learned for the task. The ability to tune the embeddings in an end-to-end fashion might be an important consideration.
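A toy sketch of that end-to-end point, with stand-in modules for the ViT and the LM (everything here is illustrative, not PaLI's actual architecture): because the projection sits between the two, `loss.backward()` sends gradients into the vision encoder's weights as well.

```python
import torch

vit = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 768))  # stand-in for a ViT
proj = torch.nn.Linear(768, 512)                                                  # projection into LM space
lm_head = torch.nn.Linear(512, 32000)                                             # stand-in for the LM

optimizer = torch.optim.AdamW(
    list(vit.parameters()) + list(proj.parameters()) + list(lm_head.parameters()), lr=1e-4
)

images = torch.randn(8, 3, 32, 32)
targets = torch.randint(0, 32000, (8,))

optimizer.zero_grad()
logits = lm_head(proj(vit(images)))   # image -> embedding -> LM space -> logits
loss = torch.nn.functional.cross_entropy(logits, targets)
loss.backward()                       # gradients reach the ViT stand-in through the projection
optimizer.step()
```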

3

edunuke t1_j7nyh34 wrote

I found this one under the keyword "embedding fusion" for LLMs:

https://arxiv.org/abs/2101.12294

It provides an overview of many methods.

And, as others said, anything on multimodal fusion transformers.

2

dancingnightly t1_j7s355b wrote

In a sense, you can communicate between semantic text embeddings and LMs through this method (it would operate differently from multimodal embeddings): https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight

This method, which right now is really only practical for toy problems, would allow you to use semantic embeddings to find what to look for when doing SVD on an (autoregressive) LM. You could make this dependent on the input, for example by transforming your embedding into the keys that the abduction is applied with in that process, thereby influencing the generated logits. I'm not sure this would behave much differently from altering the logit_bias of tokens, but it would be interesting to hear whether it does.
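For reference, a rough, purely illustrative sketch of the logit_bias comparison point: bias decoding toward tokens whose embeddings are close to an external semantic embedding, without touching the model's weights. All shapes and names below are made up for the example.

```python
import torch

def biased_logits(logits, token_embeddings, query_emb, strength=2.0, top_k=50):
    # logits: (vocab,); token_embeddings: (vocab, d); query_emb: (d,)
    sims = torch.nn.functional.cosine_similarity(token_embeddings, query_emb.unsqueeze(0), dim=-1)
    top = sims.topk(top_k).indices
    biased = logits.clone()
    biased[top] += strength * sims[top]   # nudge semantically related tokens upward
    return biased

vocab, d = 1000, 64
logits = torch.randn(vocab)
tok_emb = torch.randn(vocab, d)
query = torch.randn(d)
print(biased_logits(logits, tok_emb, query).shape)   # torch.Size([1000])
```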

2