I am looking for papers that inject information into LMs directly using embeddings (without formatting the information as text). These papers are hard to search for because they come from many different domains, so I thought asking here might be a good way to reach people across those domains.

Some examples I have already found come from the domain of knowledge-graph-augmented LMs: ERNIE (https://arxiv.org/abs/1904.09223) and K-BERT (https://arxiv.org/abs/1909.07606).

Prefix tuning and prompt tuning are also somewhat similar to the idea, but they don't depend on any external information.

Can you think of other papers that inject additional information into LMs via embeddings?

Comments

wittfm t1_j7jvhoc wrote on February 7, 2023 at 9:03 AM

Maybe this can help: https://www.youtube.com/live/FKsARHV3ZTI (they mention the SetFit method, which seems similar to what you are looking for).

wittfm t1_j7jvkoo wrote on February 7, 2023 at 9:05 AM

They mention it as an alternative to prompt engineering.

_Arsenie_Boca_ OP t1_j7jxkxr wrote on February 7, 2023 at 9:34 AM

Thanks for the answer, but I'm afraid the idea there is quite different. They take embeddings from LMs and finetune them, rather than aligning and injecting external embeddings.

PassingTumbleweed t1_j7lt1o5 wrote on February 7, 2023 at 6:57 PM

Any LM with multimodal input? PaLI?

_Arsenie_Boca_ OP t1_j7miglb wrote on February 7, 2023 at 9:39 PM

Thanks, good pointer. I am particularly interested in the different mechanisms by which the embeddings can be integrated into LMs. E.g. in PaLI and SimVLM, the external embeddings (here image encodings) are simply treated as token embeddings. Others use modified attention mechanisms to potentially make better use of the information. Are you aware of any work that directly compares multiple integration mechanisms?
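To make the two families of mechanisms concrete, here is a minimal sketch in plain Python. It is not taken from PaLI, SimVLM, or any other paper: the function names are hypothetical, the cross-attention has a single head and no learned projections, and the external embeddings are assumed to already have the same dimension as the token embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def prepend_as_tokens(external, tokens):
    """Mechanism 1: treat external embeddings as extra input tokens.

    The LM's ordinary self-attention then mixes them with the text.
    """
    return external + tokens  # concatenate along the sequence axis

def cross_attend(tokens, external):
    """Mechanism 2 (simplified): each token attends over the external
    embeddings and the attended context is added back residually.
    Hypothetical single-head version without learned Q/K/V projections.
    """
    scale = math.sqrt(len(tokens[0]))
    fused = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in external])
        context = [
            sum(w * v[i] for w, v in zip(weights, external))
            for i in range(len(q))
        ]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused

tokens = [[1.0, 0.0], [0.0, 1.0]]   # two text-token embeddings
external = [[0.5, 0.5]]             # one external embedding

seq = prepend_as_tokens(external, tokens)   # sequence length grows to 3
fused = cross_attend(tokens, external)      # sequence length stays 2
```

The practical difference the sketch highlights: the first mechanism lengthens the input sequence and needs no architectural change, while the second keeps the sequence length fixed but adds an attention path to the model.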

PassingTumbleweed t1_j7mlwls wrote on February 7, 2023 at 10:01 PM

I'm not aware of any comparison. Maybe it doesn't matter that much?

PaLI feeds embeddings from the Vision Transformer to the LM after a linear projection layer. It allows backpropagation through the ViT's weights so that the image encoding can be learned for the task. The ability to tune the embeddings in an end-to-end fashion might be an important consideration.
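As a rough illustration of the projection step described here (not PaLI's actual code), the sketch below maps encoder outputs of one dimension into the LM's embedding dimension with a linear layer and places them in front of the text embeddings. The names `project` and `build_lm_input`, the toy weights, and the ordering of image before text are all assumptions made for the example. In a real framework the projection weights, and optionally the encoder's, would receive gradients from the LM loss; that part is not shown.

```python
def project(vec, weight, bias):
    """Linear layer y = W x + b, with W given as a list of rows
    of shape (d_model, d_encoder)."""
    return [
        sum(w * x for w, x in zip(row, vec)) + b
        for row, b in zip(weight, bias)
    ]

def build_lm_input(encoder_outputs, token_embeddings, weight, bias):
    """Project each encoder output into the LM's embedding space and
    place the results in front of the text-token embeddings."""
    projected = [project(v, weight, bias) for v in encoder_outputs]
    return projected + token_embeddings

# Toy sizes: encoder dimension 3, LM embedding dimension 2.
weight = [[1.0, 0.0, 1.0],
          [0.0, 2.0, 0.0]]
bias = [0.0, 0.5]

encoder_outputs = [[1.0, 2.0, 3.0]]            # e.g. one image patch
token_embeddings = [[0.1, 0.2], [0.3, 0.4]]    # two text tokens

lm_input = build_lm_input(encoder_outputs, token_embeddings, weight, bias)
# lm_input[0] is the projected patch; the remaining rows are the text tokens.
```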

_Arsenie_Boca_ OP t1_j7ommq8 wrote on February 8, 2023 at 8:17 AM

Yes, seamless joint training is definitely one of the perks. I will keep looking to see whether I can find anything about the effectiveness of different injection/fusion mechanisms.

edunuke t1_j7nyh34 wrote on February 8, 2023 at 3:57 AM

I found this one by searching for the keyword "embedding fusion" together with LLMs:

https://arxiv.org/abs/2101.12294

It provides an overview of many methods.

And, as others said, anything on multimodal fusion transformers.

dancingnightly t1_j7s355b wrote on February 9, 2023 at 12:25 AM

In a sense, you can communicate between semantic text embeddings and LMs through this method (it would operate differently from multimodal embeddings): https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight

This method, which right now is really only practical for toy problems, would allow you to use semantic embeddings to find what to look for when doing SVD on an (autoregressive) LM. You could make this depend on the input, for example by transforming your embedding into the keys to apply the abduction with in that process, and so affect the generation of logits. I'm not sure this would behave much differently from altering the logit_bias of tokens, but it would be interesting to hear if it did.
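For comparison, the logit_bias baseline mentioned here amounts to adding a fixed offset to selected token logits before sampling. The sketch below is a generic illustration with a hypothetical helper `apply_logit_bias` and a toy three-token vocabulary; it does not call any particular API.

```python
import math

def apply_logit_bias(logits, bias):
    """Add per-token offsets to raw logits.

    `bias` maps a token id to the offset for that token; tokens that
    are not listed are left unchanged.
    """
    return [logit + bias.get(i, 0.0) for i, logit in enumerate(logits)]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                      # toy vocabulary of 3 tokens
biased = apply_logit_bias(logits, {2: 5.0})   # push token 2 upwards
probs = softmax(biased)                       # token 2 is now most likely
```

Unlike an input-dependent intervention on the weights, this offset is the same at every position and ignores context, which is one way the two approaches could in principle differ.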

CatalyzeX_code_bot t1_j7jsmy6 wrote on February 7, 2023 at 8:23 AM

Found relevant code at https://github.com/lonePatient/ERNIE-text-classification-pytorch

Found relevant code at https://github.com/autoliuweijie/K-BERT