Submitted by soraki_soladead t3_zmoxp7 in MachineLearning
I remember reading a paper a while back that showed early attention layers in a transformer could be replaced with a simpler mechanism since most heads only modeled small n-grams. I think they used some kind of pooling?
Wondering if anyone knows which paper that was, and whether they've had any thoughts about it since. Thanks!
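
To clarify what I mean, here's a rough PyTorch sketch (entirely my own illustration, not taken from the paper; the module name, window size, and output projection are just assumptions): the idea is to swap an early self-attention block for causal average pooling over a small fixed window, so each position only mixes local n-gram-style context.

```python
import torch
import torch.nn as nn

class LocalPoolingMixer(nn.Module):
    """Stand-in for an early attention layer: each position averages over a
    small causal window (n-gram-style context) instead of full attention."""

    def __init__(self, d_model: int, window: int = 3):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(d_model, d_model)  # mimics attention's output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # Pad on the left so position t only sees positions t-window+1 .. t.
        x_t = x.transpose(1, 2)                              # (batch, d_model, seq_len)
        x_t = nn.functional.pad(x_t, (self.window - 1, 0))
        pooled = nn.functional.avg_pool1d(x_t, kernel_size=self.window, stride=1)
        return self.proj(pooled.transpose(1, 2))             # (batch, seq_len, d_model)

# Usage: drop-in for a token-mixing sublayer, output shape matches input.
x = torch.randn(2, 16, 64)
out = LocalPoolingMixer(d_model=64)(x)
```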
Rabrg t1_j0ck8uj wrote
A Mathematical Framework for Transformer Circuits