Submitted by soraki_soladead t3_zmoxp7 in MachineLearning
I remember reading a paper a while back that showed early attention layers in a transformer could be replaced with a simpler mechanism since most heads only modeled small n-grams. I think they used some kind of pooling?
Wondering if anyone knows which paper that was, and whether they've had any thoughts about it since. Thanks!
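
To clarify what I mean, here's a rough PyTorch sketch (entirely my own illustration, not taken from the paper; the module name, window size, and output projection are just assumptions): the idea is to swap an early self-attention block for causal average pooling over a small fixed window, so each position only mixes local n-gram-style context.

```python
import torch
import torch.nn as nn

class LocalPoolingMixer(nn.Module):
    """Stand-in for an early attention layer: each position averages over a
    small causal window (n-gram-style context) instead of full attention."""

    def __init__(self, d_model: int, window: int = 3):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(d_model, d_model)  # mimics attention's output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # Pad on the left so position t only sees positions t-window+1 .. t.
        x_t = x.transpose(1, 2)                              # (batch, d_model, seq_len)
        x_t = nn.functional.pad(x_t, (self.window - 1, 0))
        pooled = nn.functional.avg_pool1d(x_t, kernel_size=self.window, stride=1)
        return self.proj(pooled.transpose(1, 2))             # (batch, seq_len, d_model)

# Usage: drop-in for a token-mixing sublayer, output shape matches input.
x = torch.randn(2, 16, 64)
out = LocalPoolingMixer(d_model=64)(x)
```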
Rabrg t1_j0ck8uj wrote
A Mathematical Framework for Transformer Circuits