[R] Wq can be omitted in single-head attention Submitted by wangyi_fudan t3_y2w87i on October 13, 2022 at 11:27 AM in MachineLearning 7 comments 17
mrfox321 t1_is7pudf wrote on October 13, 2022 at 10:16 PM Sure, but keeping a separate W_q allows for a low-rank representation of W := W_k @ W_q^T Permalink 5
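A minimal sketch of the point being made, assuming standard single-head dot-product attention; the shapes (d_model, d_head) and variable names here are illustrative, not from the post. It checks that the attention logits computed with separate W_q and W_k match those from the fused matrix W = W_q @ W_k^T, and that the fused matrix has rank at most d_head:

```python
import torch

# Assumed illustrative shapes: d_head < d_model makes the low-rank point visible.
d_model, d_head, n = 64, 16, 10
x = torch.randn(n, d_model)
W_q = torch.randn(d_model, d_head)
W_k = torch.randn(d_model, d_head)

# Usual single-head attention logits: Q @ K^T with Q = x W_q, K = x W_k.
logits_two_mats = (x @ W_q) @ (x @ W_k).T

# Fused single matrix W = W_q @ W_k^T gives identical logits...
W = W_q @ W_k.T
logits_one_mat = x @ W @ x.T

print(torch.allclose(logits_two_mats, logits_one_mat, atol=1e-4))  # True
# ...but W is constrained to rank <= d_head (16), not full rank d_model (64),
# which is the low-rank representation the separate W_q/W_k factorization buys.
print(torch.linalg.matrix_rank(W))
```

So omitting W_q is algebraically possible, but a single unconstrained W of shape (d_model, d_model) has more parameters than the factored W_q, W_k pair and loses the built-in rank bottleneck.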