[R] Wq can be omitted in single-head attention
Submitted by wangyi_fudan (t3_y2w87i) on October 13, 2022 at 11:27 AM in MachineLearning · 7 comments
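(Not from the thread, just a minimal sketch of the claim in the title: in single-head self-attention the score matrix is (X W_q)(X W_k)^T = X (W_q W_k^T) X^T, so the two projections can be fused into one matrix W = W_q W_k^T. Names and dimensions below are arbitrary assumptions for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                       # sequence length, model dimension (arbitrary)
X = rng.standard_normal((n, d))   # token representations
Wq = rng.standard_normal((d, d))  # query projection
Wk = rng.standard_normal((d, d))  # key projection

scores_two_mats = (X @ Wq) @ (X @ Wk).T   # standard two-projection formulation
W = Wq @ Wk.T                             # fused projection
scores_one_mat = X @ W @ X.T              # equivalent single-matrix form

assert np.allclose(scores_two_mats, scores_one_mat)
```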
Reasonable_Boss2750 t1_is97cn3 wrote on October 14, 2022 at 5:38 AM
A possible reason the author uses attention with both Wq and Wk is to fuse information from the encoder and the decoder. In that case the formula is (X_en W_q)(X_de W_k)^T.
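(A minimal sketch, not from the comment, of the cross-attention scores it refers to: queries and keys come from different sequences, the encoder output X_en and the decoder states X_de, so the two projections act on distinct representations. Shapes are arbitrary assumptions.)

```python
import numpy as np

rng = np.random.default_rng(1)
n_en, n_de, d = 6, 4, 8                  # encoder length, decoder length, model dim (arbitrary)
X_en = rng.standard_normal((n_en, d))    # encoder hidden states
X_de = rng.standard_normal((n_de, d))    # decoder hidden states
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))

# (X_en Wq)(X_de Wk)^T, following the comment's formula
scores = (X_en @ Wq) @ (X_de @ Wk).T     # shape (n_en, n_de)
```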