
Reasonable_Boss2750 t1_is97cn3 wrote

A possible reason why the author uses attention with W_q and W_k is to fuse information from both the encoder and the decoder. In that case the formula is (X_enc W_q)(X_dec W_k)^T.
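
In case it helps, here is a minimal NumPy sketch of that score computation. It follows the formula as written above (queries from the encoder, keys from the decoder); the names X_enc, X_dec, W_q, W_k and all the dimensions are illustrative placeholders, not anything from the paper:

```python
import numpy as np

d_model, d_k = 64, 32
rng = np.random.default_rng(0)

X_enc = rng.standard_normal((10, d_model))  # 10 encoder positions (placeholder)
X_dec = rng.standard_normal((7, d_model))   # 7 decoder positions (placeholder)
W_q = rng.standard_normal((d_model, d_k))   # query projection
W_k = rng.standard_normal((d_model, d_k))   # key projection

# (X_enc W_q)(X_dec W_k)^T: each encoder position scores every decoder position,
# fusing information from both sides in a single attention matrix.
scores = (X_enc @ W_q) @ (X_dec @ W_k).T    # shape (10, 7)

# Softmax over the decoder axis turns scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.shape)  # (10, 7)
```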

1