thanks for the response. my main concern with beam sampling and backprop is that the context for the 2nd token includes the sampled 1st token. I believe in the RNN case this wouldn't necessarily matter, since only the hidden state is propagated forward. In transformers, we have to redo the whole forward pass for the 2nd token onwards, and these subsequent forward passes don't seem to share anything with each other, so I'm a bit confused about how exactly the gradients will flow.
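to make the setup concrete, here's a toy sketch of what I mean. it's pure Python and everything in it is made up for illustration: the embedding table `EMB`, the single weight `w`, the helper names, and a greedy pick standing in for real beam sampling. it reruns the "forward pass" over the whole context at each step and compares a finite-difference slope of the sequence log-prob with the tokens picked freely vs. held fixed.

```python
import math

# Hypothetical toy setup: 3-token vocabulary, fixed embeddings, one weight w.
EMB = [0.5, -1.0, 2.0]
BOS = 0  # token 0 doubles as the start token in this toy


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def step_logits(w, context):
    # Full "forward pass" over the whole context, redone at every step.
    # The context is a list of discrete token ids, not a hidden state.
    h = sum(EMB[t] for t in context)
    return [w * (h + EMB[v]) for v in range(len(EMB))]


def decode_score(w, forced=None, steps=2):
    # Returns (sum of log-probs of the chosen tokens, chosen tokens).
    # Greedy argmax is used here as a stand-in for beam sampling.
    context, score = [BOS], 0.0
    for i in range(steps):
        lp = log_softmax(step_logits(w, context))
        t = forced[i] if forced is not None else max(range(len(lp)), key=lp.__getitem__)
        score += lp[t]
        context.append(t)  # the next pass only sees the integer id t
    return score, tuple(context[1:])


def slope(f, w, eps=1e-5):
    # Central finite difference, used instead of an autograd library.
    return (f(w + eps) - f(w - eps)) / (2 * eps)


w = 1.0
_, tokens = decode_score(w)
free = slope(lambda x: decode_score(x)[0], w)
fixed = slope(lambda x: decode_score(x, forced=tokens)[0], w)
print(tokens, free, fixed)
```

in this toy the picked token enters the 2nd pass only as an integer index, so a small change in `w` doesn't change it, and the two slopes come out the same. what I'm unsure about is whether that's the right way to think about the real transformer case.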
please let me know if I wasn't clear in explaining my problem. thanks again for your response :)
SaltyStackSmasher OP t1_jal0r2l wrote
Reply to comment by CMUOresama in [D] backprop through beam sampling ? by SaltyStackSmasher
thanks a lot for this. will definitely take a look