Submitted by SaltyStackSmasher t3_11euzja in MachineLearning
cnapun t1_jai24sf wrote
Reply to comment by SaltyStackSmasher in [D] backprop through beam sampling ? by SaltyStackSmasher
What I was trying to say was that this sampling approach (in a transformer) seems like it would have similar issues to an RNN: your computational graph gets repeated N times, where N is the rollout size. That makes me suspect you'll get a lot of noise in your gradient estimates when N is large (also, iirc, Gumbel-softmax gradients are biased, which might cause more issues when chaining them).
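To make the "graph repeated N times" point concrete, here's a minimal NumPy sketch of chaining Gumbel-softmax samples through a rollout. All names (`gumbel_softmax`, the stand-in decoder matrix `W`) are hypothetical, and there's no autodiff here — it just shows the forward structure: each step's relaxed sample feeds the next step's logits, so a backward pass would have to traverse all N repetitions.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Relax a categorical sample: add Gumbel(0,1) noise, then softmax at
    # temperature tau (low tau -> closer to a hard one-hot sample).
    rng = np.random.default_rng() if rng is None else rng
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max()          # for numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([0.5, 1.0, -0.5])
W = rng.normal(size=(3, 3)) * 0.1   # hypothetical stand-in for one decoder step
N = 8                               # rollout length

# Each iteration's soft sample is consumed by the next iteration, so the
# computational graph (in a framework with autodiff) would repeat N times.
for _ in range(N):
    y = gumbel_softmax(logits, tau=0.5, rng=rng)
    logits = W @ y
```

Backpropagating through this chain multiplies N per-step Jacobians together, which is where the gradient-noise (and compounding-bias) concern in the comment comes from.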