Submitted by SaltyStackSmasher t3_11euzja in MachineLearning
cnapun t1_jai24sf wrote
Reply to comment by SaltyStackSmasher in [D] backprop through beam sampling ? by SaltyStackSmasher
What I was trying to say was that this sampling approach (in a transformer) seems like it would have similar issues to an RNN: your computational graph gets repeated N times, where N is the rollout size. That makes me suspect you'll get a lot of noise in your gradient estimates when N is large (also, iirc, Gumbel-softmax gradients are biased, which might cause more issues when chaining them).
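To make the "graph repeated N times" point concrete, here's a minimal NumPy sketch of chaining Gumbel-softmax samples through a rollout. All names (`gumbel_softmax`, the stand-in decoder matrix `W`) are hypothetical, and there's no autodiff here — it just shows the forward structure: each step's relaxed sample feeds the next step's logits, so a backward pass would have to traverse all N repetitions.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Relax a categorical sample: add Gumbel(0,1) noise, then softmax at
    # temperature tau (low tau -> closer to a hard one-hot sample).
    rng = np.random.default_rng() if rng is None else rng
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max()          # for numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([0.5, 1.0, -0.5])
W = rng.normal(size=(3, 3)) * 0.1   # hypothetical stand-in for one decoder step
N = 8                               # rollout length

# Each iteration's soft sample is consumed by the next iteration, so the
# computational graph (in a framework with autodiff) would repeat N times.
for _ in range(N):
    y = gumbel_softmax(logits, tau=0.5, rng=rng)
    logits = W @ y
```

Backpropagating through this chain multiplies N per-step Jacobians together, which is where the gradient-noise (and compounding-bias) concern in the comment comes from.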