
mocny-chlapik t1_iqlt8jo wrote

It's about the speed of computation, not the complexity of the definition. If you need to calculate the function a million or even a billion times for each sample, it makes sense to optimize it.

22

cthorrez OP t1_iqmv564 wrote

I'm not really convinced by this. I'd bet the logistic sigmoid is a little bit faster, but I highly doubt the difference between a logistic sigmoid and a Gaussian-CDF final activation could even be detected when training a transformer model. The other layers are the main cost.
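For anyone who wants to check the per-call cost themselves, here's a minimal pure-Python sketch (function names are mine) timing the logistic sigmoid against the standard normal CDF, the "Gaussian sigmoid" being discussed. Real training code would use vectorized framework ops, so treat this only as a rough sanity check of the relative cost, not as representative of transformer training.

```python
import math
import timeit

def logistic(x):
    # Logistic sigmoid: 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def gaussian_cdf(x):
    # "Gaussian sigmoid": standard normal CDF, computed via erf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# A grid of inputs in [-3, 3)
xs = [i / 1000.0 - 3.0 for i in range(6000)]

t_logistic = timeit.timeit(lambda: [logistic(x) for x in xs], number=100)
t_gauss = timeit.timeit(lambda: [gaussian_cdf(x) for x in xs], number=100)
print(f"logistic: {t_logistic:.3f}s  gaussian_cdf: {t_gauss:.3f}s")
```

Both functions map 0 to 0.5 and saturate at 0 and 1, which is why either works as a final activation for a probability output; any timing gap here applies to a single elementwise op, which is tiny next to the matmuls in the other layers.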

Also, people do all sorts of things that increase cost. A good example is GELU vs ReLU: GELU adds Gaussian calculations to every layer, and people still use it.
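To make the GELU-vs-ReLU point concrete, here's a small sketch (my own function names). Exact GELU is x * Phi(x) with Phi the standard normal CDF, so it runs a Gaussian computation at every activation; ReLU is a single comparison. Implementations often substitute a tanh approximation, which is itself evidence that people accept extra per-element cost for the smoother activation.

```python
import math

def relu(x):
    # ReLU: just a max with zero
    return max(x, 0.0)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation of GELU commonly used in practice
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

So every GELU layer is already paying for erf (or a tanh surrogate) on every element, which supports the argument that one Gaussian CDF in the final activation wouldn't be the bottleneck.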

−1