Submitted by cthorrez t3_xsq40j in MachineLearning
cthorrez OP t1_iqlr7nr wrote
Reply to comment by ClearlyCylindrical in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
The derivative of the log of any CDF is also nice: d/dx log CDF(x) = PDF(x)/CDF(x).
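That identity is easy to check numerically for, say, the standard normal. A minimal sketch (function names are just for illustration, using only the standard library):

```python
import math

def norm_pdf(x):
    # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def log_cdf_grad(x, eps=1e-6):
    # central-difference derivative of log CDF(x)
    return (math.log(norm_cdf(x + eps)) - math.log(norm_cdf(x - eps))) / (2 * eps)

x = 0.7
analytic = norm_pdf(x) / norm_cdf(x)  # PDF(x)/CDF(x)
numeric = log_cdf_grad(x)
print(abs(analytic - numeric) < 1e-5)
```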
Plus we have autograd these days. Complicated derivatives can't hold us back anymore haha.
mocny-chlapik t1_iqlt8jo wrote
It's about the speed of computation, not the complexity of the definition. If you need to evaluate the function a million or even a billion times per sample, it makes sense to optimize it.
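A toy micro-benchmark of the two link functions in question (pure Python, illustrative only; real workloads would be vectorized, so take the timings with a grain of salt):

```python
import math
import timeit

def logistic(x):
    # logistic sigmoid: 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def probit(x):
    # Gaussian CDF (the "Gaussian sigmoid") via erf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

xs = [i / 1000.0 - 3.0 for i in range(6000)]
t_logistic = timeit.timeit(lambda: [logistic(x) for x in xs], number=50)
t_probit = timeit.timeit(lambda: [probit(x) for x in xs], number=50)
print(f"logistic: {t_logistic:.3f}s  probit: {t_probit:.3f}s")
```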
cthorrez OP t1_iqmv564 wrote
I'm not really convinced by this. I bet the logistic sigmoid is a little faster, but I highly doubt the difference between a logistic sigmoid and a Gaussian-CDF (probit) final activation could even be detected when training a transformer model. The other layers are the main cost.
Also, people do all sorts of experiments that increase cost. A good example is GELU vs. ReLU: GELU adds a Gaussian computation to every layer, and people still use it.
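For reference, exact GELU is x·Φ(x), where Φ is the standard normal CDF, so it really does put a Gaussian evaluation in every layer. A minimal sketch (the tanh form is the common approximation used to avoid erf):

```python
import math

def relu(x):
    # rectified linear unit
    return max(0.0, x)

def gelu(x):
    # exact GELU: x * Phi(x), Phi = standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # widely used tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

print(relu(1.0), gelu(1.0), gelu_tanh(1.0))
```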