cthorrez OP t1_iqn3vy4 wrote
Reply to comment by percevalw in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
Thank you! This pretty much answers my question. Though I don't think it makes sense to bundle log loss and logistic regression. As I mentioned in my post, probit regression also uses log loss.
The only difference is how the model produces a probability prediction. The paper you linked provides a great motivation for using the logistic sigmoid over other sigmoids.
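To make that concrete, here's a minimal sketch (my own, not from the paper): both models minimize the same log loss and differ only in the link function mapping a linear score to a probability.

```python
# Logistic vs. probit regression: same log loss, different link function.
import numpy as np
from scipy.stats import norm

def log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_link(z):
    return 1.0 / (1.0 + np.exp(-z))  # logistic CDF

def probit_link(z):
    return norm.cdf(z)               # Gaussian CDF

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])  # linear scores w.x
y = np.array([0, 0, 1, 1, 1])

print(log_loss(y, logistic_link(z)))  # logistic regression loss
print(log_loss(y, probit_link(z)))    # probit regression loss
```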
cthorrez OP t1_iqmvcze wrote
Reply to comment by jesuslop in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
Exactly, lots of people use GELU now (a more expensive version that's built on the Gaussian distribution).
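For reference, a minimal sketch of the exact GELU definition, which weights the input by the standard Gaussian CDF:

```python
# Exact GELU: GELU(x) = x * Phi(x), where Phi is the standard normal CDF.
import numpy as np
from scipy.stats import norm

def gelu(x):
    # Frameworks often substitute a cheaper tanh approximation for norm.cdf.
    return x * norm.cdf(x)

print(gelu(np.array([-1.0, 0.0, 1.0])))  # ~[-0.159, 0.0, 0.841]
```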
cthorrez OP t1_iqmv564 wrote
Reply to comment by mocny-chlapik in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
I'm not really convinced by this. I bet the logistic sigmoid is a little faster, but I highly doubt the difference between a logistic sigmoid and a Gaussian CDF as the final activation could even be detected when training a transformer model. The other layers are the main cost.
Also, people run all sorts of experiments that increase cost. A good example is GELU vs. ReLU: GELU adds Gaussian calculations to every layer, and people still use it.
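A rough micro-benchmark sketch (my own, just to illustrate the order of magnitude) of the per-element cost of the two activations:

```python
# Compare the cost of the logistic sigmoid vs. the Gaussian CDF on 1M floats.
# On a real transformer, the matmul-dominated layers dwarf either one.
import timeit
import numpy as np
from scipy.stats import norm

x = np.random.randn(1_000_000)

t_logistic = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)
t_gaussian = timeit.timeit(lambda: norm.cdf(x), number=100)
print(f"logistic sigmoid: {t_logistic:.3f}s, Gaussian CDF: {t_gaussian:.3f}s")
```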
cthorrez OP t1_iqlrf1v wrote
Reply to comment by its_ean in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
I'm not necessarily saying it should be replaced in every layer, but I think it would at least make sense to investigate other options for generating the final probability. tanh is definitely good as an intermediate-layer activation.
cthorrez OP t1_iqlr7nr wrote
Reply to comment by ClearlyCylindrical in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
The derivative of the log of any CDF is also nice: d/dx log CDF(x) = PDF(x)/CDF(x).
Plus we have autograd these days. Complicated derivatives can't hold us back anymore haha.
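As a quick sanity check (my own sketch), autograd recovers that identity for the Gaussian CDF directly:

```python
# Verify d/dx log CDF(x) = PDF(x)/CDF(x) for the standard normal via autograd.
import torch

x = torch.tensor(0.5, requires_grad=True)
normal = torch.distributions.Normal(0.0, 1.0)

torch.log(normal.cdf(x)).backward()

manual = torch.exp(normal.log_prob(x)) / normal.cdf(x)  # PDF(x)/CDF(x)
print(x.grad, manual)  # both ~0.509
```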
cthorrez OP t1_iqnk270 wrote
Reply to comment by chatterbox272 in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
Well, it's an even worse final-output activation for binary classification because its outputs range from -1 to 1, not 0 to 1.
I've never seen it used as anything but an internal activation.
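For what it's worth, tanh is just an affinely rescaled logistic sigmoid, so mapping its outputs into [0, 1] gives you back the sigmoid anyway. A one-line check of the identity (my own sketch):

```python
# Check the identity sigmoid(x) = (tanh(x / 2) + 1) / 2.
import numpy as np

x = np.linspace(-3, 3, 7)
sigmoid = 1.0 / (1.0 + np.exp(-x))
print(np.allclose(sigmoid, (np.tanh(x / 2) + 1) / 2))  # True
```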