cthorrez OP t1_iqnk270 wrote on October 1, 2022 at 6:30 PM

Reply to comment by chatterbox272 in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez

Well, it's an even worse final output activation for binary classification, because its outputs range from -1 to 1 rather than 0 to 1.

I've never seen it used as anything but an internal activation.

cthorrez OP t1_iqn3vy4 wrote on October 1, 2022 at 4:37 PM

Reply to comment by percevalw in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez

Thank you! This pretty much answers my question. Though I don't think it makes sense to bundle log loss and logistic regression. Like I mentioned in my post, probit regression also uses log loss.

The only difference is how the model makes a probability prediction. The paper you linked provides a great motivation for using logistic sigmoid over another sigmoid.
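
To make the point concrete, here is a minimal sketch (not taken from the linked paper) in which the same log loss is applied to a logistic and a probit prediction. The score `z` and label `y` are made-up values for illustration, and the function names are my own.

```python
import math

def logistic_cdf(z):
    """Logistic sigmoid: CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_cdf(z):
    """CDF of the standard normal distribution, as used by probit regression."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_loss(y, p):
    """Binary log loss for a label y in {0, 1} and a predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Hypothetical linear score (e.g. w*x + b) and label, chosen for illustration.
z = 0.8
y = 1

p_logit = logistic_cdf(z)    # logistic regression's probability
p_probit = gaussian_cdf(z)   # probit regression's probability

# The loss function is identical; only the score-to-probability map differs.
loss_logit = log_loss(y, p_logit)
loss_probit = log_loss(y, p_probit)
print(p_logit, loss_logit)
print(p_probit, loss_probit)
```

Both models are trained against the same `log_loss`; swapping `logistic_cdf` for `gaussian_cdf` is the whole difference.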

cthorrez OP t1_iqmvcze wrote on October 1, 2022 at 3:36 PM

Reply to comment by jesuslop in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez

Exactly, lots of people use GELU now. (A more expensive version that uses the Gaussian distribution...)

cthorrez OP t1_iqmv564 wrote on October 1, 2022 at 3:35 PM

Reply to comment by mocny-chlapik in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez

I'm not really convinced by this. I bet the logistic sigmoid is a little bit faster, but I highly doubt the difference between a logistic sigmoid and a Gaussian sigmoid final activation could even be detected when training a transformer model. The other layers are the main cost.

Also, people do all sorts of experiments which increase cost. A good example is GELU vs ReLU. GELU adds Gaussian calculations to every layer and people still use it.
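
For reference, a small sketch of the two activations, assuming the exact (erf-based) form of GELU rather than the tanh approximation some libraries use. It shows where the extra Gaussian work comes from: GELU multiplies the input by the standard normal CDF.

```python
import math

def relu(x):
    """ReLU: a single comparison per element."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-1.0, 0.0, 1.0):
    print(x, relu(x), gelu(x))
```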

cthorrez OP t1_iqlrf1v wrote on October 1, 2022 at 8:51 AM

Reply to comment by its_ean in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez

I'm not necessarily saying it should be replaced in every layer, but I think it would at least make sense to investigate other options for final probability generation. tanh is definitely good for intermediate layer activations.
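
As a side note on why tanh is not a drop-in final activation: its range is (-1, 1), and rescaling it to (0, 1) just gives back a logistic sigmoid with a doubled input, since tanh(x) = 2*sigmoid(2x) - 1. A quick sketch to check that (the helper name `tanh_as_probability` is made up for illustration):

```python
import math

def sigmoid(x):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_as_probability(x):
    """Rescale tanh from (-1, 1) to (0, 1); this equals sigmoid(2x)."""
    return 0.5 * (math.tanh(x) + 1.0)

for x in (-2.0, 0.0, 2.0):
    # Columns: input, raw tanh, rescaled tanh, sigmoid(2x)
    print(x, math.tanh(x), tanh_as_probability(x), sigmoid(2.0 * x))
```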

cthorrez OP t1_iqlr7nr wrote on October 1, 2022 at 8:48 AM

Reply to comment by ClearlyCylindrical in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez

The derivative of the log of any CDF is also nice: d/dx log CDF(x) = PDF(x)/CDF(x).

Plus we have autograd these days. Complicated derivatives can't hold us back anymore haha.
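
A quick numerical check of the PDF(x)/CDF(x) identity for both the logistic and the Gaussian case. This sketch uses a central finite difference from the standard library as a stand-in for autograd; the evaluation point `x = 0.3` and the step size are arbitrary choices.

```python
import math

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_pdf(x):
    s = logistic_cdf(x)
    return s * (1.0 - s)

def gaussian_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def numeric_dlog(cdf, x, h=1e-6):
    """Central finite-difference estimate of d/dx log cdf(x)."""
    return (math.log(cdf(x + h)) - math.log(cdf(x - h))) / (2.0 * h)

x = 0.3  # arbitrary evaluation point
for name, cdf, pdf in (
    ("logistic", logistic_cdf, logistic_pdf),
    ("gaussian", gaussian_cdf, gaussian_pdf),
):
    analytic = pdf(x) / cdf(x)       # PDF(x) / CDF(x)
    numeric = numeric_dlog(cdf, x)   # finite-difference derivative of log CDF
    print(name, analytic, numeric)
```

For the logistic case the ratio simplifies to 1 - CDF(x), which is the familiar tidy form; the Gaussian ratio has no such simplification but is just as easy to evaluate.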