cthorrez

cthorrez OP t1_iqn3vy4 wrote

Thank you! This pretty much answers my question. Though I think don't think it makes sense to bundle log loss and logistic regression. Like I mentioned in my post probit regression also uses log loss.

The only difference is how the model makes a probability prediction. The paper you linked provides a great motivation for using logistic sigmoid over another sigmoid.

4

cthorrez OP t1_iqmv564 wrote

I'm not really convinced by this. I bet sigmoid is a little bit faster but I highly doubt the difference between logistic sigmoid and gaussian sigmoid final activation could even be detected when training a transformer model. The other layers are the main cost.

Also people do all sorts of experiments which increase cost. A good example is gelu vs relu. This adds gaussian calculations to every layer and people still do it.

−1