Lugi OP t1_iqodhe1 wrote
Reply to comment by VenerableSpace_ in [D] Focal loss - why it scales down the loss of minority class? by Lugi
Yes, but the problem here is that while they mention that in the paper, they ultimately use an alpha of 0.25, which weighs down the minority (foreground) class, while the background (majority) class gets a scaling of 0.75. This is what I'm concerned about.
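To make the scaling concrete, here is a minimal sketch of the alpha-balanced focal loss as the paper defines it, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the probabilities in the example are assumed toy values:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one binary prediction.
    p: predicted foreground probability; y: 1 = foreground, 0 = background."""
    p_t = p if y == 1 else 1 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1 - alpha  # foreground gets 0.25, background 0.75
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# The rare foreground class is scaled by 0.25, the abundant background by 0.75:
print(focal_loss(0.9, 1))  # foreground example, weighted by alpha = 0.25
print(focal_loss(0.9, 0))  # background example, weighted by 1 - alpha = 0.75
```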
Lugi OP t1_iqoahos wrote
Reply to comment by VenerableSpace_ in [D] Focal loss - why it scales down the loss of minority class? by Lugi
Yes, but I am specifically using the alpha-balanced version, which they used in a counterproductive way.
Lugi OP t1_iqoa9pp wrote
Reply to comment by you-get-an-upvote in [D] Focal loss - why it scales down the loss of minority class? by Lugi
>The alpha used in the paper is the inverse of the frequency of the class. So class1 is scaled by 4 (i.e. 1 / 0.25) and class2 is scaled by 1.33 (1/0.75).
They say it CAN be set like that, but they explicitly set it to 0.25. This is why I am confused: they put that statement in and then did the complete opposite.
Lugi OP t1_iqndvf5 wrote
Reply to comment by Naive_Coconut_Cook in [D] Focal loss - why it scales down the loss of minority class? by Lugi
Nothing like that happens in the data. In object detection you cannot really address the imbalance by resampling, since the imbalance is inherent to the way targets are generated: you get a huge number of negative (background) targets and only a few positive objects to detect.
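Rough back-of-the-envelope numbers (all counts assumed, in the ballpark of a typical single-stage detector) show why resampling images does not help:

```python
# Targets are generated per anchor, so the negatives vastly outnumber
# the positives regardless of how the images themselves are sampled.
positions = 100 * 100                            # feature-map locations (assumed)
anchors_per_position = 9                         # anchors per location (assumed)
num_anchors = positions * anchors_per_position   # 90,000 targets per image
num_positive = 10                                # anchors matched to objects (assumed)
print(num_positive / num_anchors)                # ~1e-4 positive fraction
```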
Lugi t1_itiq8ir wrote
Reply to comment by rehrev in [D] What things did you learn in ML theory that are, in practice, different? by 4bedoe
>Otherwise, what does model complexity even mean?
People generally refer to bigger models (more parameters) as more complex.
Come to think of it, redundancy in networks with more parameters can act as a regularizer, by giving similar branches an essentially higher learning rate and making them less prone to overfitting. Let me give you an example of what I have in mind: a simple network with just one parameter, y = wx. You can pass some data through it, calculate the loss, backpropagate to get the gradient, and update the weight with it.
But see what happens if we reparametrize w as w1 + w2: the gradient for each of these is the same as in the single-parameter case, but after the weight update we end up moving twice as far, which is equivalent to the original one-parameter case with a 2x larger learning rate (see the sketch below).
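A quick numeric sketch of that argument, with assumed toy values for the input, target, and learning rate:

```python
lr, x, t = 0.1, 2.0, 3.0   # assumed toy values: learning rate, input, target

# Case 1: single parameter, y = w * x, loss = (w*x - t)**2
w = 0.5
grad_w = 2 * (w * x - t) * x        # dL/dw = -8 for these numbers
w_after = w - lr * grad_w

# Case 2: the same model with w reparametrized as w1 + w2
w1 = w2 = 0.25                      # w1 + w2 == w, so identical predictions
grad = 2 * ((w1 + w2) * x - t) * x  # dL/dw1 == dL/dw2 == grad_w
w1_after, w2_after = w1 - lr * grad, w2 - lr * grad

print(w_after - w)                        # step size: 0.8
print((w1_after + w2_after) - (w1 + w2))  # step size: 1.6, twice as far
```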
Another thing that could be linked to this phenomenon: on one hand, the parameter space of a neural network with one hidden layer grows exponentially with the number of neurons, while on the other hand the number of equivalent minima grows factorially (permuting the hidden neurons leaves the function unchanged), so past a certain number of neurons the factorial takes over and your optimization problem becomes much simpler, because you are always close to a desired minimum. But I don't know shit about high-dimensional math, so don't quote me on that.
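Just to illustrate the growth-rate part of that claim numerically: n! eventually dominates c^n for any fixed base c (the base 10 here is an arbitrary assumption for the "exponential" side):

```python
import math

# Ratio n! / c**n: once it exceeds 1, the factorial has taken over.
c = 10
for n in (5, 10, 15, 20, 25, 30):
    print(n, math.factorial(n) / c ** n)
```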