Submitted by Lugi t3_xt01bk in MachineLearning

The equation of α-balanced focal loss (binary in this case for simplicity) is given by:

https://preview.redd.it/39hgb62728r91.png?width=718&format=png&auto=webp&s=8064189fe0dcd7dc4a04b24bd8acc837d12240ea
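In case the image does not load: the formula below is transcribed from the linked paper's definition rather than from the image itself, so the notation may differ slightly from the picture. Here p is the predicted probability of class 1 and y is the label:

```latex
\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma} \log(p_t),
\qquad
p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases},
\qquad
\alpha_t = \begin{cases} \alpha & \text{if } y = 1 \\ 1 - \alpha & \text{otherwise} \end{cases}
```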

What puzzles me is that the weighting used here seems opposite to what is intuitive when dealing with imbalanced datasets: normally you would scale the loss of class 1 (minority - foreground objects in the case of object detection) higher than that of class 0 (majority - background). However, what happens here is that class 1 is scaled by 0.25 and class 0 by 0.75.

Is this behavior explained anywhere? I don't think I'm getting the foreground/background labels wrong, as I've looked into multiple implementations as well as the original paper. Or am I missing some crucial detail?

Paper for reference: https://arxiv.org/abs/1708.02002

7

Comments


Lugi OP t1_iqndvf5 wrote

Nothing like this happens in the data. In object detection you cannot really address the imbalance by resampling the data, since the imbalance is inherent to the way targets are generated: you have a lot of 0 targets because there is a lot of background, and only a few positive objects to be detected.

1

you-get-an-upvote t1_iqnunxf wrote

The alpha used in the paper is the inverse of the frequency of the class. So class1 is scaled by 4 (i.e. 1 / 0.25) and class2 is scaled by 1.33 (1/0.75).

But also I want to take this moment to talk about focal loss.

The point of focal loss really isn't downweighting common classes. Note that the original definition of focal loss in the paper doesn't use α. The formula you give is the "α-balanced variant of focal loss" which the authors "adopt in [their] experiments as it yields slightly improved accuracy over the non-α-balanced form".

What focal loss does do is decrease the importance of "easy" examples on the loss -- that is, it decreases the importance of examples that the model gets very correct. When datasets are imbalanced, common classes tend to be "easy" in this sense.

For example, consider a dataset that is 99% classA and 1% classB. A trivial model will predict that every datapoint has a 99% chance of being classA, which will result in a very low loss for classA datapoints and a very high loss for classB datapoints.
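To put rough numbers on that, here is a minimal sketch in plain Python. It assumes γ = 2 and the non-α form of focal loss; the function names are made up for illustration and are not from the paper or any library.

```python
import math

def cross_entropy(p_t):
    """Cross-entropy given the probability assigned to the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Non-alpha focal loss: cross-entropy scaled by (1 - p_t) ** gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

# Trivial model: always predicts 99% classA and 1% classB.
p_t_class_a = 0.99  # probability of the true class for a classA datapoint
p_t_class_b = 0.01  # probability of the true class for a classB datapoint

for name, p_t in [("classA", p_t_class_a), ("classB", p_t_class_b)]:
    print(f"{name}: CE={cross_entropy(p_t):.6f}  FL={focal_loss(p_t):.6f}")
```

With these settings the classA datapoint's loss is multiplied by (1 - 0.99)^2, a factor of roughly 10,000 smaller, while the classB datapoint keeps about 98% of its cross-entropy.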

Note, though, that these are not the same thing, since the more common class doesn't have to be the easier one. Suppose I train a model on CIFAR10 but add an additional "image is a solid color" class. Even if this extra class has only 10% of the datapoints of the other classes, it's so easy to classify compared to the other classes that focal loss will assign it lower weight.

13

VenerableSpace_ t1_iqo5opu wrote

Focal loss downweights "well-classified" examples. It happens that the minority class typically is not well classified because in a given mini-batch the average gradient will be dominated by the majority class.

Technically, focal loss downweights the loss for all examples; it just happens to downweight well-classified examples significantly more than poorly classified ones (I'm drawing a hard distinction between the two, but it's really a smooth downweighting).

3

killver t1_iqo6z39 wrote

Alpha in focal loss has confused me and others before. I do not understand why they built their paper writeup so heavily around it, as it was not really the contribution of the paper.

I would suggest using a non-alpha variant in your experiments, thinking of alpha only as the usual way of up/downscaling classes, and adding it later.

2

Lugi OP t1_iqoa9pp wrote

>The alpha used in the paper is the inverse of the frequency of the class. So class1 is scaled by 4 (i.e. 1 / 0.25) and class2 is scaled by 1.33 (1/0.75).

They say it CAN be set like that, but they explicitly set it to 0.25. This is why I am confused: they put that statement in and then did the complete opposite.

3

VenerableSpace_ t1_iqocr2s wrote

The alpha term uses inverse class frequency to downweight the loss. So if there is a 3:1 ratio of majority:minority, alpha_majority = 0.25 and alpha_minority = 0.75.

1

Lugi OP t1_iqodhe1 wrote

Yes, but the problem is that while they mention that in the paper, in the end they use an alpha of 0.25, which weighs down the minority (foreground) class, while the background (majority) class gets a scaling of 0.75. This is what I'm concerned about.

2

VenerableSpace_ t1_iqorbnu wrote

Ahh, I see now; it's been a while since I read that paper. So they chalk it up to the interaction between alpha and the focal term. You can see in Table 1b how they need a non-intuitive value for alpha once they introduce the focal loss term, especially when gamma > 0.5.

2

chatterbox272 t1_iqp67eq wrote

It is most likely because the focal term ends up over-emphasizing the rare class term for their task. The focal loss up-weights hard samples (most of which will usually be the rare/object class) and down-weights easy samples (background/common class). The alpha term is therefore being set to re-adjust the background class back up, so it doesn't become too easy to ignore. They inherit the nomenclature from cross entropy, but they use the term in a different way and are clear as mud about it in the paper.
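A rough sketch of that interaction, assuming the paper's α = 0.25 and γ = 2. The function name and the example probabilities are hypothetical, picked only for illustration:

```python
import math

def alpha_balanced_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary alpha-balanced focal loss for a single example.

    p is the predicted probability of class 1 (foreground); y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical predictions: the network says "background" (p = 0.05) both times.
easy_background = alpha_balanced_focal_loss(p=0.05, y=0)  # correct, easy
hard_foreground = alpha_balanced_focal_loss(p=0.05, y=1)  # wrong, hard

# Same two examples with alpha = 0.5, i.e. no class re-balancing.
easy_background_sym = alpha_balanced_focal_loss(p=0.05, y=0, alpha=0.5)
hard_foreground_sym = alpha_balanced_focal_loss(p=0.05, y=1, alpha=0.5)

ratio = hard_foreground / easy_background
ratio_sym = hard_foreground_sym / easy_background_sym
print(f"foreground/background loss ratio, alpha=0.25: {ratio:.0f}")
print(f"foreground/background loss ratio, alpha=0.50: {ratio_sym:.0f}")
```

Even with α = 0.25 the hard foreground example still outweighs the easy background one by a few thousand times here; α only cuts that ratio to a third of its α = 0.5 value, which nudges the background back up rather than reversing the emphasis.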

6

I_draw_boxes t1_iqvuh8g wrote

>The alpha term is therefore being set to re-adjust the background class back up, so it doesn't become too easy to ignore.

This is it. The background in RetinaNet far exceeds the foreground, so the default prediction of the network will be background, which generates very little loss per anchor in their formulation. Focal loss without alpha is symmetrical, but the targets and behavior of RetinaNet are not.

Alpha might be intended to bring up the loss for common negative examples to keep it in balance with foreground loss. It might also be intended to bring up the loss for false positives which are even more rare than foreground.

2