
RShuk007 t1_izpxx5b wrote

InsightFace uses ResNet-50/100 or ViT-B/L backbones for best performance; those are deep models that capture a lot. It seems that, because there were no synthetic cartoons in the training data, the model doesn't learn whether a face is human, but only whether the face has human proportions/shape/topography?

You can check this out by implementing

https://arxiv.org/abs/2110.11001

Or

https://arxiv.org/abs/1610.02391

on your models. Both papers fall under explainable AI, a field that tries to show where a model looks when it makes its final decision. In this case I can see the model looks at the T region and the mouth to make decisions; when the face is occluded, it only looks at the T region with the eyes, and lower-than-usual resolution of real images does not seem to change the model's attention. This points to a lack of understanding of human face texture and detail. (A rough Grad-CAM sketch is below.)
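Here is a minimal, hand-rolled Grad-CAM sketch (the second paper above) in PyTorch. It uses a torchvision ResNet-50 as a stand-in classifier; the actual InsightFace backbone, the target layer, and the preprocessing are assumptions you would swap in for your own setup.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Stand-in classifier; swap in your own model and target layer.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

feats, grads = {}, {}

def fwd_hook(module, inputs, output):
    feats["act"] = output.detach()

def bwd_hook(module, grad_input, grad_output):
    grads["grad"] = grad_output[0].detach()

# Hook the last convolutional block (layer4 for ResNet-50).
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

def grad_cam(x, class_idx=None):
    """Return an (H, W) heatmap of where the model looks for its decision."""
    logits = model(x)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    act = feats["act"]                             # (1, C, h, w) activations
    grad = grads["grad"]                           # (1, C, h, w) gradients
    weights = grad.mean(dim=(2, 3), keepdim=True)  # channel importance (GAP of grads)
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))  # weighted sum + ReLU
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Usage: preprocess a face image to a (1, 3, 224, 224) tensor `x`, then call
# heatmap = grad_cam(x) and overlay it on the image to see which regions
# (T zone, mouth, ...) drive the prediction.
```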

I can see this using a custom package I developed for my work; however, I can't show the results here due to confidentiality.


RShuk007 t1_izpydka wrote

A simple retraining will probably do the trick: fine-tune for a few epochs, updating only the parameters of the later classifier layers. I believe the encoder is still good, so you can keep the backbone (ResNet-50 or ViT) frozen. A rough sketch of that setup is below.
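A minimal sketch of that frozen-backbone fine-tune, again with a torchvision ResNet-50 standing in for the InsightFace checkpoint and a hypothetical two-class (real face vs. cartoon) head; the data loader, labels, and learning rate are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in backbone; in practice load the InsightFace ResNet-50 / ViT checkpoint instead.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze every backbone parameter so only the new head is trained.
for p in model.parameters():
    p.requires_grad = False

# Replace the final classifier layer with a fresh 2-class head
# (its parameters require grad by default).
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
criterion = nn.CrossEntropyLoss()

def train_epoch(loader, device="cpu"):
    """One pass over a small labelled set; a few epochs is usually enough
    when the frozen encoder features are already good."""
    model.to(device).train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```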


abhijit1247 OP t1_izr0f8f wrote

This is a great insight. Thanks for the help.
