Submitted by MazenAmria t3_zhvwvl in deeplearning
suflaj t1_izoh23q wrote
As someone who tried finetuning on SWIN as part of my graduate thesis, I will warn you that you shouldn't expect good results on the Tiny version. No matter what detector I used it performed worse than the ancient RetinaNet for some reason... Regression was near perfect, albeit with many duplicate detections, but classification was complete garbage, getting me up to 0.45 mAP (whereas Retina can get like 0.8 no problem)
So, take at least the small version.
MazenAmria OP t1_izonquh wrote
That's sad; I'm starting to believe that this research idea is impractical or, maybe more accurately, overly ambitious.
suflaj t1_izorabe wrote
I don't think it's SWIN per se. I think it's a combination of the detectors (which take 5 feature maps at different levels of detail) being a poor fit for SWIN's 4 transformer stages, which lack the spatial bias that convolutional networks provide, and the Tiny model simply being too small.
Other than that, pretraining (near-)SOTA models has been impractical for anyone other than big corporations for quite some time now. But you could always try asking your mentor for your uni's compute - my faculty offered GPUs ranging from 1080Tis to A100s.
Although I don't see why you insist on pretraining SWIN yourself - many SWIN models pretrained on ImageNet are already available, not only as part of MMCV but on Huggingface as well. So you just have to do the distillation part on some part of the pretraining input distribution.
MazenAmria OP t1_izrgk9j wrote
I'm already using a pretrained model as the teacher model. But the distillation itself has nearly the cost of training a model from scratch. I'm not insisting on it, but I feel like I'm doing something wrong and need some advice (note that I've only had theoretical experience in these areas of research; this is the first time I'm doing it practically).
Thanks for your comments.
suflaj t1_izruvvi wrote
That makes no sense. Are you sure you're not doing backprop on the teacher model? It should be a lot less resource intensive.
Furthermore, check how you're distilling the model, i.e. which layers and which weights. Generally, for transformer architectures, you distill the first (embedding) layer, the attention and hidden layers, and the final (prediction) layer. Distilling only the prediction layer works poorly.
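A minimal PyTorch sketch of that layer-wise distillation loss (the dict layout and layer names here are hypothetical - adapt them to however your models expose embeddings and hidden states):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, T=2.0):
    """Match the student to the teacher at three levels: the embedding
    layer, the intermediate hidden states, and the softened prediction
    logits. Both arguments are dicts with keys 'embeddings', 'hidden'
    (a list of tensors), and 'logits' (hypothetical structure)."""
    # MSE on the embedding layer output
    emb_loss = F.mse_loss(student_out["embeddings"], teacher_out["embeddings"])
    # MSE on intermediate hidden states (assumes matching shapes;
    # when dims differ you'd add a learned projection on the student side)
    hid_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out["hidden"], teacher_out["hidden"])
    ) / len(student_out["hidden"])
    # KL divergence on temperature-softened logits, scaled by T^2
    # to keep gradient magnitudes comparable across temperatures
    kd_loss = F.kl_div(
        F.log_softmax(student_out["logits"] / T, dim=-1),
        F.softmax(teacher_out["logits"] / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return emb_loss + hid_loss + kd_loss
```

The three terms are often weighted differently; equal weights here are just a starting point.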
MazenAmria OP t1_izt68w9 wrote
I'm using `with torch.no_grad():` when calculating the output of the teacher model.
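For reference, a minimal sketch of what a distillation training step with a no-grad teacher typically looks like (the function and argument names are illustrative, not from the thread):

```python
import torch

def distill_step(student, teacher, batch, loss_fn, optimizer):
    """One distillation step: teacher runs inference-only,
    student gets a normal forward + backward pass."""
    # Teacher: eval mode (deterministic dropout/norm) and no autograd
    # graph, so its activations are freed immediately
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(batch)
    # Student: normal forward + backward
    student_logits = student(batch)
    loss = loss_fn(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Besides `no_grad`, calling `teacher.eval()` matters too - otherwise dropout and batch-norm statistics make the teacher's targets noisy.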
suflaj t1_iztjolh wrote
Then it's strange. Unless you're using a similarly sized student model, there is no reason why a no_grad teacher plus a student should be as resource-intensive as training the teacher itself with backprop.
As a rule of thumb, you should be using several times less memory. How much less are you using for the same batch size in your case?
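One quick sanity check along these lines: verify that the teacher's forward pass really isn't building an autograd graph, since a retained graph (and its stored activations) is the usual cause of an unexpectedly large memory footprint. A small hypothetical helper:

```python
import torch

def graph_is_built(model, batch, use_no_grad):
    """Return True if the forward pass built an autograd graph,
    i.e. kept intermediate activations around for backward."""
    if use_no_grad:
        with torch.no_grad():
            out = model(batch)
    else:
        out = model(batch)
    # grad_fn is only set when autograd recorded the computation
    return out.grad_fn is not None
```

If the teacher's output has a `grad_fn`, the `no_grad` context isn't covering the call you think it is. On GPU you can quantify the difference directly with `torch.cuda.reset_peak_memory_stats()` before each forward and `torch.cuda.max_memory_allocated()` after.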