trajo123 t1_j0zyd7i wrote

Are you maxing out your GPU RAM even with a batch size of 1? If not, you can set the batch size to 1 and set accumulate_grad_batches (or whatever it's called in your DL framework) to whatever you want your effective batch size to be. https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html
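
Roughly what that looks like with Lightning's Trainer (model and dataloader names below are placeholders, adjust for your setup):

```python
import pytorch_lightning as pl

# batch_size=1 in the DataLoader plus accumulate_grad_batches=32 gives an
# effective batch size of 32, while only one sample sits in GPU memory at a time.
trainer = pl.Trainer(
    max_epochs=10,
    accumulate_grad_batches=32,  # gradients accumulate over 32 batches before each optimizer step
    accelerator="gpu",
    devices=1,
)
trainer.fit(model, train_dataloaders=train_loader)  # train_loader built with batch_size=1
```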

Note that your loss will never be exactly 0 unless you run into numerical issues. However, your metric of interest, such as accuracy, F1 score, etc., can still be perfect on the training set even if the loss is not 0. Can you get a perfect score on the training set? If not, it seems that your model is not big/complex enough for your training data. This is actually a good sanity check for your model building and training: being able to get a perfect score on the training set.
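
Lightning's overfit_batches flag makes this sanity check easy: it keeps reusing the same small subset of training batches so you can see whether the model can memorize them at all (sketch, names are placeholders):

```python
import pytorch_lightning as pl

# Sanity check: can the model memorize a tiny slice of the training data?
trainer = pl.Trainer(
    overfit_batches=10,   # reuse the same 10 training batches every epoch
    max_epochs=200,
    accelerator="gpu",
    devices=1,
)
trainer.fit(model, train_dataloaders=train_loader)
# If training accuracy / F1 on these batches doesn't approach 1.0,
# the model (or the training setup) is the bottleneck, not the data.
```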

Depending on the problem, you can also look into focal loss, hard-example mining, etc. But not achieving a perfect score on the training set is not necessarily a bad thing: for instance, if you have mislabelled examples in your training set, you actually want the model to assign a high loss to those. Are you sure your high-loss training examples are labelled correctly?
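
For reference, a common multi-class focal loss is only a few lines; gamma controls how strongly easy examples are down-weighted (this is just a sketch, not tied to any particular library):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, reduction="mean"):
    """Multi-class focal loss: (1 - p_t)^gamma * cross_entropy."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t) per example
    p_t = torch.exp(-ce)                                     # probability of the true class
    loss = (1.0 - p_t) ** gamma * ce                         # down-weight easy (high p_t) examples
    if reduction == "mean":
        return loss.mean()
    return loss.sum() if reduction == "sum" else loss
```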

7

Dartagnjan OP t1_j103e5a wrote

Yes, I already have batch_size=1. I am looking into sharding the model across multiple GPUs now. In my case, not being able to predict the ~1% of super hard examples means those examples have features that the model has not learned to understand yet. The labelling is very close to perfect, with mathematically proven error bounds...
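
For anyone curious, something along these lines is what I mean by sharding with Lightning (the exact strategy/precision strings depend on the Lightning version; model and train_loader are placeholders):

```python
import pytorch_lightning as pl

# Shard parameters, gradients and optimizer state across GPUs so a model
# that doesn't fit on one device can still be trained with batch_size=1.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="fsdp",            # or e.g. "deepspeed_stage_3", depending on setup/version
    precision="16-mixed",       # mixed precision further reduces memory pressure
    accumulate_grad_batches=32,
)
trainer.fit(model, train_dataloaders=train_loader)
```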

> focal loss, hard-example mining

I think these are exactly the keywords that I was missing in my search.

5

dumbmachines t1_j133fcs wrote

If focal loss is interesting, check out polyloss, which is a generalization of the focal loss idea.
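
The Poly-1 variant from that paper is basically cross-entropy plus an extra epsilon * (1 - p_t) term; something like this (sketch only, epsilon is a tunable hyperparameter):

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, targets, epsilon=1.0):
    """Poly-1 loss: cross-entropy plus an extra epsilon * (1 - p_t) term."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                     # probability assigned to the true class
    return (ce + epsilon * (1.0 - p_t)).mean()
```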

2

techni_24 t1_j123prp wrote

Maybe this is the novice in me showing, but how does reducing the batch size to 1 affect model performance? I thought it only affected the speed of training.

1

trajo123 t1_j13gu3z wrote

Reducing the batch size to 1 frees up GPU memory, which can allow you to train a bigger model and so reach a lower loss on the training set. Note that accumulate_grad_batches effectively takes on the role of batch_size when the latter is set to 1.
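
In plain PyTorch the equivalence looks roughly like this (model, criterion, optimizer and train_loader are placeholders; the scaling assumes a mean-reduced loss):

```python
accum_steps = 32  # effective batch size when the DataLoader uses batch_size=1
optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = criterion(model(x), y) / accum_steps  # scale so accumulated grads match a real batch of 32
    loss.backward()                              # gradients add up across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```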

1