Submitted by Dartagnjan t3_zqtmf7 in MachineLearning
I have a rather successful model which I have trained to the point that the loss has plateaued. The loss over my training dataset follows a power-law type curve:
That means 80% of the training examples have a loss well below my tolerance threshold, 15% have a loss slightly above the threshold, 4% have a loss significantly above the threshold, and 1% have a very high loss.
This results from the inherent complexity of the training examples themselves: some are simple, some are complex. Are there any techniques for continuing to optimize a model in such a situation? It is surely a common situation, so I assumed people had come up with strategies or algorithms for it, but my Google-fu has failed me. Please refer me to literature on the topic if it exists.
So far I have tried pre-selecting and training on the hard examples only, and I have tried multiplying the loss gradients by a scalar that depends on the loss itself. Neither approach gives me satisfactory results.
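For concreteness, the loss-dependent scaling is along these lines (a minimal PyTorch-style sketch, not my exact code; the model, optimizer, and the particular weighting function are placeholders):

```python
import torch

def loss_weighted_step(model, optimizer, x, y, loss_fn):
    """One training step where each example's gradient is scaled by a
    factor derived from its own loss (here: proportional to the loss)."""
    optimizer.zero_grad()
    raw = loss_fn(model(x), y)                               # unreduced loss
    per_example = raw.reshape(raw.shape[0], -1).mean(dim=1)  # one loss per example
    with torch.no_grad():                                    # weights treated as constants
        w = per_example / (per_example.mean() + 1e-8)        # illustrative weighting choice
    (w * per_example).mean().backward()                      # scales each example's gradient by w
    optimizer.step()
    return per_example.detach()
```

Here `loss_fn` would be something with `reduction="none"`, e.g. `torch.nn.MSELoss(reduction="none")`, so that per-example losses are available before reduction.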
Maybe it is just that the model is not complex enough, but I am already maxing out my GPU RAM (Nvidia A100s), so I cannot really do much better there. Still, I am not sure I have yet reached the limits of this model's capacity.
dumbmachines t1_j0ztjdq wrote
Have you tried something like this?
You're not able to overfit on the hard examples alone? Why not?