Submitted by Dartagnjan t3_zqtmf7 in MachineLearning
I have a rather successful model which I have trained to the point that the loss has plateaued. The loss over my training dataset follows a power-law type curve:
That means 80% of the training examples have a loss well below my tolerance threshold, 15% have a loss slightly above the threshold, 4% have a loss significantly above the threshold, and 1% have a very high loss.
This results from the inherent complexity of the training examples themselves: some are simple, some are complex. Are there any techniques for continuing to optimize a model in such a situation? It is surely a common situation, so I assumed people had come up with strategies or algorithms for it, but my Google-fu has failed me. Please refer me to literature on the topic if it exists.
So far I have tried pre-selecting and training on the hard examples only, and I have tried multiplying the loss gradients by a scalar that depends on the loss itself. Neither approach gives me satisfactory results.
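For concreteness, the loss-dependent scaling is along these lines (a minimal PyTorch-style sketch, not my exact code; the model, optimizer, and the particular weighting function are placeholders):

```python
import torch

def loss_weighted_step(model, optimizer, x, y, loss_fn):
    """One training step where each example's gradient is scaled by a
    factor derived from its own loss (here: proportional to the loss)."""
    optimizer.zero_grad()
    raw = loss_fn(model(x), y)                               # unreduced loss
    per_example = raw.reshape(raw.shape[0], -1).mean(dim=1)  # one loss per example
    with torch.no_grad():                                    # weights treated as constants
        w = per_example / (per_example.mean() + 1e-8)        # illustrative weighting choice
    (w * per_example).mean().backward()                      # scales each example's gradient by w
    optimizer.step()
    return per_example.detach()
```

Here `loss_fn` would be something with `reduction="none"`, e.g. `torch.nn.MSELoss(reduction="none")`, so that per-example losses are available before reduction.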
Maybe it is just that the model is not complex enough, but I am already maxing out my GPU RAM (Nvidia A100s), so I cannot really do much better there. Still, I am not sure I have yet reached the limits of this model's capacity.
dumbmachines t1_j0ztjdq wrote
Have you tried something like this?
You're not able to overfit on the hard examples alone? Why not?