Submitted by Dartagnjan t3_zqtmf7 in MachineLearning

I have a rather successful model which I have trained to the point that the loss has now plateaued. The loss over my training dataset follows a power-law-type curve:

https://preview.redd.it/qotu2k09237a1.png?width=825&format=png&auto=webp&s=b16ca887ce8e259f8de4a20609e35ff7f7298df9

That means 80% of the training examples have a loss well below my tolerance threshold, 15% are slightly above it, 4% are significantly above it, and 1% have a very high loss.

This results from the inherent complexity of the training examples themselves: some are simple, some are complex. So I was wondering, are there any techniques for continuing to optimize a model when you encounter such a situation? Surely this situation is very common, so maybe people have come up with strategies or algorithms for it, but my Google-fu has failed me. Please refer me to literature on the topic if it exists.

So far I have tried pre-selecting and training on the hard examples only, and I have tried multiplying the loss gradients by a scalar that depends on the loss itself. Neither approach gives me satisfactory results.

Maybe it is just that the model is not complex enough, but I am already maxing out my GPU RAM (Nvidia A100s), so I cannot really do much better. Then again, I am not sure I have reached the limits of complexity with this model yet.

107

Comments

dumbmachines t1_j0ztjdq wrote

Have you tried something like this?

You're not able to overfit on the hard examples alone? Why not?

20

trajo123 t1_j0zyd7i wrote

Are you maxing out your GPU RAM even with a batch size of 1? If not, you can set the batch size to 1 and set accumulate_grad_batches (or whatever it is called in your DL framework) to whatever you want your effective batch size to be. https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html
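
In plain PyTorch the accumulation loop is just a few lines; here is a minimal sketch (the model, optimizer and data are toy stand-ins, swap in your own):

```python
import torch
from torch import nn

# Toy stand-ins; swap in your real model, optimizer and data loader.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
data = [(torch.randn(1, 10), torch.randn(1, 1)) for _ in range(128)]  # batch_size=1

accumulate_grad_batches = 32  # desired effective batch size

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    # Divide so the accumulated gradient is the mean over the effective batch
    (loss / accumulate_grad_batches).backward()
    if (step + 1) % accumulate_grad_batches == 0:
        optimizer.step()
        optimizer.zero_grad()
```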

Note that your loss will never be 0 unless you run into numerical issues. However, your metric of interest, such as accuracy, F1 score, etc., can still be perfect on the training set even if the loss is not 0. Can you get a perfect score on the training set? If not, then it seems that your model is not big/complex enough for your training data. Actually, this is a good sanity check for your model building and training: being able to get a perfect score on the training set.

Depending on the problem, you can also look into focal loss, hard-example mining, etc. But not achieving a perfect score on the training set is not necessarily a bad thing. For instance, if you have mislabelled examples in your training set, then you actually want the model to assign a high loss to those. Are you sure your high-loss training examples are labelled correctly?

7

Dartagnjan OP t1_j103e5a wrote

Yes, I already have batch_size=1. I am looking into sharding the model across multiple GPUs now. In my case, not being able to predict on the 1% of super hard examples means that those examples have features that the model has not learned to understand yet. The labeling is very close to perfect, with mathematically proven error bounds...

> focal loss, hard-example mining

I think these are exactly the keywords that I was missing in my search.

5

Dartagnjan OP t1_j103ef6 wrote

  1. I have already tried my own version of selective backprop, but thanks for the link. This is exactly what I was looking for: I want to know how other people implement it and whether I did something wrong (rough sketch of what I mean below).
  2. Overfitting on the hard examples is a test that I have already carried out multiple times, but not yet on the latest experiments. Thanks for reminding me of this. If I cannot overfit, I can infer that the model's capacity is definitely too low. Still, even if I can overfit on the hard examples, it does not mean the model is able to handle easy and hard examples at the same time.
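
For context, my version is roughly along these lines (an illustrative sketch with placeholder names, not my actual code): forward the whole batch, then backprop only through the examples whose loss exceeds the tolerance threshold.

```python
import torch
from torch import nn

# Placeholder model/data; the gating on per-example loss is the relevant part
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss(reduction="none")   # keep per-example losses
loss_cutoff = 0.5                        # tolerance threshold

x, y = torch.randn(64, 10), torch.randn(64, 1)

per_example = loss_fn(model(x), y).squeeze(-1)   # shape (64,)
hard = per_example > loss_cutoff                 # mask of above-threshold examples
if hard.any():
    optimizer.zero_grad()
    per_example[hard].mean().backward()          # backprop through the hard ones only
    optimizer.step()
```
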
15

carbocation t1_j103ehe wrote

Have you tried focal loss? If I’m reading you correctly, it’s appropriate for this type of problem, although if the hard samples are distributed evenly across classes it is probably not actually going to help. I don’t think you mention what type of problem you’re solving (classification, regression, segmentation, etc.), so it’s hard to guess.
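
For reference, the binary-classification form is only a few lines; a sketch on raw logits and 0/1 targets (the gamma/alpha values are the usual defaults, not tuned for your problem):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss on raw logits and 0/1 float targets."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()    # down-weights easy examples

loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```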

11

Dartagnjan OP t1_j105tp0 wrote

It's a regression problem, but I have already tried something similar: I scaled the loss according to how hard the example is, based on a hand-crafted heuristic, but I did not get good results with it.
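
Concretely, what I tried was roughly this (an illustrative sketch; the real hardness score comes from my domain-specific heuristic):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                  # placeholder for the real model
loss_fn = nn.MSELoss(reduction="none")

x, y = torch.randn(32, 10), torch.randn(32, 1)
hardness = torch.rand(32)                 # stand-in for the hand-crafted heuristic, in [0, 1]

per_example = loss_fn(model(x), y).squeeze(-1)
weights = 1.0 + hardness                  # harder examples get proportionally more weight
loss = (weights * per_example).mean()
loss.backward()
```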

4

JustOneAvailableName t1_j107lf3 wrote

Perhaps something like keeping track of the harder data points and sampling half of each batch from those? What happened exactly when you trained on the hard examples only?
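
As a sketch of what I mean (names are made up): keep a running pool of indices whose last loss was above threshold, and build each batch half from that pool, half uniformly.

```python
import random

dataset_size, batch_size = 10_000, 32
hard_pool = set()        # indices whose most recent loss was above threshold

def make_batch_indices():
    half = batch_size // 2
    hard = random.sample(sorted(hard_pool), min(half, len(hard_pool)))
    rest = random.choices(range(dataset_size), k=batch_size - len(hard))
    return hard + rest

def update_pool(indices, losses, threshold=0.5):
    # Call after each step with the batch's indices and per-example losses
    for i, l in zip(indices, losses):
        (hard_pool.add if l > threshold else hard_pool.discard)(i)
```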

2

Dartagnjan OP t1_j108k4y wrote

That is what I have already done. So far, the loss just oscillates but remains high, which leads me to believe that either I am not training in the right way, i.e. maybe the difference between the easy and hard training examples is too drastic to bridge, or my model is just not capable of handling the harder examples.

1

solresol t1_j10nsf2 wrote

Epistemic status: I don't know what I'm talking about, and I know I'm not fully coherent. Be kind in replies.

I *think* that your data might not have a finite mean and finite variance. If so, then there's no obvious "best" regression at all. As you get more data, optimality will change. A different random subsample of data will lead to wildly different results.

I have done some research on problems like this in linguistic data, and I was able to do some dirty stuff by swapping out the underlying metric so that the notion of where "infinity" lies changed. But if you have real-valued data, I don't think this can help.

−4

-Rizhiy- t1_j10nstz wrote

Can you collect more data similar to hard examples?

People like to focus on the architecture or training techniques, but most real problems can be solved by collecting more relevant data.

If the loss remains high even after getting more data, two potential problems come to mind:

  • There is not enough information in your data to correctly predict the target.
  • Your model is not complex/expressive enough to properly estimate the target.
13

techni_24 t1_j123prp wrote

Maybe this is the novice in me showing, but how does reducing the batch size to 1 affect model performance? I thought it only affected the speed of training.

1

FreddieM007 t1_j12qcmz wrote

Since your current model is perhaps not complex or expressive enough and VRAM is limited: have you tried building a classification model first that partitions the data into 2 classes? What is the quality there? Then you could build separate regression models for each class, each using all of the VRAM.
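
Schematically something like this (a toy sketch; the gate and the two heads would be trained separately, so each can use all of the VRAM on its own):

```python
import torch
from torch import nn

gate = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))         # easy-vs-hard logit
easy_head = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))    # regressor for the easy class
hard_head = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))  # bigger regressor for the hard class

def predict(x):
    is_hard = torch.sigmoid(gate(x)) > 0.5
    return torch.where(is_hard, hard_head(x), easy_head(x))

y_hat = predict(torch.randn(16, 10))
```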

1

trajo123 t1_j13gu3z wrote

Reducing the batch size to 1 frees up memory to train a bigger model, which can reach a lower loss on the training set. Note that accumulate_grad_batches takes on the meaning of batch_size when the latter is set to 1.

1

solresol t1_j16ed0l wrote

As a real-world example that I encountered with a client that sells software to ecommerce stores... they wanted to know the number of products in a typical ecommerce store.

It turns out that there's a power law at work. If you sample N stores and count the number of products in all stores in total, you get XN products. Great! The mean is X.

But if you sample 2N stores, the number of products in total in all the stores is 4XN. That's because you have doubled your chances of finding a store that on its own has 2XN products, and the rest of the stores contribute the 2XN that you would have expected.

When you only sampled N stores, the average number of products per store was X. When you doubled the size of the sample, the average number of products was 2X.

Similar things happen to the variance.

As you increase the sample size, the average number of products per store goes up as well.

In a sane universe you would expect this to end eventually. This particular client is still growing, still analysing more stores, and they are always finding bigger and bigger stores (stores which on their own have more products than all other stores put together). Eventually they will have analysed every store in the world, and then they will be able to answer the question of "what's the average number of products in an ecommerce store that exists right now?"

But who knows? Maybe stores are being created algorithmically. It wouldn't surprise me. Certainly there will be more ecommerce stores in the future, so we probably can't answer "what's the average number of products in an ecommerce store over all time?" either.

Anyway, the punchline is: you can't sample this data to find its mean, nor can you find its variance.
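
You can see the effect in a few lines of numpy: draw from a Pareto with tail exponent below 1 (so the theoretical mean is infinite) and watch the sample mean keep climbing as you take more data.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.pareto(0.8, size=10_000_000) + 1.0   # tail index 0.8 < 1: no finite mean

for n in (10**3, 10**5, 10**7):
    print(f"n={n:>10,d}  sample mean={samples[:n].mean():.1f}")  # keeps growing with n
```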

The original poster is finding that his residuals follow a power law. Depending on how steep the exponent is, it's possible that there is no well-defined mean for his residuals: as he collects more data, his mean will go up in proportion to the number of data points. If he is defining his loss function in terms of the mean of the residuals (or anything along those lines), then gradient descent is going to have some unresolvable[*] problems. If this is true, gradient descent will take his parameters on an exciting adventure through fractal saddles, where there is always a direction that reduces the loss function but makes no improvement to the majority of his data.

This looks to me like what is happening to him.

[*] Unresolvable with the state of the art at the moment AFAICT. I'm going to put this on my PhD research to do list.

1