Submitted by thanderrine t3_zc0kco in MachineLearning
Hey people,
So. I wanted to find out if there is a way to determine when to early-stop a training job.
See, let's say I'm running the job for 100 epochs, and the training-vs-validation accuracy and loss curves flatten out at about 91%, leading to... drumroll... overfitting! (Obviously.)
Now, apart from a dropout layer, I'm using early stopping. But the issue is, I'm kind of concerned that it stops at a local minimum and ends training too soon.
I'm using validation loss BTW for early stopping.
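For reference, the usual way to guard against stopping at a brief plateau is a patience window: only stop after validation loss has failed to improve for several epochs in a row. Here's a minimal framework-agnostic sketch (the `EarlyStopper` class and its parameters are illustrative, not from any particular library; Keras's `EarlyStopping` callback and similar utilities implement the same idea):

```python
# Minimal sketch of patience-based early stopping on validation loss.
# `patience` = how many epochs without improvement to tolerate, so a
# short plateau (e.g. a local minimum) doesn't end training immediately.

class EarlyStopper:
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait after last improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy usage: loss plateaus after the third epoch.
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate([1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73]):
    if stopper.step(loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at)
```

Raising `patience` (and setting a small `min_delta`) makes the stopper more tolerant of noisy or plateauing validation curves.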
_Arsenie_Boca_ t1_iyuu6le wrote
I usually prefer checkpointing over early stopping, i.e. you always save a checkpoint when you get a better validation score. Loss is typically a good indicator, but if you have a more specific measure that you are aiming for (like downstream metrics), you should use that.
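The idea above can be sketched as follows: train for the full budget, but keep a copy of the weights from the epoch with the best validation score and restore those at the end. This plain-Python sketch is illustrative (the `BestCheckpoint` name is made up; in practice Keras's `ModelCheckpoint` with `save_best_only=True` or `torch.save(model.state_dict(), ...)` on improvement plays this role):

```python
# Sketch of "keep the best checkpoint" instead of stopping early:
# training runs to completion, but we remember the state from the
# epoch with the best (lowest) validation score.

import copy

class BestCheckpoint:
    def __init__(self):
        self.best_score = float("inf")
        self.best_state = None
        self.best_epoch = None

    def update(self, epoch, score, state):
        """Save a copy of `state` whenever `score` improves."""
        if score < self.best_score:
            self.best_score = score
            self.best_state = copy.deepcopy(state)
            self.best_epoch = epoch

# Fake training loop: `weights` stands in for model parameters.
tracker = BestCheckpoint()
history = [(1.0, {"w": 1}), (0.6, {"w": 2}), (0.7, {"w": 3}), (0.5, {"w": 4})]
for epoch, (val_loss, weights) in enumerate(history):
    tracker.update(epoch, val_loss, weights)

print("best epoch:", tracker.best_epoch)   # epoch 3, the lowest loss seen
```

This sidesteps the local-minimum worry: even if validation loss later recovers and improves, the better checkpoint simply overwrites the earlier one, and nothing is lost by training longer.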