
rahuldave t1_iz0axya wrote

Seeing a lot of answers here, and the OP has absolutely the right idea in making sure that there is no overfitting whatsoever, and that whatever error is obtained on the initial validation set, and on the later test set, is an UNBIASED, not over-confident estimate of the actual error rate.

BUT the OP is also right that at this rate you will keep eating up more and more data, and that practically this is a problem.

The question you have to ask yourself is: what is the degree of overfitting? What's the over-confidence? And the answer depends on how many things you are comparing. On the initial validation set you are doing a grid search over a large hyper-parameter space. Any end-estimates of error on this set will be wildly overconfident. But if you are comparing 3-4 estimates of the error on the test set to choose the best model class, this is not a large comparison, so the test set is "not so contaminated" by it and can be used for other purposes. So the error estimates on the test set are quite good, even after being used in a (small) model comparison.

A (probably hard to read, as it's embedded in other stuff) explanation can be found in some of my old lecture notes here: https://github.com/AM207/2018spring/blob/661dae036dcb535f9e6dfeb0f115a5ecc16dc123/wiki/testingtraining.md#is-this-still-a-test-set

This is a bit of a hand-waving argument, using Hoeffding's inequality. If you really want to go into more detail on this you need to understand some Vapnik-Chervonenkis theory. These are also very nicely described in this lecture series (see at least 2 and 5, followed by 6 and 7 if you like): https://work.caltech.edu/telecourse#lectures.
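Loosely, the kind of bound at play is standard Hoeffding plus a union bound over the M models you compare on a set of N points (my notation here, following the Caltech lectures, as a sketch rather than a tight statement):

```latex
% Hoeffding for a single, fixed hypothesis h evaluated on N points:
P\big(\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big) \;\le\; 2\, e^{-2\epsilon^2 N}

% If you instead report the best of M hypotheses compared on the same set,
% the union bound loosens this to
P\big(\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big) \;\le\; 2M\, e^{-2\epsilon^2 N}
```

So the over-confidence grows with M, the number of comparisons you make on a set, and shrinks with N, the amount of data in it: huge M on the validation set, tiny M on the test set.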

To finalize: yes, you can compare your model classes on the validation sets. But because of the hyperparameter optimization on them, the actual errors (like MSE) you calculate there will be too optimistic. The test set, because you are only using it to compare "few" model classes, is still reusable for other stuff.

3

killver t1_iz0bw96 wrote

> But because of the hyperparameter optimization on them, the actual errors (like MSE) you calculate will be too optimistic.

This is the only argument for me to have a separate test dataset: that you can make a more unbiased statement regarding accuracy. But I can promise you that no practitioner or researcher will set this test dataset apart and not make a decision on it, even if only subconsciously, which again biases it.

I think the better strategy is to focus on not making too optimistic statements based on k-fold validation scores, such as not doing automatic early stopping, not doing automatic learning-rate scheduling, etc. The goal is to always select only hyperparameters that are optimal across all folds, rather than optimal separately per fold.
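To make that concrete, here is a minimal scikit-learn sketch of what I mean (placeholder data and candidate grid, not a recipe):

```python
# Minimal sketch: pick the hyperparameter setting with the best *mean* score
# across all folds, instead of tuning anything separately inside each fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
cv = KFold(n_splits=5, shuffle=True, random_state=0)       # same folds for every candidate

candidates = [{"max_depth": d, "n_estimators": 200} for d in (3, 5, 8, None)]

scores = []
for params in candidates:
    model = RandomForestClassifier(random_state=0, **params)
    # one aggregate number per candidate: the mean over all folds
    scores.append(cross_val_score(model, X, y, cv=cv).mean())

best = candidates[int(np.argmax(scores))]
print("chosen across all folds:", best)
```

The point being: there is one decision per hyperparameter setting, made on the aggregate of all folds, not one decision per fold.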

2

rahuldave t1_iz0cb9o wrote

100% agree with you on both points.

On the first one, the biasing of the test set, my point is: don't worry about it, the bias is minor.

On the second, YES to all you said. You WILL be overfitting otherwise. I like to think of machine learning, philosophically, this way: it is not optimization. Find something "good enough" and it is more likely your generalizability is safe, rather than hunting for the tippy-top optimum....

1

Visual-Arm-7375 OP t1_iz2fh6e wrote

Thank you very much for the answer!

One question: what do you mean in this context by estimates? Hyperparameters?

>But if you are comparing 3-4 estimates of the error on the test set to choose the best model class this is not a large comparison, and so the test set is "not so contaminated" by this comparison, and can be used for other purposes.

Could you explain this in another way pls, I'm not sure I am understanding it :(

1

rahuldave t1_iz4tnr9 wrote

Sure! My point is that the number of comparisons you make on a set affects the amount of overfitting you will encounter. Let's look at the sets: (a) training: you are comparing ALL the model parameters from TONS of models on this set, infinitely many really, because of the calculus-driven optimization process. (b) validation: you are comparing far fewer here, maybe a 10x10x10 hyper-parameter grid, so the overfitting potential is less. (c) test: maybe only the best-fit random forest against the best-fit gradient boosting, so 2 comparisons, so even less overfitting.

But how much? Well, that depends on your amount of data. The less data you have, the more likely you are to overfit to a given set. This is the same reason we use cross-validation for smaller datasets, but in the neural net or recommendations space, with tons of data, we only use a validation set. And these sets are huge, maybe 200000 images or a similar number of data points about customers. So now you don't overfit too much even if you compared 1000 points on a hyper-parameter grid.

So the point is you will always overfit some on the validation set, and extremely little on the test set. If you have very little data, you want this extra test set. I know, it's a curse: less data, and I am asking you to split it even more. But think of it like this: with less data to train on, your training process will pick a more conservative model (less max-depth for trees, for example). So it's not all bad.

But if you have lots of data and a large validation set, you can be a bit of a cowboy: pick your hyperparameters and choose the best model amongst model classes on the validation set...
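To make the "many comparisons on validation, few on test" idea concrete, here is a rough sklearn-style sketch (the data, grids, and model choices are just placeholders, not a prescription):

```python
# Rough sketch: many comparisons happen on the (cross-)validation folds via grid search,
# only a couple of comparisons happen on the held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)   # placeholder data
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grids = {
    "rf": (RandomForestClassifier(random_state=0),
           {"max_depth": [3, 5, 8], "n_estimators": [100, 300]}),
    "gb": (GradientBoostingClassifier(random_state=0),
           {"learning_rate": [0.03, 0.1], "n_estimators": [100, 300]}),
}

# Many comparisons: the whole hyper-parameter grid is evaluated on validation folds.
best_per_class = {}
for name, (model, grid) in grids.items():
    search = GridSearchCV(model, grid, cv=5).fit(X_trval, y_trval)
    best_per_class[name] = search.best_estimator_

# Few comparisons: only the best model per class ever touches the test set.
test_scores = {name: est.score(X_test, y_test) for name, est in best_per_class.items()}
print(test_scores)
```

Because only two models are compared on the test set, the winning test score is still a nearly unbiased estimate of generalization error.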

2

Visual-Arm-7375 OP t1_iz5inba wrote

Thanks for the answer! I don't understand the separation you are doing between training and validation. Didn't we have train/test, and we applied CV to the train? The validation sets would be 1 fold at each CV iteration. What am I not understanding here?

1

rahuldave t1_iz5lmbz wrote

You don't always cross-validate! Yes, sometimes after you do the train-test split you will use something like GridSearchCV in sklearn to cross-validate. But think of having to do 5-fold cross-validation for a large NN model taking 10 days to train... you've now spent 50 days! So there you take the remaining training set, after the test was left out (if you left a test out), and split it into a smaller training set and a single validation set.
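Something like this (a minimal sketch with made-up sizes; your framework's own validation-split options would do the same thing):

```python
# Minimal sketch: a single train/validation/test split instead of k-fold CV,
# for models that are too expensive to train 5 times.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, random_state=0)  # placeholder data

# first carve out the test set, then split what remains into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# roughly 60% train, 20% validation, 20% test overall; each expensive model is trained once,
# hyperparameters are compared on (X_val, y_val), and (X_test, y_test) is touched only at the end
print(len(X_train), len(X_val), len(X_test))
```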

1

Visual-Arm-7375 OP t1_iz5iyf3 wrote

And why does the overfitting depend on the number of comparisons? Isn't overfitting something related to each model separately?

1

rahuldave t1_iz5l8ee wrote

Each model can individually overfit to the training set. For example, imagine 30 data points fit via a 30th-order polynomial, or anything with 30 parameters. You will overfit because you are using too complex a model. Here the overfitting is directly related to the data size, and it came about because you chose too complex a model.

In a sense you can think of a more complex model as having more wiggles, or more ways to achieve a given value. And when you try to disambiguate these more complex ways from just a little data, you can't help but particularize to the data.

But the same problem happens on the validation set. Suppose I have 1000 grid points in hyperparameter space to compare, but just a little bit of data, say again 30 points. You should feel a sense of discomfort: an idiosyncratic choice of 30 points may well give you the "wrong" answer, wrong in the sense of generalizing poorly.

So the first kind of overfitting, which we do hyper-parameter optimization on the validation set to avoid, happens on the training set. But the second kind happens on the validation set, or on any set you compare many, many models on. This happens a lot on the public leaderboard in Kaggle, especially if you didn't create your own validation set in advance..

(One way, btw, to think of this is that if I try enough combinations of parameters, one of them will look good on the data I have, and this is far more likely if the data is smaller, because I don't have to go through so many combinations..)
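If you want to see that effect numerically, here is a toy simulation (completely made-up setup: every "model" is a coin-flip predictor, so none of them can truly generalize):

```python
# Toy simulation: compare many "models" on a tiny validation set.
# Every model here is pure noise, so its true accuracy is 50%,
# yet the winner of the comparison looks much better than 50% on that small set.
import numpy as np

rng = np.random.default_rng(0)
n_val, n_fresh, n_models = 30, 10_000, 1000

y_val = rng.integers(0, 2, n_val)      # tiny validation labels
y_fresh = rng.integers(0, 2, n_fresh)  # a large fresh set

pred_val = rng.integers(0, 2, (n_models, n_val))      # each row: one model's predictions
pred_fresh = rng.integers(0, 2, (n_models, n_fresh))

val_acc = (pred_val == y_val).mean(axis=1)
best = int(np.argmax(val_acc))

print("winning model's validation accuracy:", val_acc[best])              # inflated, around 0.7
print("same model on fresh data:", (pred_fresh[best] == y_fresh).mean())  # back to about 0.5
```

With 30 points the winner looks great by sheer luck; with a couple of hundred thousand points the same comparison would barely inflate the score at all.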

2