Submitted by Visual-Arm-7375 t3_zd6a6j in MachineLearning
I'm trying different ML models on a dataset of approximately 1000 data points. I would like to evaluate the performance of different families of models (logistic regression, random forest, etc.) and select one of them as the best model to put into production.
While discussing how to implement the models, the following question came up.
My approach would be the following (a rough code sketch of these steps comes after the list):
1 - Split train/test
2 - Select the best possible model within each family using cross-validation on the training data (i.e., select the hyperparameters).
3 - Train that model on the whole training set and evaluate its performance with some metric.
4 - Now evaluate performance on the test set (with the model fitted on all the training data) using the same metric as in step 3. Comparing training and test results shows whether the model is overfitting, and the test metric is what you use to select the best model among all the classes of models (kNN, random forest, logistic regression, etc.).
5 - Once we have selected the model, use all the data (train + test) to fit the final model that goes into production.
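For concreteness, something like the sketch below, where the dataset, the candidate families, the search grids and the ROC-AUC metric are placeholders rather than my actual setup:

```python
# Rough sketch of steps 1-5 with scikit-learn. Dataset, candidate families,
# grids and ROC-AUC metric are placeholders, not the actual setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real data

# 1 - train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 300]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # 2 - hyperparameter selection via cross-validation on the training data
    search = GridSearchCV(model, grid, cv=5, scoring="roc_auc")
    # 3 - GridSearchCV refits the best configuration on the whole training set
    search.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, search.predict_proba(X_train)[:, 1])
    # 4 - evaluate on the held-out test set with the same metric
    test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
    results[name] = (train_auc, test_auc, search.best_params_)

print(results)
# 5 - refit the selected family (with its chosen hyperparameters) on all the
#     data, e.g. final_model.fit(X, y), before putting it into production
```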
The question is whether it is really necessary to (1) train/test split the data, instead of (2) applying cross-validation to all the data for model comparison.
The problem with (1) is that you lose lots of data points to the test split, which matters here since we only have approximately 1000 data points.
With (2), you are comparing the models with the same metric you used to select the hyperparameters. Is this problematic? It also doesn't seem valid for checking whether the methods are overfitting, since you never see how the algorithm performs on new, unseen data. Is this right?
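In code, I picture option (2) as something like this (grids and metric are again placeholders); the number compared across families is best_score_, which is exactly the score used to pick the hyperparameters:

```python
# Rough sketch of option (2): tune and compare on the same cross-validation,
# using all ~1000 points and no held-out test set. Grids and metric are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real data

candidates = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 300]}),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="roc_auc").fit(X, y)
    # best_score_ is both the tuning criterion and the comparison metric here
    print(name, search.best_score_, search.best_params_)
```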
So which approach should it be: train/test split with the steps I defined above, or just cross-validation on all the data, comparing the results for each method?
Thank you!
MUSEy69 t1_iyzzbxr wrote
Hi, you should always keep an independent test split, and do whatever you want with the rest, e.g. cross-validation (see the visual reference in the sklearn docs).
Why are you losing lots of data points to the test split? The idea is that the train and test distributions match, and you can use a p-value criterion to check this.
If you want to test lots of models, try Optuna for finding the best hyperparameters. There's no problem using the same metric; that's the one you care about in the end.
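One way to read that p-value criterion is a per-feature two-sample Kolmogorov-Smirnov test between the splits; this is just one possible sketch of the idea, with a placeholder dataset:

```python
# Sketch of a distribution check between train and test splits: a two-sample
# Kolmogorov-Smirnov test per feature. Placeholder dataset; one of several
# possible ways to apply a p-value criterion.
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for j in range(X.shape[1]):
    stat, p = ks_2samp(X_train[:, j], X_test[:, j])
    if p < 0.05:  # small p-value: this feature's train/test distributions look different
        print(f"feature {j}: KS statistic={stat:.3f}, p-value={p:.3f}")
```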
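A minimal Optuna sketch, scoring each trial by cross-validation on the training split (the search space, model and metric are only illustrative):

```python
# Minimal Optuna sketch: each trial proposes hyperparameters and is scored by
# cross-validation on the training split. Search space, model and metric are
# illustrative placeholders.
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 500),
        max_depth=trial.suggest_int("max_depth", 2, 16),
        random_state=0,
    )
    return cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```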
Depending on your domain I would skip step 5, because keeping the test set aside lets you test for distribution shifts, and even compare new models against the current one over time.