
killver t1_iz00wp1 wrote

Having a separate test dataset is useless, and you just waste available data. Just do proper cross-validation, evaluate on all folds, and you are good to go.
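A minimal sketch of what I mean, assuming scikit-learn; the dataset and model here are only placeholders, swap in your own:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data; replace with your own X, y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation: every sample is used for training in 4 folds
# and for evaluation in exactly 1 fold, so no data is "wasted" on a test set.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```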

−3

Visual-Arm-7375 OP t1_iz016u8 wrote

Okay, I got two answers: one says to always separate a test set, the other says it is useless :(

1

killver t1_iz01lwc wrote

Well, you already answered it yourself. Why would you need a separate test dataset? It is just another validation dataset, and you already have five of those in the case of 5-fold cross-validation.

The only important thing is that you optimize your hyperparameters so that they are best across all folds.

The real test data is your future production data, where you apply your predictions.
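A hedged sketch of what "best across all folds" looks like in practice, again assuming scikit-learn; the parameter grid is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data; replace with your own X, y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each candidate is scored on all 5 validation folds; the winner is the one
# with the best mean score across folds, not on any single holdout.
param_grid = {"max_depth": [4, 8, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```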

1

Visual-Arm-7375 OP t1_iz01vfm wrote

Yeah, I understand that. But what's the point of separating out a test set then? You are using cross-validation to select the hyperparameters, but you are not seeing how they work on new data...

1

killver t1_iz023cj wrote

The validation data is new data. You are obviously not training on it.

Test data, in your definition, would be just another validation dataset.

2

Visual-Arm-7375 OP t1_iz02byy wrote

Not really, because you are taking the average across all the folds, so at some point each validation split is used for training in the other folds, which would not happen with a test split.

1

killver t1_iz02ql9 wrote

I think you are misunderstanding it. Each validation fold is always a separate holdout dataset, so when you evaluate your model on it, you are not training on it. Why would it be a problem to train on that fold when evaluating a different validation holdout?

Actually, your point 5 is also what you can do in the end for the production model, to make use of all the data.

The main goal of cross-validation is to find hyperparameters that make your model generalize well.

If you take a look at papers or Kaggle, you will never find someone keeping both validation and test data locally. The test data is usually the real production data, or the data you compare models on. But you make decisions on your local cross-validation to find a model that can generalize well on unseen test data (data that is not in your current possession).
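A rough sketch of the out-of-fold idea plus the final refit on all data for production, assuming scikit-learn; data and model are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Placeholder data; replace with your own X, y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof_pred = cross_val_predict(model, X, y, cv=5)
print("OOF accuracy:", accuracy_score(y, oof_pred))

# Once the hyperparameters are fixed, refit on ALL data for production use.
production_model = model.fit(X, y)
```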

1

Visual-Arm-7375 OP t1_iz03bcf wrote

Mmmm okay. But imagine you have 1000 data points and you want to compare a random forest and a DNN and select the best one to put into production. How would you do it?

1

killver t1_iz03hvr wrote

Do a 5-fold cross-validation, train both models 5 times (once per fold), and compare the out-of-fold (OOF) scores.

And of course optimize hyperparameters for each model type.
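For example, something like the sketch below (an sklearn MLP stands in for the DNN; in practice you would tune each model's hyperparameters first, and the data is a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data; replace with your own X, y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(random_state=0),
    "mlp": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64, 32),
                                       max_iter=1000, random_state=0)),
}

# Same 5 folds for both models; pick the one with the better mean OOF score.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, scores.mean())
```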

1

Visual-Arm-7375 OP t1_iz02j8x wrote

I don't know if you see what I mean: how can you test whether the hyperparameters are overfitting if you have selected them as the ones that maximize the mean accuracy across all the folds?

1

killver t1_iz02xc3 wrote

Other question: how can hyperparameters overfit on validation data, if it is a correct holdout set?

In your definition, if you make the decision on another local test holdout, the setting is exactly the same, no difference. And if you do not make a decision on this test dataset, why do you need it?

The important thing is that your split is not leaky and represents the unseen test data well.
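A small sketch of one thing "not leaky" can mean in practice, assuming grouped data (e.g. several rows per customer, a hypothetical `groups` array); GroupKFold keeps each group entirely on one side of the split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Placeholder data; replace with your own X, y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hypothetical group ids (e.g. customer ids); rows from one group must not
# appear in both the training folds and the validation fold.
groups = np.random.default_rng(0).integers(0, 100, size=len(y))

scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=GroupKFold(n_splits=5), groups=groups)
print(scores.mean())
```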

1

Visual-Arm-7375 OP t1_iz03pqj wrote

It is not validation data, it is test data. The test data is not one of the folds you average to get the mean accuracy in the cross-validation. That is where you can see how the model generalizes with the hyperparameters selected by the cross-validation.

1

killver t1_iz03u95 wrote

And then what?

1

Visual-Arm-7375 OP t1_iz04glk wrote

1

killver t1_iz04uyh wrote

Look - I will not read through a random blog now. Either you believe me and try to think it through critically, or you have already made up your mind anyway, in which case you should not have asked.

I will add a final remark.

If you make another decision (whether it generalizes well or not) on your holdout test dataset, you are basically just making yet another decision on it. If it does not generalize, what do you do next? Change your hyperparameters so that it works better on this test set?

How is that different from making this decision on your validation data?

The terms validation and test data are mixed up a lot in the literature. In principle, the test dataset as you define it is just another validation dataset. And you can be more robust by simply using multiple validation datasets, which is exactly what k-fold does. You do not need this extra test dataset.

If you feel better doing it, go ahead. It is not "wrong" - just not necessary, and you lose training data.

1

Visual-Arm-7375 OP t1_iz065iq wrote

I don't have a clear opinion; I'm trying to learn. I'm proposing a situation and you're not listening. You are evaluating the performance of the model with the same accuracy you used to select the hyperparameters, and that does not make sense to me.

Anyway, thank you for your help, really appreciate it.

1

killver t1_iz06mz6 wrote

Maybe that's your confusion: reporting a raw accuracy score that you communicate, vs. finding and selecting hyperparameters/models. Your original post asked about model comparison.

Anyway, I suggest you take a look at how research papers do it, and also browse through Kaggle solutions. People almost always do local cross-validation, and the actual production data is the test set (e.g. ImageNet, the Kaggle leaderboard, business production data, etc.).

1

rahuldave t1_iz4u6o4 wrote

Many Kaggle competitions have public and private leaderboards, and you are strongly advised to carve out your own validation set from the training data they give you in order to choose your best model before comparing on the public leaderboard. There have been times when people fit to the public leaderboard, but this can be checked with adversarial validation and the like. If you like this kind of stuff, both Abhishek Thakur's and Konrad Banachewicz's books are really nice...
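A rough sketch of the adversarial-validation idea mentioned above, with placeholder "train" and "test" feature matrices: train a classifier to tell train rows from test rows; if its AUC is well above 0.5, the two sets differ and your local validation may not reflect the leaderboard.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder "train" and "test" feature matrices; swap in your own.
X_train, _ = make_classification(n_samples=800, n_features=20, random_state=0)
X_test, _ = make_classification(n_samples=200, n_features=20, random_state=1)

# Label each row by its origin and see how well a model can separate them.
X_all = np.vstack([X_train, X_test])
is_test = np.concatenate([np.zeros(len(X_train), dtype=int),
                          np.ones(len(X_test), dtype=int)])

auc = cross_val_score(GradientBoostingClassifier(random_state=0),
                      X_all, is_test, cv=5, scoring="roc_auc").mean()
print("adversarial AUC:", auc)  # ~0.5 means train and test look alike
```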

0