Submitted by Visual-Arm-7375 t3_zd6a6j in MachineLearning
I'm trying different ML models on a dataset of approximately 1000 data points. I would like to evaluate the performance of different families of models (logistic regression, random forest, etc.) and select one of them as the best model to put into production.
While discussing how to implement the models, the following question came up.
My approach would be the following (a rough code sketch of these steps comes after the list):
1 - Split train/test
2 - Select the best possible model within each family using cross-validation on the training data (i.e., select the hyperparameters).
3 - Train that model on the whole training set and evaluate its performance with some metric.
4 - Now evaluate performance on the test set (with the model fitted on all the training data) using the same metric as in step 3. Comparing training and test results shows whether the model is overfitting, and the test metric is what you use to select the best model among all the classes of models (kNN, random forest, logistic regression, etc.).
5 - Once we have selected the model, use all the data (train + test) to fit the final model that goes into production.
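For concreteness, something like the sketch below, where the dataset, the candidate families, the search grids and the ROC-AUC metric are placeholders rather than my actual setup:

```python
# Rough sketch of steps 1-5 with scikit-learn. Dataset, candidate families,
# grids and ROC-AUC metric are placeholders, not the actual setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real data

# 1 - train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 300]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # 2 - hyperparameter selection via cross-validation on the training data
    search = GridSearchCV(model, grid, cv=5, scoring="roc_auc")
    # 3 - GridSearchCV refits the best configuration on the whole training set
    search.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, search.predict_proba(X_train)[:, 1])
    # 4 - evaluate on the held-out test set with the same metric
    test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
    results[name] = (train_auc, test_auc, search.best_params_)

print(results)
# 5 - refit the selected family (with its chosen hyperparameters) on all the
#     data, e.g. final_model.fit(X, y), before putting it into production
```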
The question is whether it is really necessary to (1) train/test split the data, instead of (2) applying cross-validation to all the data for model comparison.
The problem with (1) is that you lose lots of data points to the test split, which matters here since we only have approximately 1000 data points.
With (2), you are comparing the models with the same metric you used to select the hyperparameters. Is this problematic? It also doesn't seem valid for checking whether the methods are overfitting, since you never see how the algorithm performs on new, unseen data. Is this right?
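In code, I picture option (2) as something like this (grids and metric are again placeholders); the number compared across families is best_score_, which is exactly the score used to pick the hyperparameters:

```python
# Rough sketch of option (2): tune and compare on the same cross-validation,
# using all ~1000 points and no held-out test set. Grids and metric are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real data

candidates = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 300]}),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="roc_auc").fit(X, y)
    # best_score_ is both the tuning criterion and the comparison metric here
    print(name, search.best_score_, search.best_params_)
```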
So which approach should it be: train/test split with the steps I defined above, or just cross-validation on all the data, comparing the results for each method?
Thank you!
MUSEy69 t1_iyzzbxr wrote
Hi, you should always keep an independent test split, and do whatever you want with the rest, e.g. cross-validation (see the visual reference in the sklearn docs).
Why are you losing lots of data points to the test split? The idea is that the train and test distributions match, and you can use a p-value criterion to check this.
If you want to test lots of models, try Optuna for finding the best hyperparameters. There's no problem using the same metric; that's the one you care about in the end.
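One way to read that p-value criterion is a per-feature two-sample Kolmogorov-Smirnov test between the splits; this is just one possible sketch of the idea, with a placeholder dataset:

```python
# Sketch of a distribution check between train and test splits: a two-sample
# Kolmogorov-Smirnov test per feature. Placeholder dataset; one of several
# possible ways to apply a p-value criterion.
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for j in range(X.shape[1]):
    stat, p = ks_2samp(X_train[:, j], X_test[:, j])
    if p < 0.05:  # small p-value: this feature's train/test distributions look different
        print(f"feature {j}: KS statistic={stat:.3f}, p-value={p:.3f}")
```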
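A minimal Optuna sketch, scoring each trial by cross-validation on the training split (the search space, model and metric are only illustrative):

```python
# Minimal Optuna sketch: each trial proposes hyperparameters and is scored by
# cross-validation on the training split. Search space, model and metric are
# illustrative placeholders.
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 500),
        max_depth=trial.suggest_int("max_depth", 2, 16),
        random_state=0,
    )
    return cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```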
Depending on your domain I would skip step 5, because keeping the test set aside lets you test for distribution shifts, and even compare new models against the current one over time.