
osedao OP t1_j9v4ip5 wrote

Yeah, that makes sense, to test models on folds they've never seen. But I have a small dataset, so I'm trying to find the best practice

1

Additional-Escape498 t1_j9vqmlh wrote

For a small dataset, still use cross-validation, but use k-fold cross-validation so you don't divide the dataset into 3 parts, just into 2, and then the k-fold subdivides the training set. Sklearn already has a class built for this that makes it simple. Since you have a small dataset and are using fairly simple models, I'd suggest setting k >= 10.
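Something like this minimal sketch (iris and LogisticRegression are just placeholders, not your actual data/model):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in for a small dataset

# Split into 2 parts only: train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# k-fold CV subdivides the training set; k >= 10 for small datasets.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=10)
print(scores.mean(), scores.std())

# Final check on the untouched test set.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```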

3

osedao OP t1_j9wa0a1 wrote

Thanks for the recommendations! I’ll try this

2

BrohammerOK t1_j9wvrl7 wrote

You can work with 2 splits, which is common practice. For a small dataset, you can run 5- or 10-fold cross-validation with shuffling on 75-80% of the dataset (the training set) for hyperparameter tuning / model selection, fit the best model on the entirety of that set, and then evaluate/test on the remaining 20-25% that you held out. You can repeat the process multiple times with different seeds to get a better estimate of the expected performance, assuming the data you see at inference time comes from the same distribution as your dataset. Rough sketch below.
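Here's roughly what I mean, sketched with placeholder data (breast_cancer) and a toy SVC parameter grid:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
param_grid = {"C": [0.1, 1, 10]}            # toy grid

test_scores = []
for seed in range(5):  # repeat with different seeds for a stabler estimate
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # 5-fold CV with shuffling on the 80% train split for model selection.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(SVC(), param_grid, cv=cv)
    # refit=True (the default) retrains the best model on all of X_train.
    search.fit(X_train, y_train)
    # Evaluate on the 20% that was held out of tuning entirely.
    test_scores.append(search.score(X_test, y_test))

print(np.mean(test_scores), np.std(test_scores))
```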

1

BrohammerOK t1_j9ww6yx wrote

If you wanna use something like early stopping, though, you'll have no choice but to use 3 splits.
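i.e. something like this (placeholder data; roughly a 60/20/20 split), where early stopping watches the validation set and the test set stays untouched until the very end:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(200, 5), np.random.randint(0, 2, 200)  # placeholder data

# First carve out the test set, then split the rest into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)
# Early stopping monitors (X_val, y_val); (X_test, y_test) is only for the final score.
```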

1