Maximum-Ruin-9590 t1_j9v03zg wrote

As mentioned, you need validation sets, i.e. some kind of folds, for most things in ML: cross-validation and hyperparameter tuning, to name a couple. It is also smart to use folds to compare different models with each other.

2

osedao OP t1_j9v4ip5 wrote

Yeah, that makes sense, testing models on folds they've never seen. But I have a small dataset, so I'm trying to find the best practice.

1

Additional-Escape498 t1_j9vqmlh wrote

For a small dataset, still use cross-validation, but use k-fold cross-validation so you don't divide the dataset into 3 parts, just into 2, and then the k-fold subdivides the training set. Sklearn already has a class built in that makes this simple. Since you have a small dataset and are using fairly simple models, I'd suggest setting k >= 10.
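A minimal sketch of this with scikit-learn's `KFold` and `cross_val_score` (the dataset and model here are just placeholders, not from the thread):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in for a small dataset
X, y = load_iris(return_X_y=True)

# k >= 10 as suggested above for a small dataset with simple models
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores.mean(), scores.std())
```

Each of the 10 scores comes from fitting on 9 folds and scoring on the held-out fold, so the mean is an estimate of generalization performance without sacrificing a separate validation split.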

3

osedao OP t1_j9wa0a1 wrote

Thanks for the recommendations! I’ll try this

2

BrohammerOK t1_j9wvrl7 wrote

You can work with 2 splits, which is a common practice. For a small dataset you can use 5- or 10-fold cross-validation with shuffling on 75-80% of the dataset (train) for hyperparameter tuning / model selection, fit the best model on the entirety of that set, and then evaluate/test on the remaining 25-20% that you held out. You can repeat the process multiple times with different seeds to get a better estimate of the expected performance, assuming that the input data at inference time comes from the same distribution as your dataset.
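That workflow maps onto `train_test_split` plus `GridSearchCV`, which by default refits the best model on the whole training portion. A sketch under assumed placeholders (dataset, pipeline, and parameter grid are illustrative, not from the thread):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Hold out 25% for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# 5-fold CV with shuffling on the training portion for hyperparameter tuning
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=cv,
)
# refit=True (the default) refits the best model on all of X_train
search.fit(X_train, y_train)

# Final evaluation on the untouched 25% holdout
test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```

Repeating this with different `random_state` seeds and averaging the test scores gives the more stable estimate the comment describes.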

1

BrohammerOK t1_j9ww6yx wrote

If you wanna use something like early stopping, though, you'll have no choice but to use 3 splits.
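The three-way split can be built from two calls to `train_test_split`; the middle (validation) piece is what an early-stopping hook would monitor. The split percentages here are just an example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # placeholder dataset

# First carve out the test set (20% overall)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

# Split the remainder into train (60% overall) and validation (20% overall);
# 0.25 of the remaining 80% is 20% of the full dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp)

# X_val / y_val would be passed to the training framework's early-stopping
# mechanism (e.g. an eval_set or validation_data argument); X_test stays
# untouched until the final evaluation.
print(len(X_train), len(X_val), len(X_test))
```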

1