Submitted by twocupv60 t3_xvem36 in MachineLearning
suflaj t1_ir1ejki wrote
You should probably reduce your dataset size first and tune hyperparameters on that smaller subset.
What I would do is start with 100 randomly sampled examples and train fully on those. Then double the subset with the same hyperparameters and see how the performance changes. You want to stop when the performance no longer changes significantly after doubling the data.
How much is significant? Personally, I would stop when doubling the data no longer halves the test error. But that criterion is arbitrary, so YMMV; adjust it based on how fast performance improves. Think of what performance would be acceptable to an average person who is neither stupid nor informed enough to know your model could be much better. You just need enough data for your hyperparameters to be representative.
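A rough sketch of that doubling loop, assuming a hypothetical `train_and_evaluate(subset)` that trains a fresh model with fixed hyperparameters and returns the test error:

```python
import random

def find_sufficient_subset(dataset, train_and_evaluate, start=100):
    """Double the subset size until doubling no longer halves the test error."""
    shuffled = random.sample(dataset, len(dataset))  # random order, no duplicates
    size = start
    prev_error = train_and_evaluate(shuffled[:size])
    while size * 2 <= len(shuffled):
        size *= 2
        error = train_and_evaluate(shuffled[:size])
        # Stop once doubling the data no longer halves the test error
        # (the arbitrary criterion above -- adjust it to your taste).
        if error > prev_error / 2:
            return shuffled[:size], error
        prev_error = error
    return shuffled, prev_error
```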
If you do not know how to tune that, try clustering your data strictly. For example, if you have text, you could split it into 2-grams, use MinHashes, and then set the cluster threshold at 1% similarity. This will give you very few clusters, from each of which you can pick a representative to use as a sample for your dev set.
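Here's a sketch of that clustering step using the `datasketch` library (my choice; any MinHash/LSH implementation would do), with word 2-grams and the 1% similarity threshold:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    """Build a MinHash signature from the word 2-grams of a text."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for gram in zip(words, words[1:]):
        m.update(" ".join(gram).encode("utf8"))
    return m

def pick_representatives(texts, threshold=0.01, num_perm=128):
    """Greedily cluster texts whose estimated Jaccard similarity exceeds
    the threshold; keep the first member of each cluster as its representative."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    representatives = []
    for i, text in enumerate(texts):
        m = minhash_of(text, num_perm)
        if not lsh.query(m):       # no existing cluster is ~1%+ similar
            lsh.insert(str(i), m)  # this text founds a new cluster
            representatives.append(text)
    return representatives
```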
Once you reach those diminishing returns, search your hyperparameters randomly within a distribution and then train with the best hyperparameters on the full dataset. Depending on the network, the diminishing-returns point will be anywhere from ~1k samples (CV ResNets) to ~100k samples (finetuning transformers).
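A minimal random-search sketch, again assuming the same hypothetical `train_and_evaluate` extended to accept hyperparameters; the distributions below are illustrative, not a recommendation:

```python
import random

def random_search(subset, train_and_evaluate, n_trials=30):
    """Randomly sample hyperparameters from distributions and keep the best."""
    best_error, best_hparams = float("inf"), None
    for _ in range(n_trials):
        hparams = {
            "lr": 10 ** random.uniform(-5, -2),  # log-uniform over [1e-5, 1e-2]
            "batch_size": random.choice([16, 32, 64, 128]),
            "weight_decay": 10 ** random.uniform(-6, -2),
        }
        error = train_and_evaluate(subset, **hparams)
        if error < best_error:
            best_error, best_hparams = error, hparams
    return best_hparams, best_error
```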