Submitted by pgao_aquarium t3_11l4xo0 in MachineLearning
I have a relatively contrarian take that most deep learning applications are not super different from each other.
It feels like in traditional software engineering, there is at least a set of well-known best practices that have been accumulated over time as people figured out what works and doesn't work. For example, most teams have some sort of CI/CD flow and a hosted version control system. People can ignore this, but they do so at their own risk.
In applied deep learning (typical supervised tasks on text / imagery / etc), I've seen a lot of industry ML teams spin their wheels when they get to the stage of improving their model performance, instead of following a more disciplined workflow that, in my opinion, more reliably produces results. I think we're all still figuring out what best practices around ML development should look like, but here's my opinionated contribution on that front. Feedback welcome!
https://www.aquariumlearning.com/blog-posts/to-make-your-model-better-first-figure-out-whats-wrong
KD_A t1_jbb5kx5 wrote
The section "Check if your model is overfitting" could be improved.
> The model is overfitting (high variance) when it has low error on the training set but high error on the test set.
A big gap between training and validation error does not by itself imply that the model is overfitting. In general, the absolute gap between training and validation error does not tell you how validation error will change if the model is made more or less complex. To answer questions about overfitting and underfitting, you need to train multiple models and compare their training and validation errors.
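Here's a minimal sketch of what I mean, assuming scikit-learn and a small synthetic tabular dataset as stand-ins: sweep a single complexity knob, train a model at each setting, and compare training vs. validation error across the sweep rather than reading over/underfitting off one train/validation gap.

    # Sketch: diagnose over/underfitting by comparing models of varying complexity
    # (scikit-learn and make_classification are stand-ins for your own stack/data).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import validation_curve

    X, y = make_classification(n_samples=2000, random_state=0)

    # Sweep a complexity knob (here, tree depth) and record CV scores at each value.
    depths = [2, 4, 8, 16, 32]
    train_scores, val_scores = validation_curve(
        RandomForestClassifier(random_state=0),
        X, y,
        param_name="max_depth",
        param_range=depths,
        cv=5,
        scoring="accuracy",
    )

    for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print(f"max_depth={d:2d}  train_acc={tr:.3f}  val_acc={va:.3f}")
    # Overfitting shows up as validation accuracy peaking and then declining while
    # training accuracy keeps climbing -- not as any particular absolute gap.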
> Overfitting and underfitting is easy to detect by visualizing loss curves during training.
nit: this caption is phrased too liberally. The graph only answers one question: given this model architecture, optimizer, and dataset, which epoch/checkpoint should I select? It tells you nothing about the other factors that modulate model complexity.
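To make that concrete, here's a tiny sketch of the decision a loss curve actually supports (the per-epoch losses below are hypothetical placeholders, not real runs):

    # Given recorded train/val losses for one fixed architecture/optimizer/dataset,
    # the curve tells you which checkpoint to keep -- nothing more.
    train_loss = [0.92, 0.61, 0.44, 0.33, 0.25, 0.19, 0.15]
    val_loss   = [0.95, 0.70, 0.58, 0.54, 0.55, 0.59, 0.66]

    best_epoch = min(range(len(val_loss)), key=lambda e: val_loss[e])
    print(f"select checkpoint from epoch {best_epoch} (val loss {val_loss[best_epoch]:.2f})")
    # It does not say whether a wider model, different regularization, or more data
    # would do better; answering that requires training and comparing other models.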
> This often means that the training set is not representative of the domain it is supposed to run in.
I wouldn't call this a variance issue per se. If it were a variance issue, sampling more data from the training distribution should significantly lower validation error. If the training distribution is biased, sampling more of it will not help a whole lot.
That all being said, I share your passion for greater standardization of ML workflow. And I agree that there needs to be more work on diagnosing problems, and less "throwing stuff at the wall". To add something, I now typically run learning curves. They can cost quite a bit when training big NNs. But even a low-resolution curve can give a short-term answer to an important question: how much should I expect this model to improve if I train it on *n* more observations? And assuming you have a decent sense of your model's capacity, this question is closely related to another common one: should I prioritize collecting more data, or should I make a modeling intervention? Learning curves have motivated big improvements in my experience.
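A minimal sketch of a low-resolution learning curve, assuming scikit-learn with a small classifier standing in for a big NN: train on growing subsets of the data and watch how validation performance trends to estimate the value of collecting more data.

    # Sketch: low-resolution learning curve (5 points) to decide between
    # "collect more data" and "make a modeling intervention".
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=5000, random_state=0)

    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000),
        X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),  # few points keeps the cost manageable
        cv=5,
        scoring="accuracy",
    )

    for n, va in zip(sizes, val_scores.mean(axis=1)):
        print(f"n={n:5d}  val_acc={va:.3f}")
    # If validation accuracy is still climbing at the full training size, more data
    # is likely to help; if it has flattened, a modeling intervention is the better bet.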