Submitted by groman434 t3_103694n in MachineLearning
groman434 OP t1_j2xv6xl wrote
Reply to comment by junetwentyfirst2020 in [Discussion] If ML is based on data generated by humans, can it truly outperform humans? by groman434
When I put some thought into my question (yes, I know, I should have done that before posting), I realised that what I was really interested in is how training in general, and training set imperfections in particular, affect a model’s performance. For instance, if a training set is 90% accurate, how and why can a model trained on it end up more than 90% accurate? And what kinds of errors in the training set can the model correct?
junetwentyfirst2020 t1_j2xxhii wrote
That’s not an easy question to answer, because the 90% that are correct may be super easy to fit, while the 10% that are wrong may be unfittable and just keep the loss high without really impacting the model. On the other hand, since models tend to be very over-parameterized, that 10% could very well be “fit” and have an outsized impact on the model. It could also be the case that the model simply inherits the errors and loses roughly that 10% of accuracy.
I’ve never seen a definitive theoretical answer, since deep learning models are over-parameterized, but I have seen models replicate the error in the training data, especially with keypoint prediction. When I measured the error in the training data, I showed the team that the model had the same degree of error. I was arguing for cleaner training data. I got told no and to come up with a magic solution to fix the problem. I quit 🤣
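If you want to see the “random noise averages out” case concretely, here’s a toy sketch (scikit-learn; every name and number in it is made up for illustration, not from any real project): a simple model trained on labels where 10% were flipped at random can still beat 90% accuracy against the clean test labels, whereas a systematic bias like that keypoint error would just get learned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary task with clean ground-truth labels.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Flip 10% of the *training* labels at random -> training labels are ~90% accurate.
noisy = y_train.copy()
flip = rng.choice(len(noisy), size=len(noisy) // 10, replace=False)
noisy[flip] = 1 - noisy[flip]

model = LogisticRegression(max_iter=1000).fit(X_train, noisy)

# Random (unbiased) flips tend to average out, so a simple model can score
# above 90% against the clean test labels. A consistent bias would not.
print("accuracy vs clean test labels:", model.score(X_test, y_test))
```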
sayoonarachu t1_j2xypar wrote
Generally, it's a good idea to split your data into training, validation, and testing sets, something like 80/10/10 or 80/20 depending on how much data you're feeding the neural network (NN).
So 80% of the data, randomly selected, is used to train the NN, and at the end of, say, every epoch (or every N batches) you evaluate it against the validation set to see how well it generalizes beyond what it has "learned."
Once you're happy with the model's performance, you use the test set to see how well it performs on "new" data, in the sense that the 10% you set aside for testing was never shown to the model during training.
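In code, the split might look something like this (a rough scikit-learn sketch; the dataset here is a random placeholder, and the fractions are just the 80/10/10 example from above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)        # placeholder features
y = np.random.randint(0, 2, 1000)   # placeholder binary labels

# Hold out 20% that the model never trains on...
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the holdout evenly: 10% validation (checked every epoch),
# 10% test (touched exactly once, after all modelling decisions are frozen).
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```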
Of course, there are many, many other techniques for reducing loss, improving performance, etc. But even if your network were "perfect," if the person building it didn't spend the time to "clean" the data, the model will always inherit some degree of that error.
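As a rough idea of what "cleaning" can look like in practice, here's a tiny pandas sketch (the file name and columns are completely made up):

```python
import pandas as pd

# "training_data.csv", "label", and "age" are all hypothetical here.
df = pd.read_csv("training_data.csv")
df = df.drop_duplicates()              # duplicate rows silently over-weight some examples
df = df.dropna(subset=["label"])       # unlabeled rows can't be fit
# Drop impossible values rather than training on them blindly.
bad = (df["age"] < 0) | (df["age"] > 120)
df = df[~bad]
```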
Or something like that. I'm just a fledgling when it comes to deep learning.