Submitted by groman434 t3_103694n in MachineLearning
e_for_oil-er t1_j2xz49i wrote
Reply to comment by groman434 in [Discussion] If ML is based on data generated by humans, can it truly outperform humans? by groman434
I guess "errors" in the dataset could be equivalent to introducing noise (random perturbations with mean 0) or a bias (a perturbation with nonzero expectation). I guess those would be the two main kinds of inaccuracies found in data.
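A minimal sketch of that distinction (my own toy example, not from the thread): perturb the same ground-truth labels with a mean-zero Gaussian (noise) versus a Gaussian with nonzero mean (bias), and compare the sample means. The noise averages out; the bias does not.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.full(10_000, 5.0)  # ground-truth labels, all equal to 5.0

# Noise: mean-zero perturbation -- averages out over many samples.
noisy = true_values + rng.normal(loc=0.0, scale=1.0, size=true_values.shape)

# Bias: perturbation with nonzero expectation (here +0.5) -- does NOT average out.
biased = true_values + rng.normal(loc=0.5, scale=1.0, size=true_values.shape)

print(f"noisy mean:  {noisy.mean():.2f}")   # close to 5.0
print(f"biased mean: {biased.mean():.2f}")  # close to 5.5
```

The point being: more data helps against noise, but no amount of data fixes a systematic bias in how the data was generated.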
Bias has plagued some language models that were trained on internet forum data: the training data was biased towards certain opinions, and the model just spat them out. This has caused the creators of some of those models to shut them down. I don't know how one could correct for bias, since this is not at all my area of expertise.
Learning techniques resistant to noise (often called robust methods) are an active area of research, and some of them actually perform really well.
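To illustrate what robustness buys you (again my own toy example, not something from the thread): the classic case is estimating a location parameter when a few labels are grossly wrong. The mean (the least-squares estimate) gets dragged toward the outliers, while the median (a simple robust estimate) barely moves.

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(loc=3.0, scale=0.5, size=950)      # 95% good data around 3.0
outliers = rng.normal(loc=50.0, scale=1.0, size=50)   # 5% gross labeling errors
data = np.concatenate([clean, outliers])

mean_est = data.mean()        # least-squares estimate: pulled toward the outliers
median_est = np.median(data)  # robust estimate: barely affected

print(f"mean:   {mean_est:.2f}")    # well above 3.0
print(f"median: {median_est:.2f}")  # still close to 3.0
```

Robust learning methods (e.g. Huber-type losses) generalize this idea: they down-weight samples that look like gross errors instead of letting them dominate the fit.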