
e_for_oil-er t1_j2xz49i wrote

I guess "errors" in the dataset could be equivalent to introducing noise (like random perturbations with mean 0) or a bias (perturbation with non 0 expectation). I guess those would be the two main kind of innacuracies found in data.

Bias has been the plague of some language models that were trained on internet forum data. The training data was biased towards certain opinions, and the model just spat them out. This has caused the creators of some of those models to shut them down. I don't know how one could correct for bias, since this is not at all my expertise.

Learning techniques that are resistant to noise (often called robust methods) are an active field of research, and some of them actually perform really well.
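
One example of such a robust method (my own illustration, not something from the thread) is Huber regression, which down-weights large residuals so a handful of badly corrupted labels pull the fit far less than they do for ordinary least squares:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.5, size=200)  # true slope = 3.0

# Corrupt 5% of the labels with large, one-sided errors.
bad = rng.choice(200, size=10, replace=False)
y[bad] += 30.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print(f"OLS slope:   {ols.coef_[0]:.2f}")    # dragged up by the outliers
print(f"Huber slope: {huber.coef_[0]:.2f}")  # stays close to 3.0
```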
