iknowjerome OP t1_ivtti9x wrote on November 10, 2022 at 3:47 PM

Reply to comment by that_username__taken in [R] A relabelling of the COCO 2017 dataset by iknowjerome

Every dataset has errors and inconsistencies. It is true that some have more than others, but what really matters is how that affects the end goal. Sometimes, the level of inconsistency doesn't impact model performance as much as one would expect. In other cases, it is the main cause of poor model performance, at least in one area (for instance, for a specific set of classes). I totally agree with you that companies that succeed in putting and maintaining AI models in production pay particular attention to the quality of the datasets that are created for training and testing purposes.
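To illustrate the point about a specific set of classes: an aggregate score can look acceptable while one class does badly, so a per-class breakdown is a cheap first check. This is a minimal sketch with made-up toy labels; the function name `per_class_recall` and the data are hypothetical and not from the relabelling work discussed here.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {class: recall} computed from two parallel lists of labels."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: correct[cls] / n for cls, n in total.items()}

# Hypothetical toy labels, invented purely for illustration.
y_true = ["cat", "cat", "dog", "dog", "dog", "bus", "bus", "bus"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "car", "bus", "car"]

recalls = per_class_recall(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"overall accuracy: {overall:.2f}")
for cls, recall in sorted(recalls.items()):
    print(f"{cls}: recall {recall:.2f}")
```

In this toy data the overall accuracy is 0.75, yet the "bus" class is recovered only one time in three. A gap like that is a reason to look at the labels for that class, though it does not by itself prove the labels are at fault.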

that_username__taken t1_ivttxzf wrote on November 10, 2022 at 3:50 PM

Yeah, I agree, but finding those errors at the end of the cycle is extremely painful and time-consuming.

iknowjerome OP t1_ivtw0xs wrote on November 10, 2022 at 4:04 PM

The trick is not to wait for the end of the cycle to make the appropriate adjustments. And there are now a number of solutions on the market that help with understanding and visualizing your image/video data and labels.
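One way to act early, before any tool or training run, is a quick structural check of the annotation file itself. The sketch below assumes a COCO-style dictionary with `images`, `annotations` and `categories` lists and boxes stored as `[x, y, width, height]`. The function name `find_annotation_problems`, the chosen checks and the toy data are assumptions for illustration, not part of any of the tools mentioned in this thread.

```python
def find_annotation_problems(coco):
    """Return a list of (annotation_id, reason) for suspicious annotations."""
    images = {img["id"]: img for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    problems = []
    for ann in coco["annotations"]:
        img = images.get(ann["image_id"])
        if img is None:
            problems.append((ann["id"], "unknown image_id"))
            continue
        if ann["category_id"] not in category_ids:
            problems.append((ann["id"], "unknown category_id"))
        x, y, w, h = ann["bbox"]  # COCO-style box: [x, y, width, height]
        if w <= 0 or h <= 0:
            problems.append((ann["id"], "degenerate box"))
        elif x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]:
            problems.append((ann["id"], "box outside image"))
    return problems

# Hypothetical toy dataset, invented purely for illustration.
coco = {
    "images": [{"id": 1, "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 200]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [600, 20, 100, 50]},
        {"id": 12, "image_id": 1, "category_id": 99, "bbox": [5, 5, 0, 10]},
        {"id": 13, "image_id": 2, "category_id": 1, "bbox": [0, 0, 10, 10]},
    ],
}

problems = find_annotation_problems(coco)
for ann_id, reason in problems:
    print(ann_id, reason)
```

Checks like these only catch structural mistakes. Wrong class names or loosely drawn boxes still need visual review, and real annotation files may contain boxes that overshoot the image by a fraction of a pixel, so the bounds test may need a small tolerance.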

Mozillah0096 t1_ivtxgd3 wrote on November 10, 2022 at 4:14 PM

u/iknowjerome can you tell me which solutions you are talking about?

iknowjerome OP t1_ivweoyi wrote on November 11, 2022 at 2:36 AM

Lightly and Voxel51, just to name a couple I'm pretty familiar with.

jonas__m t1_ix5ey4i wrote on November 20, 2022 at 9:53 PM

cleanlab is an open-source Python library that checks data and label quality.
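For a rough idea of what an automated label check can look like, here is a simplified heuristic: flag every example where a model's top prediction disagrees with the given label, then review the ones where the model gave the given label the lowest probability first. This is not cleanlab's API or its actual algorithm. The function name `rank_suspect_labels` and the toy probabilities are hypothetical, and the sketch assumes the probabilities come from a model that did not train on the examples being checked.

```python
def rank_suspect_labels(labels, pred_probs):
    """Return indices whose given label disagrees with the model.

    labels: list of int class indices (the given, possibly noisy labels).
    pred_probs: list of per-class probability lists, one per example.
    The result is ordered so the least plausible given labels come first.
    """
    suspects = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != label:
            # probs[label] is the model's confidence in the given label.
            suspects.append((probs[label], i))
    return [i for _, i in sorted(suspects)]

# Hypothetical toy data, invented purely for illustration.
labels = [0, 1, 1, 0, 2]
pred_probs = [
    [0.90, 0.05, 0.05],  # agrees with given label 0
    [0.10, 0.80, 0.10],  # agrees with given label 1
    [0.70, 0.20, 0.10],  # labelled 1, model prefers 0
    [0.05, 0.05, 0.90],  # labelled 0, model strongly prefers 2
    [0.20, 0.20, 0.60],  # agrees with given label 2
]

suspect_indices = rank_suspect_labels(labels, pred_probs)
print(suspect_indices)
```

The ranked indices are candidates for human review, not confirmed errors: the model can be wrong too, so a disagreement only says where to look first.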