FirstBabyChancellor
FirstBabyChancellor t1_j60a0n5 wrote
!remindme 2 days
FirstBabyChancellor t1_isdrhk8 wrote
How would the annotator even know how to choose from 10K different labels? Having that many possible labels feels like a recipe for inaccurate labels, because the annotator will simply get confused or zone out.
I obviously don't know the specifics of your data, but that's my first impression from reading your post.
Anyways, one solution that addresses both this and your issues with the labelling tools would be to break the task into multiple rounds. If the items are products, then start off by grouping the individual products into categories, such as foods, furniture, etc.
Then, in subsequent rounds, drill down into the specifics. This is obviously more time consuming and expensive, though, because you're labelling each item at least twice.
One refinement you could make here is to use a preliminary model to assign the labels into the broader categories. That is, hand-label a relatively small subset of the data with examples of each category (foods, furniture, etc.). Train a model on this small amount of data and generate classification probabilities to associate each image with one of these categories. Define some threshold (e.g., 25%) for inclusion in a given category. Note that this means that if p(food)=30% and p(furniture)=28%, then this image will be labelled in both the food and furniture rounds. Depending on how good this initial model is, you'll have to pick a 'sane' threshold -- you'll be training on a small amount of data, so performance probably won't be great, but on the other hand, models may already exist that can identify the broader categories with ease, or transfer learning might help despite the small training set. The worse the model is, the lower you should set your threshold, with the trade-off being that you'll likely include the same image in multiple sub-rounds.
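The thresholded assignment above can be sketched in a few lines. This is just an illustration, not a real pipeline: the function name, the probability vectors, and the 25% threshold are all placeholders, and the probabilities would come from whatever preliminary classifier you trained.

```python
def assign_rounds(probs, categories, threshold=0.25):
    """probs: list of per-image probability vectors (one float per category).
    Returns a dict mapping each broad category to the image ids that
    should be included in that category's labelling round."""
    rounds = {c: [] for c in categories}
    for img_id, p in enumerate(probs):
        # Every category that clears the threshold gets this image.
        hits = [c for c, pc in zip(categories, p) if pc >= threshold]
        if not hits:
            # Model is unsure about everything: fall back to the argmax category.
            hits = [categories[max(range(len(p)), key=p.__getitem__)]]
        for c in hits:
            rounds[c].append(img_id)
    return rounds

# The example from the comment: p(food)=0.30 and p(furniture)=0.28 both
# clear a 25% threshold, so image 0 lands in both of those rounds.
rounds = assign_rounds([[0.30, 0.28, 0.42]], ["food", "furniture", "other"])
```

Lowering the threshold only ever adds images to more rounds, which is why a weaker model (lower threshold) costs you more duplicate labelling.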
You've now broken the data into categories and can have multiple labelling tasks, one for each category, with, say, a few hundred labels per task instead of 10K. If your labelling tool allows freeform text, add a freeform field to each task: if an annotator sees some food in the furniture task, none of that task's labels will match, so let them tag that image as being food and out of scope for the given task. Then address these edge cases with a little more labelling at the end.
Also, because some images may belong to multiple rounds if your initial model is bad or your threshold is low, you'll need to do some post-processing to ensure your labels aren't conflicting (i.e., the same image isn't marked as a food product and a furniture product by different annotators in different rounds). Again, identify such conflicts at the end and do some more labelling as needed.
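Flagging those cross-round conflicts is straightforward bookkeeping. A rough sketch, assuming each round's output is a dict of image id to fine-grained label (the round names and labels here are made up):

```python
def find_conflicts(round_labels):
    """round_labels: dict mapping round name -> {image_id: fine_label}.
    Returns the image ids that received a label in more than one round,
    i.e. the candidates for a final adjudication pass."""
    seen = {}       # image_id -> first (round, label) observed
    conflicts = {}  # image_id -> every (round, label) it received
    for rnd, labels in round_labels.items():
        for img_id, label in labels.items():
            if img_id in seen:
                conflicts.setdefault(img_id, [seen[img_id]]).append((rnd, label))
            else:
                seen[img_id] = (rnd, label)
    return conflicts

# Image 2 was labelled in both the food and furniture rounds, so it
# comes back for a final adjudication pass; image 1 is clean.
conflicts = find_conflicts({
    "food": {1: "apple", 2: "bread"},
    "furniture": {2: "chair"},
})
```

You'd then route only the conflicted images (typically a small fraction) to the extra labelling round at the end.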
FirstBabyChancellor t1_ir55gtb wrote
Reply to comment by diehardwalnut in [R] Google Colab alternative by Zatania
Their free machines are almost never available, in my experience. Also, all notebooks in their free tier are publicly available, which may be a major downside for some folks.
FirstBabyChancellor t1_jdq6cbb wrote
Reply to [D] Title: Best tools and frameworks for working with million-billion image datasets? by v2thegreat
Look into Nvidia DALI. It's designed primarily as a highly efficient, faster alternative to PyTorch's default dataloader, but you can also use it to run a number of preprocessing operations on images -- on GPUs.