My ML team is looking to buy/source a dataset of videos of people performing certain niche tasks to train a business-critical model. From our research, it seems like Scale AI, Toloka, Appen, Defined AI, and Clickworker offer solutions in that space.

Has anyone used any of these before, and would you recommend (or recommend avoiding) them? Are we better off just running the crowdsourcing of the data in-house?

Comments

suflaj t1_it4no84 wrote on October 20, 2022 at 10:16 PM

If you have the means to record the dataset in-house, that's the best way. You can talk directly to the annotators and the subjects, you can make sure the data cannot be redistributed unless someone leaks it, and you will have a better grasp of the privacy policies. It is also likely to be cheaper.

With external data it is almost impossible to prove you are allowed to have it, and this data can then just be resold to someone else, potentially a competitor.

seiqooq t1_it4vr7f wrote on October 20, 2022 at 11:17 PM

Curious about others' experiences as well. We opted to go the data-capturing infra route, so I’m in the other boat.

fourcornerclub t1_it76lgq wrote on October 21, 2022 at 12:59 PM

Interesting - what was your experience like here? And what did you use? Thanks!

seiqooq t1_it7ss31 wrote on October 21, 2022 at 3:37 PM

We’re in surveillance, so vertically integrating was (fortunately) an option for us. It was certainly worth it since our org had the means, but the build-vs-buy trade-off is always a thing.

DigThatData t1_it59gaq wrote on October 21, 2022 at 1:00 AM

It depends on the data. Considering the kind of data you're working with is one of the least mature media in the analytics industry (video), it might be both significantly more cost-effective and more likely to produce a high-quality result if you buy the dataset. That said, if you were thinking of spinning up an in-house data annotation resource, this might be a good opportunity to go that route, and I'm sure the ML team wouldn't have any complaints if you gave them a persistent data-generating resource like that.
