Submitted by fourcornerclub t3_z9qx7a in MachineLearning

I was recently having this debate with a data engineering friend. My position was that as foundation models "eat the world", it will become more valuable to be good at sourcing high-quality training data for finetuning than at building new models. Would love to trigger a wider debate here!

26

Comments


AGI_aint_happening t1_iyiw6er wrote

I think that's always been the case, foundation models or not

20

alex_lite_21 t1_iyi9g5g wrote

I agree with this. It relates to the garbage-in, garbage-out concept. I am also not a fan of data augmentation, at least not in the way it is commonly done, without thinking too much. Getting high-quality data is paramount.

8

fourcornerclub OP t1_iylq3v5 wrote

u/alex_lite_21 yeah, the status quo of data augmentation to me reads like "oh, I scraped together quite a shit training set here, maybe I'll play with it to make it less shit", rather than "how do I robustly collect a highly suitable dataset from the outset and then iterate from there?"
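
Something like this is what I mean by collecting deliberately instead of augmenting a bad scrape. Just a toy sketch, the thresholds and the corpus are made up:

```python
import hashlib

raw_corpus = [
    "some scraped paragraph that is actually about the task and worth keeping",
    "click here click here click here click here click here click here",
    "some scraped paragraph that is actually about the task and worth keeping",  # exact duplicate
]

def keep_example(text, min_words=8, max_words=2000):
    """Cheap quality gates applied before any augmentation happens."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # crude spam/boilerplate signal: almost no vocabulary diversity
    if len({w.lower() for w in words}) / len(words) < 0.3:
        return False
    return True

def dedupe(corpus):
    """Drop exact duplicates instead of 'augmenting' them into more copies."""
    seen, out = set(), []
    for text in corpus:
        h = hashlib.sha1(text.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(text)
    return out

clean_corpus = [t for t in dedupe(raw_corpus) if keep_example(t)]
print(clean_corpus)  # only the first document survives
```

The point isn't these particular rules, it's that you write the filters down, look at what they reject, and iterate on the collection itself.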

2

no_witty_username t1_iyiqgp4 wrote

Data curation is the biggest bottleneck for making AAA-quality models.

4

fourcornerclub OP t1_iylpygx wrote

u/no_witty_username and yet the standard in data sourcing still seems to be "let me see what's open source, and what I can scrape from the internet, and then I'll tune the model from there". Makes no sense to me!

2

FlattenLayer t1_iylz5rc wrote

CTR models that predict click-through rates in recommendation systems like TikTok and Google are fed tens of billions of samples from the exposure logs. In that case, the most important thing is keeping the exposure log clean. But it's not easy, because there is a complex and long pipeline from the exposure log to the training samples.
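
To make that concrete, here is a rough sketch of the kind of cleaning that has to happen between the raw log and the training samples. The field names and thresholds are purely illustrative, not from any real system:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Exposure:
    user_id: str
    item_id: str
    timestamp: float                         # when the item was shown
    dwell_ms: int                            # how long it stayed on screen
    clicked: bool
    click_timestamp: Optional[float] = None

def exposures_to_samples(exposures, min_dwell_ms=500, attribution_window_s=60):
    """Turn raw exposure-log rows into (features, label) pairs for a CTR model.

    Drops obviously broken rows so the label distribution isn't polluted:
    duplicate (double-logged) impressions, exposures too short for the user
    to have seen the item, and clicks outside the attribution window.
    """
    seen, samples = set(), []
    for e in exposures:
        key = (e.user_id, e.item_id, e.timestamp)
        if key in seen:                      # double-logged impression
            continue
        seen.add(key)
        if e.dwell_ms < min_dwell_ms:        # user never really saw it
            continue
        label = 0
        if e.clicked and e.click_timestamp is not None:
            # only count clicks that happened shortly after the exposure
            if 0 <= e.click_timestamp - e.timestamp <= attribution_window_s:
                label = 1
        samples.append(({"user_id": e.user_id, "item_id": e.item_id}, label))
    return samples

logs = [
    Exposure("u1", "i9", 100.0, dwell_ms=1200, clicked=True, click_timestamp=105.0),
    Exposure("u1", "i9", 100.0, dwell_ms=1200, clicked=True, click_timestamp=105.0),  # duplicate
    Exposure("u2", "i7", 200.0, dwell_ms=80, clicked=False),                          # never seen
]
print(exposures_to_samples(logs))  # one positive sample survives
```

In a real system each of those rules is its own upstream job, which is exactly where the pipeline gets long and fragile.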

2

cantfindaname2take t1_iym5ipx wrote

Is it though? One comparison that keeps coming back up is to human learning. Do humans get clean training samples? I like to think not. Instead, humans learn to separate signal from noise much better, and also learn to model hidden causes.

2

no_witty_username t1_iymgyhr wrote

Humans do get clean data when learning. Here is what bad data looks like for humans: ocular degeneration, deafness, neurological disorders, etc. Children with sensory deformities, or diseases that damage their sensory organs, all have severe learning difficulties. The same goes for machines when they are presented with shit data. A machine's ability to understand anything depends on many factors, and one of the most important is presenting it with data it was built to process. Showing a machine a badly cropped image of a person, where the top half of that person is fully missing and only the neck down is displayed, and telling it that's what a person is, is bad data in the same way that showing any image to a child with ocular degeneration is. The image is severely distorted, and while the child's brain is quite capable of proper learning, its sensors, aka the eyes, are presenting shit data, so no proper learning will occur.
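
If you have bounding-box annotations, even a dumb sanity check catches a lot of those bad crops. Rough sketch, the box format and threshold are made up:

```python
def is_badly_cropped(person_box, image_w, image_h, min_visible_frac=0.6):
    """Flag samples where the annotated person is mostly cut off by the crop.

    person_box is (x0, y0, x1, y1) in the original image's coordinates;
    the crop kept only the region [0, image_w] x [0, image_h].
    """
    x0, y0, x1, y1 = person_box
    full_area = max((x1 - x0) * (y1 - y0), 1e-6)
    # intersect the person box with what the crop actually kept
    vis_w = max(0.0, min(x1, image_w) - max(x0, 0.0))
    vis_h = max(0.0, min(y1, image_h) - max(y0, 0.0))
    visible_frac = (vis_w * vis_h) / full_area
    return visible_frac < min_visible_frac

# a person whose top half falls outside the crop gets flagged
print(is_badly_cropped((10, -200, 110, 200), image_w=224, image_h=224))  # True
```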

5