no_witty_username

no_witty_username t1_iymgyhr wrote

Humans do get clean data when learning. Here is what bad data looks like for humans: ocular degeneration, deafness, neurological disorders, etc. Children with sensory deformities, or with diseases that damage their sensory organs, all have severe learning difficulties. The same goes for machines when they are presented with shit data. A machine's ability to understand anything depends on many factors, and one of the most important is being fed the kind of data it was built to process. Showing a machine a badly cropped image of a person, where the top half is fully missing and only the neck down is displayed, and telling it that's what a person is, is bad data in the same way as showing any image to a child with ocular degeneration. The image is severely distorted, and while the child's brain is quite capable of proper learning, its sensors, aka the eyes, are presenting shit data, so no proper learning will occur.

5

no_witty_username t1_ivffk9p wrote

All evidence currently points to it being quite possible for even one person to make a high-quality model all by themselves. It will take great effort in high-quality data curation, but I do not see anything that is out of reach. The only reason this field has a perception of a large-dataset requirement is that a large amount of data was used to train the base model. What folks don't seem to understand is that the quality of the data used in training the base model was EXTREMELY poor. Bad captions, bad cropping, redundancies, mis-categorizations, and a plethora of other issues plague the training data. The base SD model could have been trained with orders of magnitude less data if due diligence had been used in data curation.
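The kinds of curation checks described above (bad crops, bad captions, redundancies) can be sketched as a simple filter pass. This is a minimal illustration, not any actual Stable Diffusion pipeline; the record fields and thresholds are assumptions chosen for the example.

```python
def curate(records, min_side=256, max_aspect=2.5, min_caption_words=3):
    """Keep only records that pass basic quality checks.

    Each record is assumed to be a dict with hypothetical keys
    'width', 'height', and 'caption'.
    """
    seen = set()
    kept = []
    for rec in records:
        w, h = rec["width"], rec["height"]
        caption = rec["caption"].strip()
        # Bad cropping: tiny images or extreme aspect ratios often mean
        # the subject was cut off (e.g. a person cropped at the neck).
        if min(w, h) < min_side or max(w, h) / min(w, h) > max_aspect:
            continue
        # Bad captions: too short to describe the image usefully.
        if len(caption.split()) < min_caption_words:
            continue
        # Redundancy: drop exact duplicates by a (size, caption) key.
        key = (w, h, caption.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

For example, a batch containing one good record, one extreme crop, one one-word caption, and one exact duplicate would be curated down to just the good record. Real pipelines would use perceptual hashing and learned caption/image scoring instead of these toy heuristics.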

This is the case for Stable Diffusion. I would not be surprised if this was the case for other models as well.

9