
Fafniiiir t1_j5wxb0j wrote

IMO it sets a really creepy and worrisome precedent that they can just scrape everything they want.
A lot of very bad stuff has been found in these datasets, including CSAM, ISIS footage, revenge porn, leaked nudes, etc.
Even on a less horrifying note, it's also people's personal photographs, medical records, before-and-after photos of operations, private vacation pictures, family photos, IDs... you get the idea.

I do find it a bit worrisome if they can just scrape everything they want online and use it for commercial purposes like this.
At least Disney using their own copyrighted work to train an ai wouldn't run into these ethical problems.

1

LAwLzaWU1A t1_j5xvmyt wrote

I genuinely do not understand why you find that creepy and worrisome. We have allowed humans to do the exact same thing since the beginning of art, yet it only seems to be an issue when an AI does it. Is it just that people were unaware of it before, and now that they realize how the world works they are reacting to it?

If you have ever commissioned an artist to draw something for you, would you suddenly find it creepy and worrisome if you knew that said artist had once seen an ISIS video on the news? Because seeing that ISIS video on the news did alter how the artist's brain was wired, and could potentially have influenced how they drew your picture in some way (maybe a lot, maybe just 0.0001%, depending on what picture you asked them to draw).

The general advice is that if you don't want someone to see your private vacation photos, don't upload them to public websites for everyone to see. Training data sets like LAION did not hack into people's phones and steal the pictures. The pictures ended up in LAION because they were posted to the public web where anyone could see them. This advice was true before AI tools were invented, and it will be true in the future as well. If you don't want someone to see your picture, then don't post it on the public web.

Also, there would still be ethical problems even if we limited this to just massive corporations. First of all, it's ridiculous to say "we should limit this technology to massive corporations because they will behave ethically". I mean, come on.

But secondly, and more importantly, what about companies that don't produce their own content to train their AI on, but instead rely on user-submitted content? If Facebook and Instagram included a clause saying they were allowed to train their AI models on submitted images, do you think people would stop using Facebook? Hell, for all I know they might already have such a clause. I doubt many people are actually aware of what they agree to in the terms of service when signing up for websites.

Edit:

It is also important to understand the amount of data that goes into these models and data sets. LAION-5B consists of 5.85 billion images. That is a number so large that it is near impossible for a human to even comprehend it. Here is a good quick and easy visualization of what one billion is. And here is a longer and more stark visualization, because the first video actually uses 100,000 dollars as the "base unit", which by itself is almost too big for humans to comprehend.

Even if someone were to find 1 million images of revenge porn or whatever in the dataset, that's still just 0.02% of the data set, which in and of itself is not the same as 0.02% of the final model produced by the training. We're talking about a million images maybe affecting the output by 0.02%.
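
For perspective, the fraction above is just arithmetic, assuming the ~5.85-billion-image figure reported for LAION-5B:

```python
# Share of a hypothetical 1 million problematic images in a
# ~5.85-billion-image data set (LAION-5B's reported size).
dataset_size = 5_850_000_000
hypothetical_subset = 1_000_000

print(f"{hypothetical_subset / dataset_size:.2%}")  # 0.02%
```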

How much inspiration does a human draw from the works they have seen? Do we give humans a pass just because we can't quantify how much influence a human artist drew from any particular thing they have seen and experienced?

I also think the scale of these data sets brings up another point. What would a proposed royalty structure even look like? Does an artist who had 100 of their images included in the data set get 100/5,000,000,000 of a dollar (0.000002% of a dollar)? That also assumes their works contributed to the final model in proportion to their share of the data set. LAION-5B is 240 TB large, and a model trained on it would be ~4 GB; 99.99833% of the data is discarded in going from training data to model.

How do we accurately calculate the influence you had on the final model, which is 0.001% the size of the data set to which you contributed 0.000002%? Not to mention that these AIs might create internal models within themselves, which would further diminish the percentages.

Are you owed 0.000002% of 0.001%? And that also assumes that the user of the program accounts for none of the contributions either.
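
The compounded fractions above can be sketched numerically. This is just arithmetic on the figures quoted in this comment (100 images, the rounded 5-billion-image data set, 240 TB of training data, a ~4 GB model), not a claim about how influence actually propagates through training:

```python
# Naive proration of one artist's contribution through the
# data-to-model compression, using the figures quoted above.
artist_images = 100
dataset_images = 5_000_000_000   # rounded data-set size used in the text
dataset_bytes = 240e12           # ~240 TB of training data
model_bytes = 4e9                # ~4 GB trained model

data_share = artist_images / dataset_images   # share of the data set
compression = model_bytes / dataset_bytes     # model size vs. data size

print(f"share of data set: {data_share:.6%}")    # 0.000002%
print(f"model / data size: {compression:.5%}")   # ~0.00167%
print(f"naive combined:    {data_share * compression:.2e}")
```

Even this byte-counting proration, which almost certainly overstates any individual contribution, lands at roughly 3e-13 of the model.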

It's utterly ridiculous. These things are being discussed by people who have no understanding of how any of it works, and it really shows.

1