ComplexColor t1_iuzt3zw wrote
I'm not very familiar with generative models. Are there explicit or implicit "techniques" that would prevent the model from plagiarizing the training material? Otherwise, it seems rather problematic to claim copyright on what could be an existing piece of art.
I realize that the likelihood might be infinitesimal, but after billions and billions of generations, some unlikely but clearly plagiarized works could be produced.
Saytahri t1_iv0ikmm wrote
They made a blog post about this. They generated a lot of samples and checked for matches in the dataset. There were some, mostly very simple vector art that was duplicated many times over in the dataset.
They removed the duplicates and then checked again: no matches.
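(For the curious, a minimal sketch of what such a duplicate check could look like, using perceptual hashing via the `imagehash` package. The blog post's actual matching method may well differ, and the file paths below are just placeholders.)

```python
# Minimal sketch of a near-duplicate check between generated samples and
# training images, using perceptual hashes. Illustrative only: the actual
# method may differ (e.g. embedding distance instead of hashing).
from PIL import Image
import imagehash

train_paths = ["train/0001.png", "train/0002.png"]  # placeholder paths
sample_paths = ["samples/0001.png"]                 # placeholder paths

def phash(path):
    """Perceptual hash of an image; visually similar images hash similarly."""
    return imagehash.phash(Image.open(path))

train_hashes = {p: phash(p) for p in train_paths}
for sample in sample_paths:
    h = phash(sample)
    for train_path, th in train_hashes.items():
        # Hash difference is a Hamming distance; small means near-duplicate.
        if h - th <= 4:
            print(f"{sample} looks like a near-duplicate of {train_path}")
```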
disturbing_nickname t1_iv0jdjg wrote
Your comment helped me realize that scientists will probably soon show that AI creates more original content than a human does, by analyzing the creative output. Fascinating thought…
petseminary t1_iuzw2mv wrote
I'd go a step further and question how you can copyright these outputs if you don't own everything the model was trained on.
C0DASOON t1_iv047p4 wrote
The same way a human artist can copyright a piece of art they made after drawing inspiration from other people's art.
pdillis t1_iv0z284 wrote
I've been using AI/neural networks to make art since 2018, and this is the argument that has (very recently) gained a lot of popularity in defense of AI art, but it baffles me the most. A human artist and a neural network are not the same: the NN is just a tool, which is why the user is still considered the artist. Giving human qualities to the NN whenever convenient is a detriment to the movement as a whole.
C0DASOON t1_iv1622m wrote
Stating that a model which uses existing art only to update its parameters should not need special permission to be exposed to that art, by analogy to how human artists do not need permission to do so, is not giving human qualities to a model. Unless, that is, your argument is that the only reason humans don't need permission to view or take inspiration from art is that we make a special exception for viewing and inspiration when performed by human beings, and that otherwise all exposure to art requires permission from the copyright holder, which is just as stupid as the existence of copyright in the first place. You do not, and should not, need special permission to use art, or anything else, to update model parameters.
petseminary t1_iv1b96a wrote
AI does not draw inspiration. Seeing something and being inspired by it is human. Processing lots of photos of artworks to produce similar works rehashes that data in a fundamentally different way.
kaibee t1_iv1kckb wrote
>AI does not draw inspiration. Seeing something and being inspired by it is human. Processing lots of photos of artworks to produce similar works rehashes that data in a fundamentally different way.
So like, Stable Diffusion: the model is 4 GB and can be reduced to 2 GB without much loss in quality, and it was trained on ~5 billion images. Since 1 gigabyte is a billion bytes, that works out to well under one byte of model per training image, while each raw 512x512x3 image is ~786 KB. This is transformative, so fair use is a valid defense, imo.
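Quick back-of-envelope, assuming the 2 GB checkpoint and ~5 billion training images mentioned above (the numbers are rough):

```python
# Rough arithmetic behind the claim above: how many bytes of model
# capacity exist per training image? Numbers are approximate.
model_bytes = 2e9                 # ~2 GB reduced checkpoint
num_images = 5e9                  # ~5 billion training images
raw_image_bytes = 512 * 512 * 3   # 786,432 bytes per uncompressed image

bytes_per_image = model_bytes / num_images   # 0.4 bytes per image
compression_factor = raw_image_bytes / bytes_per_image

print(f"{bytes_per_image:.2f} bytes of model per training image")
print(f"~{compression_factor:,.0f}x effective 'compression'")  # ~1,966,080x
```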
petseminary t1_iv1lbgu wrote
It ain't shit without all the human effort that went into creating the training data. To my displeasure, I think the law will see it your way, but I don't think people should be so flippant about marginalizing over so much human creative effort. I have no problem with acquiring the rights to photos to train image generators, because that's the true cost of these products. It has nothing to do with final file size.
kaibee t1_iv1rktu wrote
> It ain't shit without all the human effort that went into creating the training data. To my displeasure, I think the law will see it your way, but I don't think people should be so flippant about marginalizing over so much human creative effort. I have no problem with acquiring the rights to photos to train image generators, because that's the true cost of these products. It has nothing to do with final file size.
I'm not sure what you mean by 'marginalizing'. The contribution of the artists is valid and necessary. I know a lot of the "common folk" in the SD community enjoy that some artists are upset by this whole thing, but I think, on the whole, the community is supportive of artists.
Though, I do have another angle here: copyright is absolutely out of control, and the vast majority of it at this point is accruing for the benefit of Disney, as a result of lobbying by Disney and others. I think it is fundamentally absurd that children can grow up with beloved characters and die of old age before the copyright on those characters expires. And that's kind of the whole issue here, right? If artists wanted a 20-year copyright term on something, I think that would be good and reasonable. They should be able to exclude their images from training data. I'd even be in favor of going as far as to say that there should be some associated metadata to facilitate that, that the government should enforce compliance, that artists should be able to sue, etc., the whole nine yards.
But let's even say we keep copyright as it is: death of the author plus however many decades. Even if you could enforce the law (I can't even imagine how you would, especially in the coming years), all this does is push the problem out for artists until either models get better at learning from less data (so that you can make do with the far more limited amount of training data you can buy the rights for) or enough data enters the public domain.
The Luddites weren't wrong; they really did suffer as a result of technological disruption. As with all things, the solution is a basic income funded by a land-value tax.
petseminary t1_iv26lvl wrote
I agree with you here. I think a reasonable example is the Wayback Machine. It's very useful for archiving web content that has disappeared for whatever reason (usually a lapse in web hosting), but if site/content creators want their content excluded, the Wayback Machine operators are very responsive and will stop hosting it. I anticipate that asking for your content to be excluded from training sets after the fact will be received much less pleasantly, as the model would have to be retrained, and that is expensive.
Living-Substance-668 t1_iv22uy0 wrote
That may be, but either way there has been a dramatic transformation of the original works. Copyright is not an infinitely extended ownership right over information. It is a special exception (to free speech and press) that we offer conditionally to encourage people to produce things, by allowing them to profit exclusively from their production, like patents. Copyright does not prohibit producing a "similar" work to a copyrighted one, or using similar techniques, or else every drawing of a soup can would owe royalties to Andy Warhol.
jarkkowork t1_iuzyk96 wrote
Probably similar chances of that happening as with humans, whose creativity is largely based on subconsciously mimicking works they have already seen.
hybridteory t1_iv0f5y5 wrote
Yes, I find it incredibly strange that when speaking about Codex, everyone is worried about the models regurgitating the code they were trained on, citing the GPL and other licenses; but this seems to be much less of an issue when it comes to images (going by anecdotal evidence from these discussions), even though images have licenses too. It just goes to show that humans perceive text and images very differently from a creative point of view.
farmingvillein t1_iv2bbw6 wrote
- If there can be a lawsuit, there eventually certainly will be one.
- The issues here are, for now, different. The current claim is that Codex is copy-pasting things that need licenses attached (whether this is true will, of course, be played out in court). For image generation, no one has yet claimed that these systems are emitting straight copies, at any meaningful scale, of someone else's original pictures.
hybridteory t1_iv2ebe5 wrote
Codex is not technically copy-pasting; it is generating a new output that is (almost) exactly the same as the input, or indistinguishable from it to a human's eyes. It sounds like semantics, but there is no actual copying. Music-generating algorithms can already produce short samples indistinguishable from their training inputs (memorisation). Dall-E 2 is not there yet, but we are close to being able to prompt "Original Mona Lisa painting" and get back something strikingly similar to the original. There are already several generative models of images that can mostly memorise the inputs used to train them (a quick example found using Google: https://github.com/alan-turing-institute/memorization).
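(A rough sketch of how this kind of memorisation check is often done: embed generated and training images with CLIP and flag generations whose nearest training neighbour is suspiciously close. The `open_clip` usage, the 0.95 threshold, and the file paths are my own illustrative choices, not the method of the linked repo.)

```python
# Sketch of a memorisation check: flag generated images whose nearest
# training image is almost identical in CLIP embedding space.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

def embed(paths):
    # Stack preprocessed images, return L2-normalised CLIP features.
    batch = torch.stack([preprocess(Image.open(p)) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

train_paths = ["train/0001.png", "train/0002.png"]  # placeholder paths
gen_paths = ["samples/0001.png"]                    # placeholder paths

train_feats = embed(train_paths)
gen_feats = embed(gen_paths)

sims = gen_feats @ train_feats.T        # cosine similarities
best_sim, best_idx = sims.max(dim=1)    # nearest training image per sample
for i, (s, j) in enumerate(zip(best_sim.tolist(), best_idx.tolist())):
    if s > 0.95:  # near-duplicate in embedding space => likely memorised
        print(f"{gen_paths[i]} ~ {train_paths[j]} (cos sim {s:.3f})")
```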
farmingvillein t1_iv2vqmx wrote
> Codex is not technically copy-pasting; it is generating a new output that is (almost) exactly the same as the input, or indistinguishable from it to a human's eyes.
Nah, it is literally generating duplicates. This is copying, in the eyes of the law. Whether this is an actual legal problem remains to be seen.
> Dall-E 2 is not there yet, but we are close to being able to prompt "Original Mona Lisa painting" and get back something strikingly similar to the original.
This is confused. Dall-E 2 is "not there yet", as a general statement, because they have specifically trained it not to do this.
hybridteory t1_iv30cij wrote
There is nothing about diffusion models that stops them from memorising data. Dall-E 2 can definitely memorise.
farmingvillein t1_iv38uzt wrote
That is my point? I'm not sure how to square your (correct) statement here with your earlier one:
> Dall-E 2 is not there yet, but we are close to being able to prompt "Original Mona Lisa painting" and get back something strikingly similar to the original