Comments

professorDissociate t1_j99carm wrote

The remediation: reporting the media’s response to OpenAI including the media in its training material. I feel like I’ve heard this one before, but with Putin assassinating himself.

6

stealytheblackguy t1_j99cby0 wrote

They also just straight up used dialogue created by the developers. ChatGPT is heavily biased.

9

hikeonpast t1_j99d0km wrote

“Maybe we can charge AI to read our articles, since humans won’t pay for our content anymore”, said Sensationalist “Biased” McMedia.

7

egypturnash t1_j99o9kf wrote

Oh good, maybe this will result in “fair use” being defined to explicitly not include some asshole scraping the Internet and dumping everything they find into their copyright-washing “AI”.

12

Slippedhal0 t1_j99zasl wrote

It's the same argument that artists make when complaining about copyrighted artwork being used as training data.

At some point there will be a major ruling about how companies training AI need to approach copyright for their training data sources, and if they rule in favour of copyright holders it will probably severely slow AI progress as systems to request permission are built.

Although I could maybe see a fine-tuned AI like Bing being less affected, because it cites sources rather than opaquely using previously acquired knowledge.

1

gurenkagurenda t1_j9a0m4m wrote

> Marconi said he asked the chatbot for a list of news sources it was trained on and received a response naming 20 outlets.

I see absolutely no reason to think that ChatGPT can answer this question accurately, and expect that it is hallucinating this answer. Its training process isn’t something it “remembers” like someone would remember their time in high school. Instead, its thought process is more like “what would a conversational response from a language model look like?”

That’s not to say that it wasn’t trained on those sources, but you have to understand the limitations of the model. Asking it about its training process is like asking a human about their evolutionary history. Unless they’ve been explicitly taught about that, they just don’t know.
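A toy sketch of that point in Python (nothing like GPT's actual architecture; the corpus is a made-up placeholder): a bigram model stores only word-transition statistics, so there is nothing in it to "look up" about where its training text came from. Any answer about its own training is just another plausible continuation.

```python
import random
from collections import defaultdict

# Toy bigram "language model": it keeps only word-to-word follow
# frequencies, not the training documents or their sources.
corpus = "the model was trained on text . the model predicts the next word .".split()

follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def generate(start, n=8, seed=0):
    random.seed(seed)
    word, out = start, [start]
    for _ in range(n):
        if word not in follows:
            break
        word = random.choice(follows[word])
        out.append(word)
    return " ".join(out)

# Whatever you "ask", the output is only a statistically plausible
# continuation -- there is no record of the training set to consult.
print(generate("the"))
```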

71

UmdieEcke2 t1_j9a5bby wrote

Yeah, reading things and then using the information is the most deplorable action any actor can do. Thank god humans are above such disgusting behaviour. Imagine the dystopia we would be living in otherwise.

6

gurenkagurenda t1_j9adicl wrote

I cannot see any possible way to define fair use the way you’re saying which wouldn’t have massive unintended effects. If you want to propose that, you’re going to need to be a hell of a lot more specific than “dumping into an AI” when describing what you think should actually be prohibited.

12

gurenkagurenda t1_j9ae1ic wrote

I don’t think it will slow AI at this point, so much as it will concentrate control over AI even more into the hands of well funded, established players. OpenAI has already hired an army of software developer contractors to produce training data for Codex. The same could be done even more cheaply for writers. The technology is proven now, so there’s no risk anymore. We know that you just need the training data.

So the upshot would just be a higher barrier to entry. Training a new model means not only funding the compute, but also paying to create the training set.

8

bairbs t1_j9ag5if wrote

Why not? Just say scraping is fine for research and private models. As soon as you release it to the public or try to monetize it, it's outside of fair use. Just like Nintendo going after passion-project games that are similar in theme, style, and mechanics. You can't just take other people's work and make money off of it.

−5

bairbs t1_j9agew9 wrote

People can do whatever they want with copyright privately. It's when you release the work or try to commercialize it that problems arise. Nothing is stopping AI companies from scraping and training all day. In order to release it, they should compensate the copyright holders.

0

bairbs t1_j9agwyq wrote

Exactly. This is what big tech has been doing already to create legal and ethical data.

The training data is the bottleneck. OpenAI is trying to see if they can pull a fast one by releasing models using copyrighted material

−1

gurenkagurenda t1_j9allgk wrote

How do you define a model? What statistics are you and are you not allowed to scrape and publish? Comments like yours speak to a misunderstanding of what training is with respect to a work, which is simply nudging some numbers according to the statistical relationships within the text. That’s an incredibly broad category of operations.

For example, if I scrape a large number of pages, and analyze the number of incoming and outgoing links, and how those links relate to other links, in order to build a model that lets me match a phrase to a particular webpage and assess its relevance, is that fair use?

If not, you just outlawed search engines. If so, what principle are you using to distinguish that from model training?
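The link-analysis example above can be sketched as a minimal PageRank-style computation. This is a simplified illustration, not any search engine's actual algorithm; the domains, damping factor, and iteration count are made up.

```python
# Minimal PageRank-style link analysis: a "model" built purely from
# the statistical structure of scraped links between pages.
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Each page gets a base share, plus rank flowing in via links.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

ranks = pagerank(links)
# c.com receives links from both other pages, so it ranks highest.
```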

Edit: Gotta love when someone downvotes you in less time than it would take to actually read the comment. Genuine discourse right there.

9

gurenkagurenda t1_j9amk4h wrote

They’re not “pulling a fast one”. There’s no precedent here, and there’s a boatload of lawyers who agree that this is fair use. There are also a number who believe that it won’t be. The courts will have to figure it out, but until then, nobody knows how it will play out.

4

bairbs t1_j9an2db wrote

I'm speaking about using copyrighted art, music, etc. I understand what training is. I also understand the steps companies take to prevent even the perception that they're training on copyrighted material. They either generate pseudo data or purchase entire libraries from stock photo sites. OpenAI and by extension, Microsoft are hoping they can get enough people on their side by saying, "Nothing is copyright if you think about it," so they can do whatever they like.

−5

bairbs t1_j9ao00n wrote

They actually are. The precedent has been to use public domain material (which is why there are so many fine art style GANs), create your own data, pay for data to be created, pay for existing data, or keep the models private. There are plenty more artists and other jobs than lawyers who know this isn't fair use and will be negatively impacted if these companies are allowed to continue this practice.

1

ImSuperHelpful t1_j9apjid wrote

Your argument neglects the business side of the situation which explains the motivations to allow and disallow use in the two scenarios… if I run a content website, a search engine crawling the site so it can generate search results which send traffic to my site is beneficial to both parties, it’s symbiotic.

Alternatively, if I run a content site that an AI company crawls and then uses to train a model which then negates the need for my site to would-be visitors, it’s parasitic.

0

Slippedhal0 t1_j9asf0r wrote

Technically that's not correct; it's just very hard to enforce private use. For example, if you copy a movie, even for private use (except in very specific circumstances), that's illegal, and people have been charged.

That said, the public release point is what I was thinking of anyway.

5

Special_Rice9539 t1_j9at12x wrote

It turns out that chatGPT isn't actually an AI but just has well-trained staff in the background answering your prompts.

5

[deleted] t1_j9atsku wrote

Can we all just agree to stop posting 743.6448 articles a day about ChatGPT?

Alternatively, a filter to screen them all out would be lovely too.

17

gurenkagurenda t1_j9avebb wrote

I'm not neglecting anything. I'm asking for some semblance of precision in defining model training out of fair use. The purpose and character of use, and the effect on the market are already factors in fair use decisions, but that's a lot more complicated of an issue than "AI models can't scrape content." It's specific to the application, and even for ChatGPT specifically, it would be pretty murky.

3

gurenkagurenda t1_j9avgd7 wrote

So you think that search engines should be considered illegal copyright infringement? You say that you're just referring to scraping content, which is a necessary part of how a search engine works. So I'm forced to assume that the answer is yes.

0

Sigma_Atheist t1_j9avhlh wrote

God forbid anyone actually read our articles. After all, what are headlines for?

0

gurenkagurenda t1_j9aw1wh wrote

I think the current volume of ChatGPT articles would actually be tolerable if the media focused on interesting aspects of the subject. But they just keep playing the same four notes over and over again. At least this one isn't "<recognizable name in tech> thinks <opinion> about ChatGPT, but also says <slightly different opinion>".

6

ImSuperHelpful t1_j9awd5k wrote

Except that's what was missing from your original point. But either way, I gave you a starting point: if it's beneficial for both parties and both parties consent (which content site operators do via robots.txt instructions), no one has a problem. In the AI case it's beneficial to the AI creator/owner but harmful to the content owner, since the AI is competing with them by using their content, so it shouldn't be considered fair use.

−1

bairbs t1_j9awnxc wrote

Technically, if you bought the movie, you could copy it for your own use. You just can't share it, which to your point is very hard to enforce for private use outside of the internet.

I'm thinking of fair use when I say "do whatever they want with copyright privately"

1

Twombls t1_j9axe91 wrote

Soon enough we will get to ignore it. This reminds me of self-driving cars: in 2015 Reddit got flooded with weird hypebeasts, and then the tech progress slowed way down.

The hype is reaching unrealistic levels. Subreddits dedicated to ChatGPT and Bing are essentially cults that believe it's sentient at this point. Soon we will hit the trough of disillusionment as this tech gets deployed to the general public and people start finding its faults.

2

bairbs t1_j9axo6n wrote

Lol, you're the one bringing search engines into this for some reason. It's a disingenuous argument and way off base from my point, which is why I'm not responding to it. You've also found all my comments and responded to them aggressively like a good shill.

0

bairbs t1_j9aygil wrote

Lol, if you think these huge companies don't have teams of lawyers advising them on how to legally create models, you're nuts. OpenAI has everything to gain and nothing to lose by trying to challenge the precedents that are already set.

But keep doing your own research. Maybe they'll hire you (or maybe they already do)

0

Lick-a-Leper t1_j9b4eo4 wrote

There is a decent portion of internet articles and opinion pieces written by AI already. It's been happening for a few years. It's interesting that AI is teaching AI to be flawed.

1

gurenkagurenda t1_j9b8gzz wrote

> OpenAI has everything to gain and nothing to lose by trying to challenge the precedents that are already set.

Please cite the case that you're talking about which you claim sets this precedent. Thanks.

3

gurenkagurenda t1_j9b8p1r wrote

>You've also found all my comments and responded to them agressuvely like a good shill

Are you talking about this? You replied to me.

I mean Jesus Christ. Anyway, I'm done trying to explain the concept of unintended consequences to you.

2

gurenkagurenda t1_j9b9muc wrote

>Except that’s what was missing from your original point

Again, it's not missing from my original point, because my original point was to ask how the commenter above was distinguishing these cases. You've given a possible answer. That's an answer to my question, not a rebuttal.

I don't think that answer is very compelling, though. Arguing that an explicitly unreliable chat bot that hallucinates as often as it tells the truth is somehow a competitor to news media etc. is a tall order.

1

littleMAS t1_j9bbixo wrote

Imagine that you were a true genius with an amazing 'photographic' memory that could recount almost everything you ever read. Imagine winning awards, getting a premium 'Ivy League' education, publishing award-winning original essays, and becoming a revered scholar. Now, imagine every publication such as the WSJ coming after you for 'using' their published content to make yourself so smart.

3

zutnoq t1_j9bhoq8 wrote

Search providers like Google don't just show you links, though. They also show you potentially relevant excerpts, so you often don't even need to go to the linked site to get what you were after, and they show previews of images in image search, etc.

Determining exactly where to draw the line on what counts as fair use for things like this is a highly complex and dynamic issue. Web search engines are (by necessity) parasitic as well, but that alone makes them neither bad nor illegal.

Parasitic is also not the "bad" counterpart of symbiotic. In biology, parasitism is just one form of symbiosis; a symbiotic relationship that benefits both parties is called mutualism. Saying "parasitic" alone says nothing about which side(s) benefit. I think exploitative would be a more appropriate word for such relationships.

3

ImSuperHelpful t1_j9biixs wrote

I didn’t present it as a rebuttal, I added important context that was missing from your question that makes the answer much more clear.

And these things are unreliable now, but Microsoft and others are dumping billions of dollars into making them better, and they're doing it for profit. Waiting until they're perfected before fighting the ongoing unfair use of copyrighted content is a surefire way to lose that fight.

1

yUQHdn7DNWr9 t1_j9bn2ue wrote

You don’t need permission to read, memorise, analyse, synthesise, learn from, paraphrase, praise or criticise copyrighted text. You need permission to reproduce it. It isn’t obvious to me that a statistical model would need to reproduce the data it is studying.
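A minimal sketch of that distinction in Python (the sentence is a placeholder, not real article text): statistics derived from a text are lossy, so the statistical model alone cannot reproduce the text it was built from.

```python
from collections import Counter

# Deriving statistics from a text versus reproducing the text.
article = "the court weighed the four fair use factors in the ruling"

stats = Counter(article.split())
# The counts are derived from the text, but the text itself is gone:
# many different articles collapse to the same bag of counts, so the
# original cannot be recovered from the model alone.
print(stats["the"])  # -> 3
```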

1

ImSuperHelpful t1_j9bqacf wrote

Those relevant excerpts and similar features have been pretty detrimental to search click through rates in certain areas (they’re known as “no click” searches in the industry)… but the alternative is to block google bots entirely, which isn’t viable if you’re operating a content site since google has an effective monopoly on search. Also, those features do still link out to the content they’re showing on the SERP, whereas the chat ai doesn’t and gives the appearance that it’s the source of the information.

Your point about vocabulary is fair

2

AMirrorForReddit t1_j9brf47 wrote

...why would you not want to follow arguably the most important technological development of our time? Get used to it, honey, and stop crying. It's only going to get worse. You don't get to pick and choose what is relevant, ok?

I just wish people could grow a brain about what the technology is, and stop pretending it is sentient and shit. That's all I desire. But it's never gonna happen.

1

Cakeking7878 t1_j9d07xc wrote

I actually did get it to generate a random YouTube link that wasn't dead. It had 4 views, and it was some family's vacation video from 2015.

However, I should stress, it was random chance after several tries of me asking. Trying to pull factual or useful information from this is a dumb idea at best and a harmful idea at worst.

2

DrabDonut t1_j9d14nb wrote

> the most important technological development of our time

It’s an early 1900s theory with a 1940s application using a 2020 level of data. It’s not much of a technological development beyond how much we could feed it. AI researchers in my department are kind of pissed that LLMs are getting this much attention because an astonishing number of humans are so dumb they can’t pass the mirror test.

4

AMirrorForReddit t1_j9deh5p wrote

How can it simply be a mirror test when a large part of it presents information the average individual does not have?

Just to be clear, I think I see where you are going with that statement, but I disagree that it is nuanced enough to make much sense.

Yeah, people seem hopeless at understanding what they're working with when it comes to ChatGPT.

0

GhostofDownvotes t1_j9e52yr wrote

Allegedly, I co-authored papers on organic chemistry with my cat. All the links it provided were real, but there was obviously no mention of organic chemistry or my cat co-authoring anything with me.

Plot twist: maybe I did co-author them with my cat and the competition just wiped my brain. 🧐

1