Comments

yaosio t1_iuz4uo0 wrote

There is an argument that Copilot outputting open source code without credit or the license breaks the license. It will output stuff from open source projects verbatim (I can't find the link, maybe it was on Twitter? I can't back this up), so this isn't a case where the output is merely inspired by the code; it really is the code, and it has to abide by the license.

One solution, without messing with Copilot's training or output, is to have a second program look at the code being generated, check whether it's coming from any of the open source projects on GitHub, and let the user know so they can abide by the license.
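
A minimal sketch of what that second program might do, just to make the idea concrete: normalize whitespace and search the generated snippet against a corpus of known open source code. Everything here is hypothetical (the 40-character threshold, the in-memory corpus); a real matcher over all of GitHub would need fingerprinting/winnowing rather than a naive substring search.

#include <ctype.h>
#include <string.h>

/* Collapse runs of whitespace to single spaces so that formatting
   differences don't hide a verbatim match. */
static void normalize(const char *src, char *dst, size_t cap)
{
    size_t n = 0;
    int in_space = 1;
    for (; *src && n + 1 < cap; src++) {
        if (isspace((unsigned char)*src)) {
            if (!in_space)
                dst[n++] = ' ';
            in_space = 1;
        } else {
            dst[n++] = *src;
            in_space = 0;
        }
    }
    dst[n] = '\0';
}

/* Return 1 if the generated snippet appears verbatim (modulo
   whitespace) in the known open source corpus, so the user can be
   warned to check the license. */
int matches_known_code(const char *generated, const char *corpus)
{
    static char g[4096], c[1 << 20];
    normalize(generated, g, sizeof g);
    normalize(corpus, c, sizeof c);
    return strlen(g) > 40 && strstr(c, g) != NULL; /* skip trivial matches */
}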

36

CapaneusPrime t1_iuzgsq3 wrote

>There is an argument that Copilot outputting open source code without credit or the license breaks the license. It will output stuff from open source projects verbatim (I can't find the link, maybe it was on Twitter? I can't back this up), so this isn't a case where the output is merely inspired by the code; it really is the code, and it has to abide by the license.

There is an argument that this doesn't matter (from GitHub's perspective).

It's already been pretty well established that AI can be trained on copyrighted photos without issue.

That said, image-generating AI can produce works which infringe copyright, so Copilot could certainly produce code covered by a license, which could then put the Copilot user in violation of that license.

That said...

While code is copyrighted, the protections of that copyright aren't absolute.

For instance, I don't think anyone would doubt that there are examples of licensed code which include elements lifted from elsewhere without attribution (Stack Overflow, etc.) for which the author would not have a valid claim of authorship.

But even for people who wrote their own code, 100% from scratch, there are limitations.

If the copied material is a very small element of the whole, it's less likely to be problematic.

If the code represents a standard way of doing something, or if there are only a few ways to accomplish what the code does, it's unlikely to be copyrightable at all.

Now, the vast majority of my programming work is done in purely functional languages; object-oriented languages offer far more room for creative expression. I write a lot of code implementing algorithms, most of them very complex, and I'd be very hard pressed to justify claiming copyright on most of the code I write.

Regardless of how clever I think some of my code may be, I'm also certain that any other competent person implementing the same algorithm would end up with code that is >95% identical to mine.

Honestly, I don't see this lawsuit going anywhere; as I understand it, any copied snippets are fairly short and standard.

21

pseudorandom_user t1_iuzm0en wrote

I wonder if anyone is considering adding a clause to their open source license stating that the code can't be used to train AI language models.
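
For illustration only, such a rider might read something like the following (hypothetical wording I'm inventing here, not reviewed by a lawyer):

    The Software, in whole or in part, may not be used as training
    data for any machine learning model without the prior written
    consent of the copyright holder.

Worth noting that a restriction like this arguably stops the license from being "open source" in the OSI sense, since it discriminates against a field of endeavor.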

1

FranciscoJ1618 t1_iuzn2va wrote

The end of programmers is very close, but I think this was going to happen regardless of AI. Programming communities have always acted against their own self-interest, with some kind of cult mindset, ignoring basic rules of economics, in particular those related to free software (software libre). They'll learn the hard way that more programmers = lower salaries, and that sharing your source code was a very stupid idea.

−14

TiredOldCrow t1_iuzp1y3 wrote

I appreciate that the legendary "fast inverse square root" code from Quake 3 gets produced verbatim, comments and all, if you start with "float Q_rsqrt".

float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                       // evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

	return y;
}
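
Appending a quick self-check to that snippet (my own test scaffold, not from the Quake source) shows how good the approximation is; with the single Newton iteration the relative error stays under about 0.2%:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const float inputs[] = { 0.25f, 1.0f, 2.0f, 10.0f, 12345.0f };
    for (int k = 0; k < 5; k++) {
        float x = inputs[k];
        float approx = Q_rsqrt(x);          // fast approximation above
        float exact  = 1.0f / sqrtf(x);     // libm reference
        printf("x=%10.2f  approx=%.6f  exact=%.6f  rel err=%.4f%%\n",
               x, approx, exact, 100.0f * fabsf(approx - exact) / exact);
    }
    return 0;
}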

I'm interested in how practical it will be for a motivated attacker to poison a code generation model with vulnerable code. I'm also curious about the extent to which these models produce code that only works with outdated and vulnerable dependencies (a problem you'll also run into if you naively copy old Stack Overflow posts). I've recently been working on threat models in natural language generation, but it seems like threat models in code generation are also going to be interesting.
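
As a toy illustration of that failure mode (my own example, not an observed Copilot output): decades of legacy C in the training data contain gets(), which has no bounds check and was removed from the language in C11, so a model imitating old code could plausibly suggest the commented-out line below instead of the bounded replacement.

#include <stdio.h>

int main(void)
{
    char name[32];

    /* gets(name); */  /* legacy pattern: unbounded read, classic
                          buffer overflow; removed in C11 */

    /* bounded replacement */
    if (fgets(name, sizeof name, stdin) != NULL)
        printf("hello, %s", name);
    return 0;
}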

Edit: Not John Carmack!

22

FoundationPM t1_iuzw25q wrote

Source code on GitHub should not be used to train a commercial product. Why would they do that? To benefit humanity? I doubt it. The code doesn't belong to GitHub, even if it's under a GPL, MIT, or any other license.

−5

Takahashi_Raya t1_iv0ammn wrote

>It's already been pretty well established that AI can be trained on copyrighted photos without issue.

It hasn't. That's why Getty has blocked AI-generated images, and why the art world is incredibly hostile toward AI and moving in the same direction as the creators who started this lawsuit. There's a reason universities have law and ethics classes on AI in which students are explicitly told not to train on anything that isn't public domain or licensed.

The fact that facial recognition was trained on millions of photos posted to Facebook still stings in many people's minds. Don't confuse AI startups ignoring ethics and laws with reality.

If this lawsuit is a success, expect the AI tech world to be on fire very quickly. IP lawyers have been frothing at the mouth for a while to get a slice of this.

6

Alikont t1_iv0oi1g wrote

> elements lifted from elsewhere without attribution (Stack Overflow, etc.)

Stack Overflow users agree, as part of the Stack Overflow ToS, that their code snippets are public and that no attribution is required.

9

killver t1_iv0rejo wrote

> It's already been pretty well established that AI can be trained on copyrighted photos without issue.

This is one of the biggest misconceptions in AI at this point. This is just not true.

4

CapaneusPrime t1_iv0t6pr wrote

You missed the point; I'm not making a spurious "whataboutism" claim.

Attribution is required by copyright.

If I take a snippet of code from stackoverflow and put it in my open source project, that's fine.

Nobody is saying it isn't.

What isn't fine is slapping a license on a file which includes that code without specifying that the snippet isn't subject to the license; that's claiming ownership of something which isn't yours and trying to attach a license to it.

Beyond that, you really need to re-read the Stack Overflow ToS, because they don't quite say what you seem to think.

4

Ronny_Jotten t1_iv0vo01 wrote

That decision wasn't about copyrighted photos. It was about Google creating a book search index, which was allowed as fair use, just like their scanning of books for previews is. That's an entirely different situation from Google training an AI to write books for sale that contained snippets or passages from the digitized books.

The latter certainly would not be considered fair use under the reasoning given by the judge in the case. He found that the search algorithm was built with:

> consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders

and that its incorporation into the Google Books system works to increase the sales of the copyrighted books by the authors. None of this can be said about Microsoft's product. It would seem to clearly fail the tests for fair use.

3

fmai t1_iv0xglf wrote

Even if it's judged to be illegal, I hope countries quickly pass new legislation that makes it legal in the future.

4

killver t1_iv0ycz4 wrote

If you trust a random blog, go ahead.

This ruling was for a very specific use case that can't be generalized, and it also only applies to the US, and even then only to a specific circuit. It's also totally unclear how it applies to generative models, which even the blog you cited acknowledges.

The AI community just loves to trust this as it is the easy and convenient thing to do.

Also see a reply to this post you shared: https://medium.com/@brianjleeofcl/this-piece-should-be-retracted-ca740d9a36fe

5

killver t1_iv0z4kt wrote

I'm even more concerned that they send my non-public, proprietary code back to evaluate the responses and save it to improve the models. I still couldn't find a clear statement that they aren't doing that.

−1

waffles2go2 t1_iv1c4z3 wrote

>https://medium.com/@brianjleeofcl/this-piece-should-be-retracted-ca740d9a36fe

Relevant bits; perhaps spouting off with an N=1 isn't the best look...

> In practice, when SCOTUS denies the petition, the ruling made by the relevant appellate court is a legal precedent only within the district (Second) where the circuit court has made its ruling. This means that a different court—say, the Ninth, which includes Silicon Valley—could go ahead and issue a ruling that directly opposes that of the Second. At this point, it becomes more likely that SCOTUS would grant cert since it would be a problem that under the same federal legal code, two opposing versions of case law could exist; after which the court would hear arguments and then finally issue a decision. Until that hypothetical occurs, there is no precedent set by a SCOTUS decision to note in this matter.

So a programmer who doesn't understand the law should take a harder look at what they post on Reddit, unless they like being totally owned...

3

CapaneusPrime t1_iv1lheh wrote

>That decision wasn't about copyrighted photos.

And every knowledgeable person agrees this protects images as well.

Training a generative AI does not adversely impact the rights of artists.

This is really transformative fair use.

0

themrzmaster t1_iv1r1vk wrote

Your knowledge comes from open source projects. That doesn't mean you're violating licenses when you write "new" code.

0

farmingvillein t1_iv1x778 wrote

> It hasn't. That's why Getty has blocked AI-generated images

You're right that OP is wrong (re: whether this is a settled legal issue)... but let's not pretend that Getty doing so has to do with anything other than attempted revenue maximization on their part.

Successful, prolific AI art devalues their portfolio of images, and they know that.

2

farmingvillein t1_iv1ye88 wrote

Which seems like a solvable, albeit terribly painful, problem?

If this is the direction things end up going, honestly it will ultimately only work massively in favor of OpenAI (and a small number of very well-funded competitors), as it will create a very, very painful barrier to entry.

1

Takahashi_Raya t1_iv2c6qi wrote

Getty and Shutterstock partnered with OpenAI (creators of DALL-E) and with BRIA, both companies whose training data has been confirmed to be ethically sourced, containing only public domain images and images they hold licenses to.

The ones under scrutiny from the community when it comes to image generation are Midjourney, Stable Diffusion, and NovelAI, due to them not adhering to ethics in AI data usage.

OpenAI is mentioned in the main Copilot topic as well, because Microsoft uses their Codex model as part of Copilot, but that doesn't change the fact that DALL-E is ethically trained.

1

race2tb t1_iv3xozn wrote

Not going to matter in the longer term. These early models are still in their generative infancy. In the future you won't be able to tell at all where the code came from unless you ask for it verbatim from a known example.

2

chatterbox272 t1_iv4kbwb wrote

>It will output stuff from open source projects verbatim

I've seen this too, but only in pretty artificial circumstances: usually in empty projects, with some combination of exact function names/signatures, detailed comments, or trivially easy blocks that will almost never be unique. I've never seen an example posted in context (in an existing project with its own conventions) where this occurred.

>One solution without messing with co-pilot training or output is to have a second program look at code being generated to see if it's coming from any of the open source projects on gitbub and let the user know so they can abide by the license.

This kinda exists: there is a setting to block suggestions matching public code, although reportedly it isn't especially effective (then again, I've only seen this discussed by people who also report frequent copy-paste behaviour, something I've not been able to replicate in normal use).

2

multiedge t1_iwiakg3 wrote

There's also the problem of this being used to scam Microsoft. I mean, I could license my code and publish it on GitHub, then create several other GitHub accounts and reuse that licensed code so it gets picked up by Copilot. I would then have legal grounds to sue them for using my licensed code.

1
