Comments

yaosio t1_iuz4uo0 wrote

There is an argument that Copilot outputting open source code without credit or the license breaks the license. It will output stuff from open source projects verbatim (I can't find the link, maybe it was on Twitter? I can't back this up), so this isn't a case where the output is merely inspired by the code; it really is the code, and it has to abide by the license.

One solution, without messing with Copilot's training or output, is to have a second program look at the code being generated, check whether it's coming from any of the open source projects on GitHub, and let the user know so they can abide by the license.
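
A minimal sketch of what that second program might do, just to make the idea concrete: normalize whitespace and search the generated snippet against a corpus of known open source code. Everything here is hypothetical (the 40-character threshold, the in-memory corpus); a real matcher over all of GitHub would need fingerprinting/winnowing rather than a naive substring search.

#include <ctype.h>
#include <string.h>

/* Collapse runs of whitespace to single spaces so that formatting
   differences don't hide a verbatim match. */
static void normalize(const char *src, char *dst, size_t cap)
{
    size_t n = 0;
    int in_space = 1;
    for (; *src && n + 1 < cap; src++) {
        if (isspace((unsigned char)*src)) {
            if (!in_space)
                dst[n++] = ' ';
            in_space = 1;
        } else {
            dst[n++] = *src;
            in_space = 0;
        }
    }
    dst[n] = '\0';
}

/* Return 1 if the generated snippet appears verbatim (modulo
   whitespace) in the known open source corpus, so the user can be
   warned to check the license. */
int matches_known_code(const char *generated, const char *corpus)
{
    static char g[4096], c[1 << 20];
    normalize(generated, g, sizeof g);
    normalize(corpus, c, sizeof c);
    return strlen(g) > 40 && strstr(c, g) != NULL; /* skip trivial matches */
}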

36

CapaneusPrime t1_iuzgsq3 wrote

>There is an argument that Copilot outputting open source code without credit or the license breaks the license. It will output stuff from open source projects verbatim (I can't find the link, maybe it was on Twitter? I can't back this up), so this isn't a case where the output is merely inspired by the code; it really is the code, and it has to abide by the license.

There is an argument that this doesn't matter (from GitHub's perspective).

It's already been pretty well established that AI can be trained on copyrighted photos without issue.

That said, image-generating AI can produce works which infringe copyright, so Copilot could certainly produce code covered by a license, which could then put the Copilot user in violation of that license.

That said...

While code is copyrighted, the protections of that copyright aren't absolute.

For instance, I don't think anyone would doubt that there are examples of licensed code which include elements lifted from elsewhere without attribution (Stack Overflow, etc.) for which the author would not have a valid claim of authorship.

But even for people who wrote their own code, 100% from scratch, there are limitations.

If the copied material is a very small element of the whole, it's less likely to be problematic.

If the code represents a standard way of doing something, or if there are only a few ways to accomplish what the code does, it's unlikely to be copyrightable at all.

Now, the vast majority of my programming work is done in purely functional languages; object-oriented languages offer far more room for creative expression. I write a lot of code implementing algorithms, most of them very complex, and I'd be very hard pressed to justify claiming copyright on most of the code I write.

Regardless of how clever I think some of my code may be, I'm also certain that any other competent person implementing the same algorithm would end up with code that is >95% identical to mine.

Honestly, I don't see this lawsuit going anywhere; as I understand it, any copied snippets are fairly short and standard.

21

pseudorandom_user t1_iuzm0en wrote

I wonder if anyone is considering adding a clause to their open source license stating that the code can't be used to train AI language models.
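
For illustration only, such a rider might read something like the following (hypothetical wording I'm inventing here, not reviewed by a lawyer):

    The Software, in whole or in part, may not be used as training
    data for any machine learning model without the prior written
    consent of the copyright holder.

Worth noting that a restriction like this arguably stops the license from being "open source" in the OSI sense, since it discriminates against a field of endeavor.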

1

FranciscoJ1618 t1_iuzn2va wrote

The end of programmers is very close, but I think this was going to happen regardless of AI. Programming communities have always acted against their own self-interest, with some kind of cult mindset, ignoring basic rules of economics, in particular those related to free software (software libre). They'll learn the hard way that more programmers = lower salaries, and that sharing your source code was a very stupid idea.

−14

TiredOldCrow t1_iuzp1y3 wrote

I appreciate that the legendary "fast inverse square root" code from Quake 3 gets produced verbatim, comments and all, if you start with "float Q_rsqrt".

float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                       // evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

	return y;
}
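
Appending a quick self-check to that snippet (my own test scaffold, not from the Quake source) shows how good the approximation is; with the single Newton iteration the relative error stays under about 0.2%:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const float inputs[] = { 0.25f, 1.0f, 2.0f, 10.0f, 12345.0f };
    for (int k = 0; k < 5; k++) {
        float x = inputs[k];
        float approx = Q_rsqrt(x);          // fast approximation above
        float exact  = 1.0f / sqrtf(x);     // libm reference
        printf("x=%10.2f  approx=%.6f  exact=%.6f  rel err=%.4f%%\n",
               x, approx, exact, 100.0f * fabsf(approx - exact) / exact);
    }
    return 0;
}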

I'm interested in how practical it will be for a motivated attacker to poison a code generation model with vulnerable code. I'm also curious about the extent to which these models produce code that only works with outdated and vulnerable dependencies (a problem you'll also run into if you naively copy old Stack Overflow posts). I've recently been working on threat models in natural language generation, but it seems like threat models in code generation are also going to be interesting.
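
As a toy illustration of that failure mode (my own example, not an observed Copilot output): decades of legacy C in the training data contain gets(), which has no bounds check and was removed from the language in C11, so a model imitating old code could plausibly suggest the commented-out line below instead of the bounded replacement.

#include <stdio.h>

int main(void)
{
    char name[32];

    /* gets(name); */  /* legacy pattern: unbounded read, classic
                          buffer overflow; removed in C11 */

    /* bounded replacement */
    if (fgets(name, sizeof name, stdin) != NULL)
        printf("hello, %s", name);
    return 0;
}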

Edit: Not John Carmack!

22

FoundationPM t1_iuzw25q wrote

Source code on GitHub should not be used to train a commercial product. Why would they do that? To benefit humanity? I doubt it. The code doesn't belong to GitHub, even if it's under a GPL, MIT, or any other license.

−5

Takahashi_Raya t1_iv0ammn wrote

>It's already been pretty well established that AI can be trained on copyrighted photos without issue.

It hasn't. That's why Getty has blocked AI-generated images, and why the art world is incredibly hostile toward AI and moving in the same direction as the creators who started this lawsuit. There's a reason universities have law and ethics classes on AI in which students are explicitly told not to train on anything that isn't public domain or licensed.

The fact that facial recognition was trained on millions of photos posted to Facebook still stings in many people's minds. Don't confuse AI startups ignoring ethics and laws with reality.

If this lawsuit is a success, expect the AI tech world to be on fire very quickly. IP lawyers have been frothing at the mouth for a while to get a slice of this.

6

Alikont t1_iv0oi1g wrote

> elements lifted from elsewhere without attribution (Stack Overflow, etc.)

Stack Overflow users agree, as part of the Stack Overflow ToS, that their code snippets are public and that no attribution is required.

9

killver t1_iv0rejo wrote

> It's already been pretty well established that AI can be trained on copyrighted photos without issue.

This is one of the biggest misconceptions in AI at this point. This is just not true.

4

CapaneusPrime t1_iv0t6pr wrote

You missed the point; I'm not making a spurious "whataboutism" claim.

Attribution is required by copyright.

If I take a snippet of code from stackoverflow and put it in my open source project, that's fine.

Nobody is saying it isn't.

What isn't fine is slapping a license on a file which includes that code without specifying that the snippet isn't subject to the license; that's claiming ownership of something which isn't yours and trying to attach a license to it.

Beyond that, you really need to re-read the Stack Overflow ToS, because they don't quite say what you seem to think.

4

Ronny_Jotten t1_iv0vo01 wrote

That decision wasn't about copyrighted photos. It was about Google creating a book search index, which was allowed as fair use, just like their scanning of books for previews is. That's an entirely different situation from Google training an AI to write books for sale that contained snippets or passages from the digitized books.

The latter certainly would not be considered fair use under the reasoning given by the judge in the case. He found that the search algorithm was built with:

> consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders

and that its incorporation into the Google Books system works to increase the sales of the copyrighted books by the authors. None of this can be said about Microsoft's product. It would seem to clearly fail the tests for fair use.

3

fmai t1_iv0xglf wrote

Even if it's judged to be illegal, I hope countries quickly pass new legislation that makes it legal in the future.

4

killver t1_iv0ycz4 wrote

If you trust a random blog, go ahead.

This ruling was for a very specific use case that can't be generalized, and it also only applies to the US, and even then only to a specific circuit. It's also totally unclear how it applies to generative models, which even the blog you cited acknowledges.

The AI community just loves to trust this as it is the easy and convenient thing to do.

Also see a reply to this post you shared: https://medium.com/@brianjleeofcl/this-piece-should-be-retracted-ca740d9a36fe

5

killver t1_iv0z4kt wrote

I'm even more concerned that they send my non-public, proprietary code back to evaluate the responses and save it to improve the models. I still couldn't find a clear statement that they aren't doing that.

−1

waffles2go2 t1_iv1c4z3 wrote

>https://medium.com/@brianjleeofcl/this-piece-should-be-retracted-ca740d9a36fe

Relevant bits; perhaps spouting off with an N=1 isn't the best look...

> In practice, when SCOTUS denies the petition, the ruling made by the relevant appellate court is a legal precedent only within the district (Second) where the circuit court has made its ruling. This means that a different court—say, the Ninth, which includes Silicon Valley—could go ahead and issue a ruling that directly opposes that of the Second. At this point, it becomes more likely that SCOTUS would grant cert since it would be a problem that under the same federal legal code, two opposing versions of case law could exist; after which the court would hear arguments and then finally issue a decision. Until that hypothetical occurs, there is no precedent set by a SCOTUS decision to note in this matter.

So a programmer who doesn't understand the law should take a harder look at what they post on Reddit, unless they like being totally owned...

3

CapaneusPrime t1_iv1lheh wrote

>That decision wasn't about copyrighted photos.

And every knowledgeable person agrees this protects images as well.

Training a generative AI does not adversely impact the rights of artists.

This is really transformative fair use.

0

themrzmaster t1_iv1r1vk wrote

Your knowledge comes from open source projects. That doesn't mean you're violating licenses when you write "new" code.

0

farmingvillein t1_iv1x778 wrote

> It hasn't. That's why Getty has blocked AI-generated images

You're right that OP is wrong (re: whether this is a settled legal issue)... but let's not pretend that Getty doing so has to do with anything other than attempted revenue maximization on their part.

Successful, prolific AI art devalues their portfolio of images, and they know that.

2

farmingvillein t1_iv1ye88 wrote

Which seems like a solvable, albeit terribly painful, problem?

If this is the direction things end up going, honestly it will ultimately only work massively in favor of OpenAI (and a small number of very well-funded competitors), as it will create a very, very painful barrier to entry.

1

Takahashi_Raya t1_iv2c6qi wrote

Getty and Shutterstock partnered with OpenAI (creators of DALL-E) and with BRIA, both companies whose training data has been confirmed to be ethically sourced, containing only public domain images and images they hold licenses to.

The ones under scrutiny from the community when it comes to image generation are Midjourney, Stable Diffusion, and NovelAI, due to them not adhering to ethics in AI data usage.

OpenAI is mentioned in the main Copilot topic as well, because Microsoft uses their Codex model as part of Copilot, but that doesn't change the fact that DALL-E is ethically trained.

1

race2tb t1_iv3xozn wrote

Not going to matter in the longer term. These early models are still in their generative infancy. In the future you won't be able to tell at all where the code came from unless you ask for it verbatim from a known example.

2

chatterbox272 t1_iv4kbwb wrote

>It will output stuff from open source projects verbatim

I've seen this too, but only in pretty artificial circumstances: usually in empty projects, with some combination of exact function names/signatures, detailed comments, or trivially easy blocks that will almost never be unique. I've never seen an example posted in context (in an existing project with its own conventions) where this occurred.

>One solution without messing with co-pilot training or output is to have a second program look at code being generated to see if it's coming from any of the open source projects on gitbub and let the user know so they can abide by the license.

This kinda exists: there is a setting to block suggestions matching public code, although reportedly it isn't especially effective (then again, I've only seen this discussed by people who also report frequent copy-paste behaviour, something I've not been able to replicate in normal use).

2

multiedge t1_iwiakg3 wrote

There's also the problem of this being used to scam Microsoft. I mean, I could license my code and publish it on GitHub, then create several other GitHub accounts and reuse that licensed code so it gets picked up by Copilot. I would then have legal grounds to sue them for using my licensed code.

1
