I love seeing all this great progress with LLMs being made more accessible to all, but all of the new efficient models (Dolly, Alpaca, etc.) depend on the Alpaca dataset, which was generated from a GPT3 davinci model, and is subject to non-commercial use. Are there efforts in the community to replicate this dataset for commercial use? This seems to me to be the “secret sauce”: a good quality instruction dataset you can use to “unlock” potential of smaller models.

Comments

You must log in or register to comment.

KungFuScubaMaster t1_jdvmcyw wrote on March 27, 2023 at 3:06 PM

#2,392,869

Just adding, I'm also very interested in this!

big_ol_tender t1_jdvu92g wrote on March 27, 2023 at 3:57 PM

#2,394,008

Thank you for posting this. I’ve raised this issue on a number of threads and even opened an issue on the alpaca repo. Everyone seems to ignore this and I’m worried about downstream issues with these models, and would love an open source alternative ( have been exploring making one myself).

esquire900 t1_jdw02ut wrote on March 27, 2023 at 4:35 PM

#2,394,790

I wondered this as well. Generating one through chatGPT should be relatively cheap (in the range of ~50$ for 50.000k examples?), but I find the commercial use of it dubious. I can't really find any explicit statement on the license of data that comes out of chatGPT, or davinci or similar.

If some users here are interested, might be worth the effort to design some proper prompts, all put in some small amount and let GPT do the churning?

Smallpaul t1_jdw0vx9 wrote on March 27, 2023 at 4:40 PM

#2,394,900

It seems to me that if a researcher uses OpenAI to generate an open source Instruct dataset, and a different corporation takes that dataset and uses it commercially, they are both legally in the clear unless they collude. The entity that is legally in contact with OpenAI has a legitimately non-commercial purpose and the entity doing the commercial work has no relationship with OpenAI.

quitenominal t1_jdw15ao wrote on March 27, 2023 at 4:42 PM

#2,394,931

Replying to esquire900 (#2,394,790)

It's in the terms that you can't use data generated through OpenAI to compete with OpenAI - and I believe they'd be able to argue competition were the trained model to be used commercially.

See section 2.C.iii of https://openai.com/policies/terms-of-use

nullbyte420 t1_jdw1g9t wrote on March 27, 2023 at 4:44 PM

#2,394,969

Replying to esquire900 (#2,394,790)

And also against the terms of use

Taenk t1_jdw3pn3 wrote on March 27, 2023 at 4:58 PM

#2,395,274

https://open-assistant.io / /r/openassistant

esquire900 t1_jdwh7v8 wrote on March 27, 2023 at 6:24 PM

#2,397,088

Replying to quitenominal (#2,394,931)

Yea I was afraid so, just hadn't found it. Thank you for pointing that out :)

sad_dad_is_a_mad_lad t1_jdwhg8a wrote on March 27, 2023 at 6:25 PM

#2,397,126

OpenAI commercial use will not be easily enforced... They used copyright data to train their own models.

JohnyWalkerRed OP t1_jdwjvxy wrote on March 27, 2023 at 6:41 PM

#2,397,441

Replying to big_ol_tender (#2,394,008)

Yeah like the databricks dolly post is funny to me because they are an enterprise software company and dolly is not really useful in the context they operate in. I guess they just wanted to get some publicity.

Looks like openassist, when mature, could enable this. Although it seems the precursor to an Alpaca-like dataset is an RLHF model, which itself needs human-labeled dataset, so that bottleneck needs to be solved too.

Taenk t1_jdwlejh wrote on March 27, 2023 at 6:50 PM

#2,397,642

Replying to JohnyWalkerRed (#2,397,441)

The Open Assistant project is working on that as well.

ninjasaid13 t1_jdx9x3n wrote on March 27, 2023 at 9:28 PM

#2,401,089

Replying to Smallpaul (#2,394,900)

can you even copyright a dataset generated by an AI?

Smallpaul t1_jdxf3u3 wrote on March 27, 2023 at 10:03 PM

#2,401,861

Replying to ninjasaid13 (#2,401,089)

Probably not legally different than a document you created with a word processor.

learn-deeply t1_jdxgxsx wrote on March 27, 2023 at 10:16 PM

#2,402,109

Replying to Smallpaul (#2,401,861)

This isn't correct, at least in the US. AI-generated material is not considered copyrightable unless there has been significant human involvement.

https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence

rshah4 t1_jdxhesh wrote on March 27, 2023 at 10:19 PM

#2,402,188

It’s possible to pay one of the labeling companies for an instruction dataset. Right now most companies aren’t donating 50k+ datasets to the public, but I expect this will change soon.

rshah4 t1_jdxhz3d wrote on March 27, 2023 at 10:23 PM

#2,402,272

Replying to big_ol_tender (#2,394,008)

I agree with the sentiments here and don’t think it’s ok to use some of these datasets that appear to violate OpenAIs terms. I dealt with it by making a funny video: https://youtu.be/31u88EDmIwc

wind_dude t1_jdxqp0v wrote on March 27, 2023 at 11:27 PM

#2,403,600

Replying to Taenk (#2,395,274)

Last I checked they still hadn't opensourced the training data... which is bizarre since they used humans to train it, with all the talk of it being opensource.

wind_dude t1_jdxrcpp wrote on March 27, 2023 at 11:32 PM

#2,403,703

>depend on the Alpaca dataset, which was generated from a GPT3 davinci model, and is subject to non-commercial use

Where do you get that? tatsu-lab/stanford_alpaca is apache 2.0, so you can use it for whatever.

for OpenAI

"""

(c) Restrictions. You may not (i) use the Services in a way that infringes, misappropriates or violates any person’s rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API...

"""

So as far as I'm concerned you are allowed to use the generated dataset for commercial purposes...

Only use might be the licensing on the llama models... but you can train another LLM

ninjasaid13 t1_jdxrw65 wrote on March 27, 2023 at 11:36 PM

#2,403,790

Replying to wind_dude (#2,403,600)

They're going to open source it on April 15 last I heard. They're still gathering with the cut off date at April 12.

kawin_e t1_jdxz4bh wrote on March 28, 2023 at 12:29 AM

#2,404,930

The Stanford Human Preferences dataset (SHP): https://huggingface.co/datasets/stanfordnlp/SHP

It contains pairwise preferences for posts (so tuples (post, response_A, response B)), but you can certainly turn it into an instruction dataset by only considering responses that meet a certain cut-off. I'm currently aware of one academic/industry group that is already doing this.

big_ol_tender t1_jdy0c6t wrote on March 28, 2023 at 12:38 AM

#2,405,146

Replying to sad_dad_is_a_mad_lad (#2,397,126)

100% agree but for those of us working for a company I can’t knowingly open us up to that risk even if the probability is 1%

ninjasaid13 t1_jdy2mgw wrote on March 28, 2023 at 12:55 AM

#2,405,509

Replying to rshah4 (#2,402,188)

>Right now most companies aren’t donating 50k+ datasets to the public, but I expect this will change soon.

see openassistant dataset that will be publicly released on april 15th for open-source.

ninjasaid13 t1_jdy2pqq wrote on March 28, 2023 at 12:56 AM

#2,405,516

Replying to kawin_e (#2,404,930)

>one academic/industry group

which one?

abnormal_human t1_jdyxteq wrote on March 28, 2023 at 5:18 AM

#2,410,193

Model weights are not currently considered to be copyrightable, and there is no DMCA/RIAA/MPAA machinery providing additional consequences for "pirating" them. At least for the moment, it's not a big risk to use LLaMA/Alpaca models for commercial use so long as you have not made an agreement with Facebook not to do it.

The OpenAI policy is about competing models, and comes from the TOS of using their API. Stanford agreed to that TOS, then released the text (which is again, not copyrightable). Random people downloading that data set aren't party to that agreement or bound by it.

I'm sure that Google, Facebook, Amazon, Netflix, etc will be cautious here, but for a random smaller org, this is a risk/benefit tradeoff, not an absolute.

A person who takes a torrented LLaMA and finetunes it using the Stanford data set didn't necessarily engage in any contracts prohibiting that.

The original leaker of LLaMA weights broke the rules. That's about it. Tsk tsk.

[deleted] t1_je5fdej wrote on March 29, 2023 at 3:25 PM

#2,450,367

[removed]

Raywuo t1_jeadybx wrote on March 30, 2023 at 4:06 PM

#2,489,344

Well data generated by GPT cannot be used on a new IA commercially, but what about data generated from an AI that was generated from GPT data? (2 levels of abstraction) haha

[deleted] t1_jeae8re wrote on March 30, 2023 at 4:08 PM

#2,489,425

Replying to Raywuo (#2,489,344)

[deleted]

Raywuo t1_jeal232 wrote on March 30, 2023 at 4:52 PM

#2,491,155

Replying to learn-deeply (#2,402,109)

Exactly, they could never appeal or they would be in contradiction with themselves.

lazybottle t1_jec8i0c wrote on March 30, 2023 at 11:19 PM

#2,505,507

Replying to wind_dude (#2,403,703)

Alpaca is not Apache 2.0

https://huggingface.co/datasets/tatsu-lab/alpaca#licensing-information

> The dataset is available under the Creative Commons NonCommercial (CC BY-NC 4.0).

Edit: I see the source of confusion. https://github.com/tatsu-lab/stanford_alpaca

While the code is released under apache 2.0, the instruct dataset as pointed out by OP is not. One could potentially repro the steps, possibly with human ground truth, and release under a more amenable data license.

wind_dude t1_jec9lb4 wrote on March 30, 2023 at 11:27 PM

#2,505,764

Replying to lazybottle (#2,505,507)

Interesting I didn't realise the dataset was on HF with a different license. The dataset (https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) is also in the code repo which has the apache 2.0 license, so the dataset would be covered by it.