Comments

Either-Job-341 t1_jcrhysc wrote

The OpenAI API charges based on how many tokens you use, doesn't it? Afaik, the fixed price ($20) is only for using it via the UI (probably max one session).
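
For anyone budgeting the dataset generation, here's a rough sketch of how the per-token cost adds up, using tiktoken to count tokens. The per-1K-token price below is a placeholder assumption, not a quote of OpenAI's actual pricing, and it only counts the prompt side:

```python
import tiktoken

# Placeholder assumption; check OpenAI's pricing page for the real per-1K-token rate.
PRICE_PER_1K_TOKENS = 0.002  # USD

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def estimate_prompt_cost(prompts: list[str]) -> float:
    """Rough cost estimate for the prompt side only; completions add more tokens."""
    total_tokens = sum(len(enc.encode(p)) for p in prompts)
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

print(estimate_prompt_cost(["Explain knowledge distillation in one paragraph."]))
```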

24

MysteryInc152 t1_jcrnqc8 wrote

You can try training ChatGLM: 6B parameters, initially trained on 1T English/Chinese tokens, and completely open source. However, it's already been fine-tuned and put through RLHF, optimized for Chinese Q&A, so it could use some English work.

Another option is RWKV. There are 7B and 14B models (I would go with the 14B, it's the better of the two) fine-tuned to a context length of 8196 tokens, and the author plans on increasing the context further.
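
If you want to poke at ChatGLM before committing, here's a minimal loading sketch with Hugging Face transformers. The `chat()` helper comes from the model's own remote code, so treat the exact call as an assumption that may vary by revision:

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code is required because ChatGLM ships its own modeling code.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model.eval()

# chat() is part of ChatGLM's custom code; signature may differ between revisions.
response, history = model.chat(tokenizer, "Explain what instruction tuning is.", history=[])
print(response)
```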

17

A1-Delta t1_jcrpd05 wrote

Interesting project! I've seen many suggest that the training data for transfer learning might actually be the biggest thing holding Alpaca back from a ChatGPT-like experience. In other words, although the OpenAI model allows for the creation of a lot of training data, that data might include a lot of low-quality pairs that in an ideal world wouldn't be included. Do you have any plan to increase the quality of your dataset in addition to its size?

I hear your concern about the LLaMA license. It might be bad advice, but personally I wouldn't worry about it. This is a very popular model that people are using for all sorts of things. The chance they'll come after you seems small to me, and my understanding is that it's sort of uncharted legal ground once you've done significant fine-tuning. That being said, I'm not a lawyer.

LLaMA is a very powerful model, and I would hate for you to put all this effort into creating something that ends up limited and not clearly better than Alpaca simply because of license fears. If I were you, though, I'd go with the 13B version: still small enough to run on many high-end consumer GPUs after quantization, while providing significantly better baseline performance than the 7B version.
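
On the quantization point, a minimal sketch of loading a converted LLaMA-13B checkpoint in 8-bit with transformers and bitsandbytes. The checkpoint path is a placeholder, assuming you've already converted the weights to the Hugging Face format yourself:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-13b-hf"  # placeholder: locally converted LLaMA-13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # 8-bit weights via bitsandbytes; needs bitsandbytes + accelerate
    device_map="auto",   # spread layers across available GPU(s) and CPU
)
```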

20

yehiaserag t1_jcru1ty wrote

Do you have a repo or a website, anywhere we can follow the progress of this?

3

noobgolang t1_jcrvlfl wrote

Someone needs to take the plunge and release all of this to the wild, rather than keeping it closed source.

6

Craiglbl t1_jcrxjy2 wrote

ChatGLM is really good. I sometimes have a hard time distinguishing its Chinese outputs from those of ChatGPT.

Sadly, its English could use some improvement, as it often uses Chinese adjectives when it can't find a similar word in English.

8

RoyalCities t1_jcrxlvr wrote

I was talking to GPT-4 about this and it said that it seems plausible and can dramatically bring down costs.

It called it "knowledge distillation".

It also mentioned that if we had access to the weights from OpenAI, you could use a process called model compression to scale the hardware requirements down and run it on less powerful GPUs or distributed GPUs (like how render farms work).

This also explains why OpenAI is so cagey about releasing weights: the initial training cost is where the money sink is, but once the weights are out there are ways to make it run on cheaper hardware.

But I'm wondering: does this mean the smaller model can ONLY respond to the questions you're generating, or will it have latent knowledge outside of just the knowledge transfer? Would a smaller model trained with this approach also be able to answer questions on topics that are "restricted" in OpenAI's view, which you couldn't ask the original, or do you absolutely need an initial answer for such restricted content before it can produce a response?

I'm talking about things like writing malicious code or whatnot. I don't plan on doing that, obviously, but I'm curious whether this means these smaller models will basically be totally unrestricted now, or whether a model trained on tons of Python code could create such malicious code from scratch without ever being shown examples of "how" to make it (since it has a greater knowledge of the underlying principles of Python).

Edit: Okay, guess it can, per GPT-4.

Damn these things are fascinating.

>Yes, the same concerns can apply to a smaller model being trained from a larger one via knowledge distillation. Knowledge distillation is a technique where the smaller model learns to mimic the larger model's behavior by training on a dataset generated using the larger model's outputs. The smaller model effectively learns from the larger model's knowledge and understanding of language patterns and concepts.

>As a result, the smaller model can also gain latent knowledge about various topics and domains, even if it hasn't been explicitly exposed to specific examples during training. This means that the smaller model could potentially generate undesirable content based on its understanding of the relationships between words and concepts, similar to the larger model.
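
For what it's worth, here's a minimal sketch of the sequence-level distillation GPT-4 is describing: fine-tune a small open model directly on (instruction, response) pairs generated by the larger one. The student model name and the JSONL file are placeholders, not anything from this thread:

```python
import json
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "EleutherAI/pythia-1.4b"  # hypothetical choice; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(student_name)
tokenizer.pad_token = tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained(student_name)

# Each line: {"instruction": "...", "response": "..."} produced by the teacher model.
pairs = [json.loads(line) for line in open("teacher_pairs.jsonl")]
texts = [f"### Instruction:\n{p['instruction']}\n### Response:\n{p['response']}" for p in pairs]

def collate(batch):
    enc = tokenizer(batch, truncation=True, max_length=512, padding=True, return_tensors="pt")
    # Standard causal LM loss on the teacher's text; padding positions are ignored.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(texts, batch_size=4, shuffle=True, collate_fn=collate)
optim = torch.optim.AdamW(student.parameters(), lr=2e-5)

student.train()
for batch in loader:
    loss = student(**batch).loss  # the student learns to imitate the teacher's outputs
    loss.backward()
    optim.step()
    optim.zero_grad()
```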

3

ThatInternetGuy t1_jcs253z wrote

You know the ChatGPT and GPT-4 terms forbid using their output data to train competing AI models. What Stanford did was show a proof of concept for their paper, not open-source the model at all.

25

ReasonablyBadass t1_jcs32ea wrote

Careful. That MIT license won't work, I think, thanks to ClosedAI's licences.

6

frownGuy12 t1_jcsfnh7 wrote

If OpenAI wants people to respect their IP, they should take the word "open" out of their name. They scraped our data to train their models, after all; it's not like OpenAI themselves aren't pushing the boundaries of what's acceptable when it comes to copyright law.

Legally it’s questionable, but ethically speaking I think it’s a fine idea.

53

Anjz t1_jcsktsf wrote

It's probably untested in court, and there are so many loopholes and variables too: what counts as a competing AI model? Companies usually just spew a bunch of stuff in their terms of use, some of which has no legal basis.

19

raduqq t1_jcslcbz wrote

I thought their ToS doesn't allow you to train another model on the output of ChatGPT, GPT-4, or their other models.

1

Long19980 t1_jcsm4ni wrote

Can I see your Python script? How did you balance your programming-language data between the various languages?

1

FaceDeer t1_jcsot55 wrote

All these weird restrictions and regulations seem pretty squirrelly to me.

Maybe this could be "laundered" by doing two separate projects: have one project gather the 2 million question/response interactions into a big archive, which is then released publicly. Then some other project comes along and uses it for training, without ever directly interacting with ChatGPT itself.

I'm sure this wouldn't really stop a lawsuit, but the more complicated it can be made for OpenAI to pursue one, the less likely they are to go ahead.

5

Seromelhor t1_jcsov3a wrote

You can use NLLB from Facebook to translate the sentences from English to more than 200 other languages. That would be interesting.
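
A minimal sketch of what that could look like with the distilled NLLB checkpoint on the Hugging Face Hub. NLLB uses FLORES-200 language codes, and the target language here is just an example:

```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",   # FLORES-200 code for English
    tgt_lang="spa_Latn",   # example target: Spanish; swap for any of the 200+ languages
)

result = translator("Explain what a language model is.", max_length=200)
print(result[0]["translation_text"])
```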

1

lxe t1_jcsqmdi wrote

You should try fine-tuning OpenChatKit; it's Apache 2 licensed afaik. Or GPT-NeoX-20B if you have the hardware.

1

asraniel t1_jcsr22m wrote

So how does that work? Soon a good chunk of the internet will be text written by GPT (including Wikipedia). Does that mean that, going forward, you can't legally use the internet as a data source to train an LLM?

5

ninjasaid13 t1_jcsth4w wrote

>Careful. That MIT license won't work, I think, thanks to ClosedAI's licences.

Generally, copyright requires human authorship. If the output of an AI model is generated solely by a machine without human input, it may not be eligible for copyright protection and may instead fall into the public domain.

4

objectdisorienting t1_jcsu3xk wrote

It will be interesting to see where lawmakers and courts ultimately land on this, but the current status quo is that AI-generated text and images (or any other works) cannot be copyrighted. In other words, for now all output is public domain and OpenAI can kick rocks on this. A TOS violation just means you might get banned from using their service lol.

1

Stock-Nebula2185 t1_jcsuzb5 wrote

You can also query Codex for free. It might not be as good as ChatGPT, but perhaps still worth trying?
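
For reference, a sketch of querying Codex with the (pre-1.0) openai Python client. The model name and the free beta access are assumptions that may no longer hold:

```python
import openai

openai.api_key = "sk-..."  # your API key

# code-davinci-002 was the Codex beta model; availability is an assumption.
completion = openai.Completion.create(
    model="code-davinci-002",
    prompt="# Python function that reverses a linked list\n",
    max_tokens=200,
    temperature=0,
)
print(completion.choices[0].text)
```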

2

ninjasaid13 t1_jcsv2oi wrote

That's what the copyright office said regarding that Midjourney comic someone tried to register for copyright.

Since it was created by an AI, the output cannot be registered for copyright, and a license doesn't hold power over something that's in the public domain.

2

hapliniste t1_jcsxpna wrote

Nice 👍 good project, I'm eager to see the result. It would also be great to make a torrent of the dataset, to avoid unnecessary costs in the future.

1

baffo32 t1_jcsy5mb wrote

Maybe set up the training code so different foundation models can be plugged in for fine-tuning; then it's just compute if somebody wants a different starting model.

Note there are free interfaces to these models, such as https://spellbook.scale.com/. Also note there is a lot of data collected out there already.
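
Something like this keeps the foundation model pluggable. The config keys and model name below are hypothetical, just to show the idea:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical config: swap the base model without touching the training code.
config = {
    "base_model": "EleutherAI/pythia-2.8b",  # could be any causal LM on the Hub
    "max_length": 512,
}

def load_base(cfg):
    tokenizer = AutoTokenizer.from_pretrained(cfg["base_model"])
    model = AutoModelForCausalLM.from_pretrained(cfg["base_model"])
    return tokenizer, model

tokenizer, model = load_base(config)
```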

0

starstruckmon t1_jct06xj wrote

  • There are already a couple of high-quality instruction datasets/compilations, like FLAN, which I think should also be mixed in (see the sketch after this list).

  • Be sure to check the generated dataset for issues. It might require some cleanup, like the original did.
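
On the first point, a minimal sketch of mixing an existing instruction dataset into the generated one with the `datasets` library. File names and columns are placeholders; both files are assumed to share the same schema:

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder paths; swap in your generated pairs and whatever FLAN-style subset you use.
generated = load_dataset("json", data_files="generated_pairs.jsonl", split="train")
flan_style = load_dataset("json", data_files="flan_subset.jsonl", split="train")

# Concatenation requires matching columns; shuffle so batches mix both sources.
mixed = concatenate_datasets([generated, flan_style]).shuffle(seed=42)
print(mixed)
```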

3

starstruckmon t1_jct0s11 wrote

They are. It's less to do with copyright and more to do with the fact that you signed the T&C before using their system (and then broke them). It's similar to the LinkedIn data-scraping case, where the court ruled that the scraping wasn't illegal (nor did it violate copyright), but they still got in trouble (and had to settle) because of violating the T&C.

One way around this is to have two parties: one generating and publishing the dataset (which doesn't violate the T&C), and another independent party (who never signed the T&C) fine-tuning a model on the dataset.

6

wywywywy t1_jct2wjz wrote

Are you doing a LoRA or full weights?

> I wanted to train Meta's LLaMA model on this data, but considering their license, I'm not sure if that is the best way. Suggestions will be appreciated.

If we ignore OpenAI's licence, is it OK to perhaps ignore Meta's licence as well? Or is that going too far?

> The trained model will be open source, under MIT License.

Is the dataset going to be open source as well, so that other people can use it to train other models?
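
On the LoRA question above, the difference in practice looks roughly like this (a sketch with the peft library; the checkpoint path and target module names are assumptions for a LLaMA-style model):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/llama-13b-hf")  # placeholder path

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical choice for LLaMA-style attention
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices get gradients
```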

1

[deleted] OP t1_jct6a1x wrote

Provide a clear and formal way for the community to express its opinion. You know, as opposed to tweeting at one person who does not have absolute control over Meta AI. Notable people have brought attention to that pull request and it is currently gaining traction.

2