Submitted by [deleted] t3_11v4h5z in MachineLearning
[deleted]
The OpenAI API is billed based on how many tokens you use, isn't that the case? Afaik, the fixed price ($20) is for when you're using it via the UI (ChatGPT Plus, probably capped at one session).
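For anyone trying to budget a large generation run, here's a rough sketch of estimating per-request cost with tiktoken (the per-1K-token price is an assumption based on gpt-3.5-turbo pricing at the time; check the pricing page yourself):

```python
# Rough cost estimate for one API request; the price constant is an assumption.
import tiktoken

PRICE_PER_1K_TOKENS = 0.002  # assumed gpt-3.5-turbo rate in USD, verify before relying on it

def estimate_cost(prompt: str, expected_completion_tokens: int = 256) -> float:
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the chat models
    prompt_tokens = len(enc.encode(prompt))
    total_tokens = prompt_tokens + expected_completion_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

print(estimate_cost("Write a short instruction-response pair about sorting algorithms."))
```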
[deleted]
You can try training ChatGLM. 6B parameters, initially trained on 1T English/Chinese tokens, and completely open source. However, it's already been fine-tuned and had RLHF, but that was optimized for Chinese Q/A. It could use some English work.
Another option is RWKV. There are 7B and 14B models (I would go with the 14B; it's the better of the two) fine-tuned to a context length of 8196 tokens. He plans on increasing the context further too.
He's talking about the playground which is per token https://platform.openai.com/playground
Interesting project! I've seen many suggest that the training data for transfer learning might actually be the biggest thing holding Alpaca back from a ChatGPT-like experience. In other words, although the OpenAI model allows for the creation of a lot of training data, that data might include a lot of low-quality pairs that in an ideal world wouldn't be included. Do you have any plan to increase the quality of your dataset in addition to the size of it?
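If it helps, even simple heuristic filtering might raise the quality quite a bit; here's a minimal sketch (the thresholds, the refusal-string check, and the file name are all illustrative assumptions, not anything from the project):

```python
# Sketch: drop very short, duplicate, or refusal-style pairs from a generated dataset.
import json

def clean_pairs(pairs):
    seen = set()
    cleaned = []
    for p in pairs:
        instr = p.get("instruction", "").strip()
        resp = p.get("response", "").strip()
        if len(instr) < 10 or len(resp) < 20:
            continue  # too short to be a useful training example
        if "as an ai language model" in resp.lower():
            continue  # likely a refusal / boilerplate answer
        key = instr.lower()
        if key in seen:
            continue  # duplicate instruction
        seen.add(key)
        cleaned.append({"instruction": instr, "response": resp})
    return cleaned

with open("generated_pairs.json") as f:  # hypothetical file name
    pairs = json.load(f)
print(len(clean_pairs(pairs)), "pairs kept")
```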
I hear your concern about the LLaMA license. It might be bad advice, but personally I wouldn’t worry about it. This is a very popular model people are using for all sorts of things. The chance they are going to come after you seems to me to be small and my understanding is that it’s sort of uncharted legal ground once you’ve done significant fine tuning. That being said, I’m not a lawyer.
LLaMA is a very powerful model and I would hate for you to put all this effort into creating something that ends up being limited and not clearly better than Alpaca simply because of license fears. If I were you though, I’d go with the 13B version. Still small enough to run on many high end consumer GPUs after quantization while providing significantly better baseline performance than the 7B version.
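If you do go the 13B route, here's a hedged sketch of what loading it in 8-bit might look like (assumes bitsandbytes is installed, a transformers version with LLaMA support, and weights already converted to Hugging Face format at a hypothetical local path):

```python
# Sketch: run LLaMA-13B with int8 quantization on a single high-end consumer GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama-13b-hf")   # hypothetical local path
model = AutoModelForCausalLM.from_pretrained(
    "./llama-13b-hf",
    load_in_8bit=True,   # int8 quantization via bitsandbytes, roughly 14 GB of VRAM
    device_map="auto",   # place layers across available devices automatically
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```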
Do you have a repo, a website, anywhere we can follow the progress of this?
Someone needs to take the plunge and release all of this to the wild, rather than keeping it closed source.
ChatGLM is really good. I sometimes have a hard time distinguishing its Chinese outputs from those of chatgpt.
Sadly its English could use some improvement, as it tends to fall back on Chinese adjectives when it can't find a similar word in English.
I was talking to GPT-4 about this and it said that it seems plausible and can dramatically bring down costs.
It called it "knowledge distillation".
It also mentioned that if we had access to the weights from OpenAI, you could use a process called model compression to scale down the hardware and put it on less powerful GPUs or distributed GPUs (like how render farms work).
This also explains why OpenAI is so cagey about releasing weights: the initial training is where the money sink is, but once the weights are out there are ways to make it run on cheaper hardware.
But I'm wondering, does this mean the smaller model can ONLY respond to the questions you're generating, or will it have latent knowledge outside of just the knowledge transfer? Like, would the smaller model trained with this approach also be able to answer questions on topics that are "restricted" in OpenAI's view, which you couldn't ask it? Or do you absolutely need to get an initial answer for such restricted content for it to be able to produce a response?
Talking about things like writing malicious code or whatnot. I don't plan on doing that, obviously, but I'm curious whether this means these smaller models will basically be totally unrestricted now, or whether, if it's just trained on tons of Python code, it can create said malicious code from scratch without actually being exposed to examples of "how" to make it (since it has a greater knowledge of the underlying principles of Python).
Edit: Okay, I guess it can, per GPT-4.
Damn these things are fascinating.
>Yes, the same concerns can apply to a smaller model being trained from a larger one via knowledge distillation. Knowledge distillation is a technique where the smaller model learns to mimic the larger model's behavior by training on a dataset generated using the larger model's outputs. The smaller model effectively learns from the larger model's knowledge and understanding of language patterns and concepts.
>As a result, the smaller model can also gain latent knowledge about various topics and domains, even if it hasn't been explicitly exposed to specific examples during training. This means that the smaller model could potentially generate undesirable content based on its understanding of the relationships between words and concepts, similar to the larger model.
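To make the mechanics concrete, here's a minimal sketch of that kind of sequence-level distillation: the student never touches the teacher's weights, it's just an ordinary small causal LM fine-tuned on text the teacher produced (the Pythia checkpoint and the example pair are placeholders, not anyone's actual setup):

```python
# Sketch: one optimization step of "distillation" via fine-tuning on teacher outputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
student = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# One (instruction, teacher_answer) pair generated earlier with the teacher API.
example = {
    "instruction": "Explain overfitting briefly.",
    "teacher_answer": "Overfitting is when a model memorizes its training data instead of generalizing.",
}

text = f"### Instruction:\n{example['instruction']}\n### Response:\n{example['teacher_answer']}"
batch = tokenizer(text, return_tensors="pt")
loss = student(**batch, labels=batch["input_ids"]).loss  # standard LM loss on the teacher's text
loss.backward()
optimizer.step()
```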
You know ChatGPT's and GPT-4's licenses forbid using their output data to train competing AI models. What Stanford did was show a proof of concept for their paper, not open-source the model at all.
Careful. That MIT license won't work, I think, thanks to ClosedAI's licenses.
I think the new model gets most of its knowledge from its original model and the training is mostly about how to act like a RLHF model.
If OpenAI wants people to respect their IP they should take the word “open” out of their name. They scraped our data to train their models after all, it’s not like OpenAI themselves aren’t pushing the boundaries of what’s acceptable when it comes to copyright law.
Legally it’s questionable, but ethically speaking I think it’s a fine idea.
For those who wish for LLaMA to become truly open source, please vote on this:
Is OpenAI actually legally allowed to do that? How is using their model for training different from training on copyrighted data which all these models do?
It's probably untested in courts, there's so many loopholes and variables too, what's considered a competing AI model? Companies usually just spew a bunch of stuff in their terms of use, some of which have no legal basis.
I thought their ToS doesn't allow you to train another model on the output of their ChatGPT or GPT-4 or other models.
They can go cry about it.
Can I see your python script? How did you balance your programming language data between the various languages?
I'd say enjoy it while it lasts, at the very least
Would UL2 be a good basis?
The output of AI can't be copyrighted so OpenAI has no say in what somebody does with the output.
All these weird restrictions and regulations seem pretty squirrelly to me.
Maybe this could be "laundered" by doing two separate projects. Have one project gather the 2 million question/response interactions into a big archive, which is then released publicly. Then some other project comes along and uses it for training, without directly interacting with ChatGPT itself.
I'm sure this won't really stop a lawsuit, but the more complicated it can be made for OpenAI to pursue it the less likely they are to go ahead.
You can use NLLB from Facebook to translate the sentences from English to more than 200 other languages. That would be interesting.
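A rough sketch of what that might look like with the distilled 600M NLLB checkpoint and the transformers translation pipeline (language codes follow the FLORES-200 scheme; French is just an example target):

```python
# Sketch: translate English instruction data into another language with NLLB-200.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",   # swap for any of the 200+ supported language codes
)
result = translator("Explain what a neural network is.", max_length=200)
print(result[0]["translation_text"])
```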
Copyright and license terms are different things.
You should try fine-tuning OpenChatKit — it's Apache 2 licensed afaik. Or GPT-NeoX-20B if you have the hardware.
It doesn't matter what the license terms say if they can't be enforced.
So how does that work? Soon a good chunk of the internet will be text written by GPT (including Wikipedia). Does that mean going forward you can't legally use the internet as a data source to train an LLM?
Can you fine-tune OpenAssistant's model? https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b I hear their Pythia language model isn't very good, though. Or maybe GPT-J, which is more consumer-grade friendly.
[removed]
>Careful. That MIT license won't work, I think, thanks to ClosedAI's licenses
Generally, copyright requires human authorship. If the output of an AI model is solely generated by a machine without human input, it may not be eligible for copyright protection and may fall into the public domain.
Not sure how much this is established law.
Anyway, Alpaca says so themselves on their website: https://crfm.stanford.edu/2023/03/13/alpaca.html
Will be interesting to see where lawmakers and courts ultimately land on this, but the current status quo is that AI generated text and images (or any other works) cannot be copyrighted. In other words for now all output is public domain and OpenAI can kick rocks on this. A TOS violation just means you might get banned from using their service lol.
You can also query Codex for free. It might not be as good as ChatGPT, but perhaps still worth trying?
It's what the copyright office said, according to that Midjourney comic that was being registered for copyright.
Since it was created by an AI, the output cannot be registered for copyright, and licensing doesn't hold power over something that's in the public domain.
Train a model that's designated as non-competing but open, then train another model from the output of that that's competing.
Nice 👍 good project, I can't wait to see the result. It would be great to make a torrent of the dataset to avoid unnecessary costs in the future too.
Maybe set up the training code so different foundation models can be plugged in for fine-tuning, and then it's just compute if somebody wants a different starting model.
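Something like this, maybe (the checkpoint names are placeholders, not the OP's actual setup):

```python
# Sketch: keep the base checkpoint as a config value so any causal LM can be swapped in.
from transformers import AutoModelForCausalLM, AutoTokenizer

CONFIG = {
    "base_model": "EleutherAI/gpt-neox-20b",  # could be a local llama-13b-hf, pythia-12b, ...
    "max_length": 2048,
}

def load_base(cfg):
    tok = AutoTokenizer.from_pretrained(cfg["base_model"])
    model = AutoModelForCausalLM.from_pretrained(cfg["base_model"], device_map="auto")
    return tok, model

tok, model = load_base(CONFIG)
```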
Note there are free interfaces to these models such as https://spellbook.scale.com/ . Also note there is a lot of data collected out there already.
No they aren't; they have no claim on transformers, that would be Google Brain, but you don't see Alphabet throwing a sissy fit.
I thought they didn't like ppl using the UI as an API? Hopefully you don't get banned
He's using the actual API.
There are already a couple of high-quality instruction datasets/compilations, like FLAN, that I think should also be mixed in.
Be sure to check the generated dataset for issues. Might require some cleanup like the original did.
Just publish the diff between the original model and the fine-tuned model. That's what a lot of people are doing to avoid any license issues.
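Roughly what that looks like in practice, assuming plain PyTorch state dicts (the file names are hypothetical):

```python
# Sketch: publish only the weight delta; users with the base weights can reconstruct.
import torch

def make_delta(base_state, tuned_state):
    return {k: tuned_state[k] - base_state[k] for k in tuned_state}

def apply_delta(base_state, delta):
    return {k: base_state[k] + delta[k] for k in delta}

base = torch.load("llama-13b-base.pt")        # hypothetical checkpoint paths
tuned = torch.load("llama-13b-finetuned.pt")
torch.save(make_delta(base, tuned), "llama-13b-delta.pt")  # only this file gets published
```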
They are. It's less to do with copyright and more to do with the fact that you signed the T&C before using their system (and then broke them). It's similar to the LinkedIn data scraping case, where the court ruled that the scraping wasn't illegal (nor did it violate copyright) but they still got in trouble (and had to settle) for violating the T&C.
One way around this is to have two parties: one generating and publishing the dataset (doesn't violate the T&C) and another, independent party (who didn't sign the T&C) fine-tuning a model on the dataset.
It's not about copyright
https://www.reddit.com/r/MachineLearning/comments/11v4h5z/-/jct0s11
Are you doing a LoRA or full weights?
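If it's a LoRA, a minimal PEFT setup might look something like this (the base path, target modules, and rank are illustrative assumptions, not the OP's actual choices):

```python
# Sketch: wrap a base causal LM with LoRA adapters so only a tiny fraction of weights train.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./llama-7b-hf")  # hypothetical local path
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all parameters
```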
> I wanted to train Meta's LLaMA model on this data, but considering their license, I'm not sure if that is the best way. Suggestions will be appreciated.
If we ignore OpenAI's licence, is it ok to perhaps ignore Meta's licence as well? Or is that going too far
> The trained model will be open source, under MIT License.
Is the dataset going to be open source as well? So that other people can use it to train other models.
I think you'd be dismayed at how easy it is to enforce these things when you have OpenAI money.
It's most likely enforceable, but even if it's not, they can simply ban OP for doing that. If OP is using their API in any way that's important to him, that's something to consider.
And what does voting there do to make it open source? LeCun already knows that the majority of people don't like this licensing; people have been tweeting that at him since the LLaMA release...
Provide a clear and formal way for the community to express its opinion. You know, as opposed to tweeting at one person who does not have absolute control over Meta AI. Notable people have brought attention to that pull request and it is currently gaining traction.
Which can make a huge difference: GPT-3 + RLHF = ChatGPT.
Couldn't it be possible to set up a large community Q/A repository then? Just crowdsource whatever it outputs and document it collectively.
Very sad the post was deleted and his account along with it. I wonder if he did this intentionally or not.
Very sad, considering his account was deleted. I hope he still finds a way to post his results (if he decides to still go through with the idea).
RWKV is recurrent, right? Why is it token-limited?
[removed]
so, open assistant?
I'm a newbie... but maybe take a look at this model: https://github.com/BlinkDL/RWKV-LM