Comments

Either-Job-341 t1_jcrhysc wrote

The OpenAI API charges based on how many tokens you use, doesn't it? Afaik, the fixed price ($20) is only for using it via the UI (probably max one session).
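
For anyone budgeting the dataset generation, here's a rough sketch of how the per-token cost adds up, using tiktoken to count tokens. The per-1K-token price below is a placeholder assumption, not a quote of OpenAI's actual pricing, and it only counts the prompt side:

```python
import tiktoken

# Placeholder assumption; check OpenAI's pricing page for the real per-1K-token rate.
PRICE_PER_1K_TOKENS = 0.002  # USD

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def estimate_prompt_cost(prompts: list[str]) -> float:
    """Rough cost estimate for the prompt side only; completions add more tokens."""
    total_tokens = sum(len(enc.encode(p)) for p in prompts)
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

print(estimate_prompt_cost(["Explain knowledge distillation in one paragraph."]))
```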

24

MysteryInc152 t1_jcrnqc8 wrote

You can try training ChatGLM: 6B parameters, initially trained on 1T English/Chinese tokens, and completely open source. However, it's already been fine-tuned and put through RLHF, optimized for Chinese Q&A, so it could use some English work.

Another option is RWKV. There are 7B and 14B models (I would go with the 14B, it's the better of the two) fine-tuned to a context length of 8196 tokens, and the author plans on increasing the context further.
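
If you want to poke at ChatGLM before committing, here's a minimal loading sketch with Hugging Face transformers. The `chat()` helper comes from the model's own remote code, so treat the exact call as an assumption that may vary by revision:

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code is required because ChatGLM ships its own modeling code.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model.eval()

# chat() is part of ChatGLM's custom code; signature may differ between revisions.
response, history = model.chat(tokenizer, "Explain what instruction tuning is.", history=[])
print(response)
```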

17

A1-Delta t1_jcrpd05 wrote

Interesting project! I've seen many suggest that the training data for transfer learning might actually be the biggest thing holding Alpaca back from a ChatGPT-like experience. In other words, although the OpenAI model allows for the creation of a lot of training data, that data might include a lot of low-quality pairs that in an ideal world wouldn't be included. Do you have any plan to increase the quality of your dataset in addition to its size?

I hear your concern about the LLaMA license. It might be bad advice, but personally I wouldn't worry about it. This is a very popular model that people are using for all sorts of things. The chance they'll come after you seems small to me, and my understanding is that it's sort of uncharted legal ground once you've done significant fine-tuning. That being said, I'm not a lawyer.

LLaMA is a very powerful model, and I would hate for you to put all this effort into creating something that ends up limited and not clearly better than Alpaca simply because of license fears. If I were you, though, I'd go with the 13B version: still small enough to run on many high-end consumer GPUs after quantization, while providing significantly better baseline performance than the 7B version.
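
On the quantization point, a minimal sketch of loading a converted LLaMA-13B checkpoint in 8-bit with transformers and bitsandbytes. The checkpoint path is a placeholder, assuming you've already converted the weights to the Hugging Face format yourself:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-13b-hf"  # placeholder: locally converted LLaMA-13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # 8-bit weights via bitsandbytes; needs bitsandbytes + accelerate
    device_map="auto",   # spread layers across available GPU(s) and CPU
)
```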

20

yehiaserag t1_jcru1ty wrote

Do you have a repo or a website, anywhere we can follow the progress of this?

3

noobgolang t1_jcrvlfl wrote

Someone needs to take the plunge and release all of this to the wild, rather than keeping it closed source.

6

Craiglbl t1_jcrxjy2 wrote

ChatGLM is really good. I sometimes have a hard time distinguishing its Chinese outputs from those of ChatGPT.

Sadly, its English could use some improvement, as it often uses Chinese adjectives when it can't find a similar word in English.

8

RoyalCities t1_jcrxlvr wrote

I was talking to GPT-4 about this and it said that it seems plausible and can dramatically bring down costs.

It called it "knowledge distillation".

It also mentioned that if we had access to the weights from OpenAI, you could use a process called model compression to scale the hardware requirements down and run it on less powerful GPUs or distributed GPUs (like how render farms work).

This also explains why OpenAI is so cagey about releasing weights: the initial training cost is where the money sink is, but once the weights are out there are ways to make it run on cheaper hardware.

But I'm wondering: does this mean the smaller model can ONLY respond to the questions you're generating, or will it have latent knowledge outside of just the knowledge transfer? Would a smaller model trained with this approach also be able to answer questions on topics that are "restricted" in OpenAI's view, which you couldn't ask the original, or do you absolutely need an initial answer for such restricted content before it can produce a response?

I'm talking about things like writing malicious code or whatnot. I don't plan on doing that, obviously, but I'm curious whether this means these smaller models will basically be totally unrestricted now, or whether a model trained on tons of Python code could create such malicious code from scratch without ever being shown examples of "how" to make it (since it has a greater knowledge of the underlying principles of Python).

Edit: Okay, guess it can, per GPT-4.

Damn these things are fascinating.

>Yes, the same concerns can apply to a smaller model being trained from a larger one via knowledge distillation. Knowledge distillation is a technique where the smaller model learns to mimic the larger model's behavior by training on a dataset generated using the larger model's outputs. The smaller model effectively learns from the larger model's knowledge and understanding of language patterns and concepts.

>As a result, the smaller model can also gain latent knowledge about various topics and domains, even if it hasn't been explicitly exposed to specific examples during training. This means that the smaller model could potentially generate undesirable content based on its understanding of the relationships between words and concepts, similar to the larger model.
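
For what it's worth, here's a minimal sketch of the sequence-level distillation GPT-4 is describing: fine-tune a small open model directly on (instruction, response) pairs generated by the larger one. The student model name and the JSONL file are placeholders, not anything from this thread:

```python
import json
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "EleutherAI/pythia-1.4b"  # hypothetical choice; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(student_name)
tokenizer.pad_token = tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained(student_name)

# Each line: {"instruction": "...", "response": "..."} produced by the teacher model.
pairs = [json.loads(line) for line in open("teacher_pairs.jsonl")]
texts = [f"### Instruction:\n{p['instruction']}\n### Response:\n{p['response']}" for p in pairs]

def collate(batch):
    enc = tokenizer(batch, truncation=True, max_length=512, padding=True, return_tensors="pt")
    # Standard causal LM loss on the teacher's text; padding positions are ignored.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(texts, batch_size=4, shuffle=True, collate_fn=collate)
optim = torch.optim.AdamW(student.parameters(), lr=2e-5)

student.train()
for batch in loader:
    loss = student(**batch).loss  # the student learns to imitate the teacher's outputs
    loss.backward()
    optim.step()
    optim.zero_grad()
```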

3

ThatInternetGuy t1_jcs253z wrote

You know the ChatGPT and GPT-4 terms forbid using their output data to train competing AI models. What Stanford did was show a proof of concept for their paper, not open-source the model at all.

25

ReasonablyBadass t1_jcs32ea wrote

Careful. That MIT license won't work, I think, thanks to ClosedAI's licences.

6

frownGuy12 t1_jcsfnh7 wrote

If OpenAI wants people to respect their IP, they should take the word "open" out of their name. They scraped our data to train their models, after all; it's not like OpenAI themselves aren't pushing the boundaries of what's acceptable when it comes to copyright law.

Legally it’s questionable, but ethically speaking I think it’s a fine idea.

53

Anjz t1_jcsktsf wrote

It's probably untested in court, and there are so many loopholes and variables too: what counts as a competing AI model? Companies usually just spew a bunch of stuff in their terms of use, some of which has no legal basis.

19

raduqq t1_jcslcbz wrote

I thought their ToS doesn't allow you to train another model on the output of ChatGPT, GPT-4, or their other models.

1

Long19980 t1_jcsm4ni wrote

Can I see your Python script? How did you balance your programming-language data between the various languages?

1

FaceDeer t1_jcsot55 wrote

All these weird restrictions and regulations seem pretty squirrelly to me.

Maybe this could be "laundered" by doing two separate projects: have one project gather the 2 million question/response interactions into a big archive, which is then released publicly. Then some other project comes along and uses it for training, without ever directly interacting with ChatGPT itself.

I'm sure this wouldn't really stop a lawsuit, but the more complicated it can be made for OpenAI to pursue one, the less likely they are to go ahead.

5

Seromelhor t1_jcsov3a wrote

You can use NLLB from Facebook to translate the sentences from English to more than 200 other languages. That would be interesting.
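
A minimal sketch of what that could look like with the distilled NLLB checkpoint on the Hugging Face Hub. NLLB uses FLORES-200 language codes, and the target language here is just an example:

```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",   # FLORES-200 code for English
    tgt_lang="spa_Latn",   # example target: Spanish; swap for any of the 200+ languages
)

result = translator("Explain what a language model is.", max_length=200)
print(result[0]["translation_text"])
```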

1

lxe t1_jcsqmdi wrote

You should try fine-tuning OpenChatKit; it's Apache 2 licensed afaik. Or GPT-NeoX-20B if you have the hardware.

1

asraniel t1_jcsr22m wrote

So how does that work? Soon a good chunk of the internet will be text written by GPT (including Wikipedia). Does that mean that, going forward, you can't legally use the internet as a data source to train an LLM?

5

ninjasaid13 t1_jcsth4w wrote

>Careful. That MIT license won't work, I think, thanks to ClosedAI's licences.

Generally, copyright requires human authorship. If the output of an AI model is generated solely by a machine without human input, it may not be eligible for copyright protection and may instead fall into the public domain.

4

objectdisorienting t1_jcsu3xk wrote

It will be interesting to see where lawmakers and courts ultimately land on this, but the current status quo is that AI-generated text and images (or any other works) cannot be copyrighted. In other words, for now all output is public domain and OpenAI can kick rocks on this. A TOS violation just means you might get banned from using their service lol.

1

Stock-Nebula2185 t1_jcsuzb5 wrote

You can also query Codex for free. It might not be as good as ChatGPT, but perhaps still worth trying?
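
For reference, a sketch of querying Codex with the (pre-1.0) openai Python client. The model name and the free beta access are assumptions that may no longer hold:

```python
import openai

openai.api_key = "sk-..."  # your API key

# code-davinci-002 was the Codex beta model; availability is an assumption.
completion = openai.Completion.create(
    model="code-davinci-002",
    prompt="# Python function that reverses a linked list\n",
    max_tokens=200,
    temperature=0,
)
print(completion.choices[0].text)
```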

2

ninjasaid13 t1_jcsv2oi wrote

That's what the copyright office said regarding that Midjourney comic someone tried to register for copyright.

Since it was created by an AI, the output cannot be registered for copyright, and a license doesn't hold power over something that's in the public domain.

2

hapliniste t1_jcsxpna wrote

Nice 👍 good project, I'm eager to see the result. It would also be great to make a torrent of the dataset, to avoid unnecessary costs in the future.

1

baffo32 t1_jcsy5mb wrote

Maybe set up the training code so different foundation models can be plugged in for fine-tuning; then it's just compute if somebody wants a different starting model.

Note there are free interfaces to these models, such as https://spellbook.scale.com/. Also note there is a lot of data collected out there already.
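
Something like this keeps the foundation model pluggable. The config keys and model name below are hypothetical, just to show the idea:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical config: swap the base model without touching the training code.
config = {
    "base_model": "EleutherAI/pythia-2.8b",  # could be any causal LM on the Hub
    "max_length": 512,
}

def load_base(cfg):
    tokenizer = AutoTokenizer.from_pretrained(cfg["base_model"])
    model = AutoModelForCausalLM.from_pretrained(cfg["base_model"])
    return tokenizer, model

tokenizer, model = load_base(config)
```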

0

starstruckmon t1_jct06xj wrote

  • There are already a couple of high-quality instruction datasets/compilations, like FLAN, which I think should also be mixed in (see the sketch after this list).

  • Be sure to check the generated dataset for issues. It might require some cleanup, like the original did.
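
On the first point, a minimal sketch of mixing an existing instruction dataset into the generated one with the `datasets` library. File names and columns are placeholders; both files are assumed to share the same schema:

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder paths; swap in your generated pairs and whatever FLAN-style subset you use.
generated = load_dataset("json", data_files="generated_pairs.jsonl", split="train")
flan_style = load_dataset("json", data_files="flan_subset.jsonl", split="train")

# Concatenation requires matching columns; shuffle so batches mix both sources.
mixed = concatenate_datasets([generated, flan_style]).shuffle(seed=42)
print(mixed)
```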

3

starstruckmon t1_jct0s11 wrote

They are. It's less to do with copyright and more to do with the fact that you signed the T&C before using their system (and then broke them). It's similar to the LinkedIn data-scraping case, where the court ruled that the scraping wasn't illegal (nor did it violate copyright), but they still got in trouble (and had to settle) because of violating the T&C.

One way around this is to have two parties: one generating and publishing the dataset (which doesn't violate the T&C), and another independent party (who never signed the T&C) fine-tuning a model on the dataset.

6

wywywywy t1_jct2wjz wrote

Are you doing a LoRA or full weights?

> I wanted to train Meta's LLaMA model on this data, but considering their license, I'm not sure if that is the best way. Suggestions will be appreciated.

If we ignore OpenAI's licence, is it OK to perhaps ignore Meta's licence as well? Or is that going too far?

> The trained model will be open source, under MIT License.

Is the dataset going to be open source as well, so that other people can use it to train other models?
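
On the LoRA question above, the difference in practice looks roughly like this (a sketch with the peft library; the checkpoint path and target module names are assumptions for a LLaMA-style model):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/llama-13b-hf")  # placeholder path

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical choice for LLaMA-style attention
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices get gradients
```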

1

[deleted] OP t1_jct6a1x wrote

Provide a clear and formal way for the community to express its opinion. You know, as opposed to tweeting at one person who does not have absolute control over Meta AI. Notable people have brought attention to that pull request and it is currently gaining traction.

2