Submitted by besabestin t3_10lp3g4 in MachineLearning

I have two questions about ChatGPT. I don't come from a machine learning background; I'm just a programmer, so bear with me if they sound a bit dumb.

I was looking into ChatGPT a bit over the last week. I went through their papers and also tried fine-tuning it myself by creating a fictional world and giving it some examples.

The first thing I wondered is what is so special about the model, beyond the large dataset and parameter count, that other competitors can't replicate. I ask because I have seen a lot of "Google killer" discussions in some places. From what I understood from their papers, I thought it is something another company with the computing power and the filtered data could have up and running in a few months. I see their advantage in rolling out to the public first: with feedback from actual users all over the world, it can potentially be retrained.

The second thing I wondered about is its scalability. It feels to me like keeping it scalable will be a very big challenge in the future. Currently, getting a long text out of it is kind of painful because it has to generate continuously, and I think it is continuously computing with that huge parameter set. I also wonder about new trends: does it need to be retrained to keep up with them? I also tried fine-tuning, creating a fictional world with its own laws and rules, and the job sat in the queue for hours. Is it creating a separate set of parameters for my case? That would be a lot, considering how many parameters they already have.

25

Comments


manubfr t1_j5y6wko wrote

Google (and DeepMind) actually have better LLM tech and models than OpenAI (if you believe their published research, anyway). They had a significant breakthrough last year in terms of scalability: https://arxiv.org/abs/2203.15556

Existing LLMs were found to be undertrained, and with some tweaks you can create a smaller model that outperforms larger ones. Chinchilla is arguably the most performant model we've heard of to date ( https://www.jasonwei.net/blog/emergence ), but it hasn't been pushed to any consumer-facing application AFAIK.
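For a rough sense of what that breakthrough says in numbers, here's a back-of-the-envelope sketch. The 6ND compute approximation and the ~20 tokens-per-parameter ratio are the paper's published rules of thumb, and the Gopher/Chinchilla figures are the ones reported there; nothing here is specific to OpenAI's models:

```python
# Back-of-the-envelope Chinchilla arithmetic, using the paper's two rules of thumb:
# training compute ~ 6 * N * D FLOPs, and compute-optimal training wants
# roughly 20 training tokens per parameter.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs (the common 6*N*D rule)."""
    return 6 * n_params * n_tokens

def compute_optimal_tokens(n_params: float) -> float:
    """Roughly 20 training tokens per parameter for compute-optimal training."""
    return 20 * n_params

# Gopher (280B params, 300B tokens) vs Chinchilla (70B params, 1.4T tokens):
# a roughly similar compute budget, very different parameter/data split.
print(f"Gopher-style budget: {training_flops(280e9, 300e9):.2e} FLOPs")
print(f"Chinchilla budget:   {training_flops(70e9, 1.4e12):.2e} FLOPs")
print(f"Optimal tokens for a 70B model: {compute_optimal_tokens(70e9):.2e}")
```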

This should be powering their ChatGPT competitor Sparrow, which might be released this year. I'm pretty sure OpenAI will also implement those ideas for GPT-4.

44

vivehelpme t1_j5y70zt wrote

>what is so special about the model, beyond the large dataset and parameter count

OpenAI has a good marketing department and the web interface is user-friendly. But yeah there's really no secret sauce to it.

The model generates the text snippet in a batch; it just prints it a character at a time for dramatic effect (and to keep you occupied for a while, so you don't overload the horribly computationally expensive cloud service it runs on with multiple queries in quick succession). So yeah, there are definitely scaling questions to answer before it could be run as a general casual search engine replacing Google.

23

CKtalon t1_j5y87e5 wrote

People often quote Chinchilla about performance, claiming that there's still a lot of performance to be unlocked, when we do not know how GPT-3.5 was trained. GPT-3.5 could very well be Chinchilla-optimal, even though the first version of davinci was not. We know that OpenAI has retrained GPT-3, given the context length increases from 2048 to 4096 to the apparent ~8000 tokens for ChatGPT.

9

manubfr t1_j5y8mo0 wrote

You're right, it could be that 3.5 is already using that approach. I guess the emergent cognition tests haven't yet been published for GPT-3.5 (or have they?), so it's hard for us to measure performance as individuals. I guess someone could test text-davinci-003 on a bunch of cognitive tasks in the Playground, but I'm far too lazy to do that :)

2

besabestin OP t1_j5ya2af wrote

I see, interesting. I thought it was generating tokens one by one like that. I wonder why it sometimes encounters an error after generating a long text and just stops halfway through the task, which happened to me frequently.

7

crt09 t1_j5ytazq wrote

The guy above was kind of unclear: it's an autoregressive language model, so it does generate one token at a time, feed it back into the input, and generate the next one. The text could have been sent to the client in one go, once they'd waited for generation to stop, and printed all at once, but they went with the fancy streaming GUI, possibly yeah as a way to slow down spamming.
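To make the token-by-token loop concrete, here's a minimal greedy-decoding sketch using a small open model (GPT-2 via Hugging Face transformers). It only illustrates autoregressive generation; how OpenAI actually serves and streams ChatGPT isn't public, so treat the streaming part as an assumption:

```python
# Minimal autoregressive (greedy) decoding loop with GPT-2, printed token by
# token as it's generated. Illustration only; ChatGPT's serving stack is not public.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Scaling a language model service is", return_tensors="pt").input_ids
for _ in range(20):                           # generate 20 tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids).logits      # forward pass over the whole prefix
    next_id = logits[0, -1].argmax()          # greedy choice of the next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
    print(tokenizer.decode([next_id.item()]), end="", flush=True)  # stream it out
print()
```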

21

Dendriform1491 t1_j5ywgiz wrote

Also, Google doesn't use GPUs; they designed their own chips, which they call TPUs.

TPUs are ASICs designed specifically for machine learning. They don't have any graphics-related components, they are cheaper to make, they use less energy, and Google can make as many as it wants.

15

cdsmith t1_j5z0rrm wrote

You don't have to be Google to use special-purpose hardware for machine learning, either. I work for a company (Groq) that makes a machine learning acceleration chip available to anyone. Groq has competitors, like SambaNova and Cerebras, with different architectures.

13

andreichiffa t1_j60625r wrote

So. First of all, it's not the size, or at least not only the size.

Before ChatGPT, OpenAI experimented with InstructGPT, which at 6B parameters completely destroyed the 175B GPT-3 when it came to satisfying the users interacting with it and not being completely psycho.

Code-generating abilities start at around 12B parameters (OpenAI Codex), so most of the things you are interacting with and are impressed by could be done with a 12B-parameter model. What is really doing the heavy lifting for ChatGPT is fine-tuning and guided generation to make it conform to users' expectations.

Now, the model size allows for nice emergent properties, but there is a relationship between dataset size and model size, meaning that without increasing the dataset, bigger models do nothing better. At 175B parameters, GPT-3 was already past that point relative to the curated dataset OpenAI used for it. And given that their dataset already contained Common Crawl, it was pretty much all public writing on the internet.

And they weren't short by a small margin: the gap was over a factor of 10x. Finding enough data just to finish training GPT-3 is already a challenge; larger models would need even more. That's why they could dump code and more text into GPT-3 to create GPT-3.5 without creating bottlenecks.
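For what it's worth, the publicly reported figures line up with that factor: GPT-3 has 175B parameters and was reportedly trained on roughly 300B tokens, while a Chinchilla-style compute-optimal run wants around 20 tokens per parameter (a rough rule of thumb, not an exact prescription):

```python
# Rough arithmetic behind the ">10x" data gap, using published figures only.
params = 175e9          # GPT-3 parameter count
tokens_used = 300e9     # reported GPT-3 training tokens
tokens_optimal = 20 * params                                     # Chinchilla-style rule of thumb
print(f"Compute-optimal tokens: {tokens_optimal:.1e}")           # ~3.5e12
print(f"Shortfall factor: {tokens_optimal / tokens_used:.1f}x")  # ~11.7x
```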

Now, alternative models to GPT-3 have been trained (OPT-175B, BLOOM), but at least OPT-175B underperforms. OpenAI actually did a lot of data preparation, meaning that anyone who wants to replicate it would need to figure out that "secret sauce".

7

londons_explorer t1_j60m5ui wrote

This isn't true.

The model generates 1 token at a time, and if you look at the network connection you can see it slowly loading the response.

I'm pretty sure the answer is returned as fast as OpenAI can generate it on their cluster of GPUs.

7

cdsmith t1_j60q0bs wrote

I can only answer about Groq. I'm not trying to sell you Groq hardware, honestly... I just honestly don't know the answers for other accelerator chips.

Groq very likely increases inference speed and power efficiency over GPUs; that's actually its main purpose. How much depends on the model, though. I'm not in marketing so I probably don't have the best resources here, but there are some general performance numbers (unfortunately no comparisons) in this article, and this one talks about a very specific case where a Groq chip gets you a 1000x inference performance advantage over the A100.

To run a model on a Groq chip, you would typically start before CUDA enters the picture at all, and convert from PyTorch, TensorFlow, or a model in several other common formats into a Groq program using https://github.com/groq/groqflow. If you have custom-written CUDA code, then you likely have some programming work ahead of you to run on something besides a GPU.
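Roughly, the conversion step looks like the sketch below (based on the groqflow README; treat the exact `groqit` signature as an assumption and check the repo for the current API):

```python
# Sketch of converting a small PyTorch model into a Groq program with groqflow.
# The groqit(...) entry point follows the groqflow README; verify the exact,
# current signature at https://github.com/groq/groqflow before relying on it.
import torch
from groqflow import groqit  # assumed entry point per the README

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 10)

    def forward(self, x):
        return self.fc(x)

model = TinyModel()
inputs = {"x": torch.randn(1, 128)}

gmodel = groqit(model, inputs)   # compile the model into a Groq program
print(gmodel(**inputs))          # run it on Groq hardware (or the toolchain's runtime)
```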

7

ObjectManagerManager t1_j60y1rn wrote

OpenAI's LLM is special because it's open to the public. That's it. Other tech companies' internal LLMs are likely better. Google has a whole database of billions of websites and indexes directly at its disposal; I'm quite confident they could outperform ChatGPT with ease. If Google were really afraid of ChatGPT running them out of business, they'd just release a public API for their own, better model. And they have a monopoly over the internet in terms of raw data and R&D; it would be virtually impossible for anyone else to compete.

Besides that, the whole "Google killer" thing is an overreaction, IMO. The public version of ChatGPT doesn't retrain on, or even prompt-condition on, new public internet data, so if you ask it about recent news, it'll spit out utter garbage. An internal version reportedly does seek out and retrain on new public internet data. But how does it find that data? With a neat tool that constantly crawls the web and builds large, efficient databases and indexes. Oh yeah: that's called a search engine.

So even if end users start using LLMs as a substitute for search engines (which is generally not happening at the moment, and seems unlikely to become a concern in the age of GPT-3, despite what many people believe), most LLM queries will likely be forwarded to some search engine or other for prompt conditioning. Search engines will not die; they'll just have to adapt to being useful for LLM prompt conditioning in addition to being useful to end users.
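A minimal sketch of what "forwarding an LLM query to a search engine for prompt conditioning" could look like. Both `web_search` and `complete` are hypothetical placeholders, not any particular vendor's API:

```python
# Illustrative retrieval-augmented prompting: condition the LLM on fresh search
# results instead of expecting it to know recent facts. Both functions below are
# hypothetical placeholders, not a real search-engine or LLM API.
from typing import List

def web_search(query: str, k: int = 3) -> List[str]:
    """Placeholder: return the top-k snippets from some search index."""
    raise NotImplementedError("plug in a real search backend here")

def complete(prompt: str) -> str:
    """Placeholder: call whatever LLM completion endpoint you have access to."""
    raise NotImplementedError("plug in a real LLM endpoint here")

def answer_with_retrieval(question: str) -> str:
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question using only the search results below.\n"
        f"Search results:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)
```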

17

lucidrage t1_j61so7l wrote

>convert from PyTorch, TensorFlow, or a model in several other common formats into a Groq program

Is any effort being spent on adding a plugin for a high-level framework like Keras to automatically use Groq?

1

gradientpenalty t1_j61tko2 wrote

Okay, so where can I buy it as a small startup, for under $10k, without signing an NDA to use your proprietary compiler? As far as I can see, we are all still stuck with Nvidia after $10B of funding for all these "AI" hardware startups.

3

lucidrage t1_j61u7zt wrote

>that's called a search engine.

like Bing? :D

Google isn't known for developing and keeping new products around. When that Google engineer leaked that "sentient AI" model, why didn't Google beat the news by releasing a Google GPT with search-engine capabilities?

With their 150k engineers, I doubt they lack the resources to build a user-friendly version of their LLM, so how come they've been sitting on their hands this whole time?

3

binheap t1_j61v2f2 wrote

If you believe them, model safety is why there isn't a general public release. LLMs (including ChatGPT) tend to be bad at factual accuracy and can easily hallucinate. It's not obvious that you can work LLMs into a product where accuracy matters a lot, and it might hurt brand image in ways that Google could not tolerate but OpenAI can.

4

cdsmith t1_j626c0c wrote

I honestly don't know the price or terms of use, for this or any other company. I'm not in sales or marketing at all. I said you don't need to be Google; obviously you have to have some amount of money, whether you're buying a GPU or some other piece of hardware.

1

jobeta t1_j62eibb wrote

IMHO the buzz is mainly about the UX ChatGPT provides. Most LLMs are not that easily accessible, and most people never get to experience any aha moment with them, so most people don't care. As for Google, I do think there is real, but not immediate, danger for their business model. The big issue for them is that 60% of their revenue comes from ads in Google Search, so rolling out an amazing ChatGPT equivalent could potentially hurt their business; they would have to rethink the entire model. For now, AFAIK, ChatGPT doesn't provide web links, so it doesn't feel like it is trying to sell you something. If Google is going to take one of their SOTA LLMs, build a conversational AI out of it, and make it available for free, surely they have to consider the implications for Alphabet as a whole.

3

Mountain_Lab_5857 t1_j65ojup wrote

You can check Damien Benveniste on LinkedIn; I don't remember when it was shared, but there is an article about model parallelism for training.

1

golongandprosper t1_j67dc19 wrote

I read an article saying it's so good because they hired "almost slaves" at the lowest possible price, from some downtrodden country. $2 was the rate; I don't know if that's per day or per hour.

And hundreds to thousands of these serfs spent their days testing and manually training it. So they apparently got hundreds of thousands of hours of human manual training, at a price that many Americans could afford by taking out a mortgage on their house. And apparently those workers are still there, manually watching and reacting to queries in real time to verify the answers are decent, while the rest of the world gives them more data for free.

So when it says the servers are busy and asks you to wait? That could mean the humans are busy ;p

1

visarga t1_j6bzixy wrote

> without increasing the dataset, bigger models do nothing better

Wrong, bigger models are better than small models even when both are trained on exactly the same data. Bigger models reach the same accuracy using fewer examples. Sometimes using a bigger model is the solution to having too little data.

0

visarga t1_j6c01ua wrote

> But yeah there's really no secret sauce to it.

Of course there is: it's the data. They keep their mix of primary training sets with organic text, multi-task fine-tuning, code training, and RLHF secret. We know only in general terms what they are doing, but the details matter. How much code did they train on? It matters. How many tasks? 1,800 like Flan-T5, or many more, like 10,000? We have no idea. Do they reuse the prompts to generate more training data? Possibly. Others don't have their API logs because they had no demo.

1

visarga t1_j6c0e8m wrote

They might use a second model to flag abuse, not once per token but once per line or phrase. Their models are already trained to avoid being abused, but this second model is like insurance in case the main one doesn't work.
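If they do something like that, a per-sentence check with a separate classifier might look roughly like this sketch; `is_abusive` is a hypothetical stand-in, and nothing here reflects OpenAI's actual pipeline:

```python
# Hypothetical sketch of a second model flagging abuse once per sentence or line
# rather than once per token. is_abusive() stands in for whatever classifier is
# used; nothing here is known about OpenAI's actual setup.
import re
from typing import Iterator

def is_abusive(text: str) -> bool:
    """Placeholder for a separate moderation classifier."""
    raise NotImplementedError("plug in a real classifier here")

def moderated_stream(token_stream: Iterator[str]) -> Iterator[str]:
    buffer = ""
    for token in token_stream:
        buffer += token
        yield token
        # Only run the classifier when the buffer ends at a sentence/line boundary.
        if re.search(r"[.!?\n]\s*$", buffer):
            if is_abusive(buffer):
                yield "\n[response stopped by moderation]"
                return
            buffer = ""
```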

1

visarga t1_j6c0o3e wrote

I very much doubt they do this in real time. The model is responding too fast for that.

They are probably used for RLHF model alignment: to keep it polite, helpful, and harmless, and to generate more samples of tasks being solved, whether by vetting our ChatGPT interaction logs, using the model from the console like we do to solve tasks, or effectively writing the answers themselves where the model fails.

1

andreichiffa t1_j6c9xf1 wrote

That's a very bold claim that flies in the face of pretty much all the research on the subject to date.

Surely you have extraordinary evidence to support such extraordinary claims?

1

gamerx88 t1_j6cqerx wrote

It's not about large data or the number of parameters. OpenAI has not actually revealed details of ChatGPT's architecture and training. What is special is the fine-tuning procedure: alignment through RLHF of the underlying LLM (nicknamed GPT-3.5), which makes it extremely good at giving "useful" responses to prompts/instructions.

Prior to this innovation, zero-shot and in-context few-shot learning with LLMs was barely working. Users had to trial-and-error their way to some obtuse prompt to get the LLM to generate a sensible response, if it worked at all. This is because LLM pre-training is purely about language structure and doesn't account for intent (what the human wishes to obtain via the prompt). Supervised fine-tuning on instruction/output pairs helped, but not by much. With RLHF, however, the process is so effective that a mere 6B-parameter model (fine-tuned with RLHF) is able to surpass a 175B-parameter model. Check out the InstructGPT paper for details.
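For reference, the reward-modelling step in the InstructGPT recipe boils down to a pairwise preference loss over human rankings. Here's a minimal PyTorch sketch of just that loss; the tiny MLP standing in for the reward model is made up for illustration, while the -log sigmoid(r_chosen - r_rejected) form is the one described in the paper:

```python
# Minimal sketch of the pairwise reward-model loss used in RLHF (InstructGPT-style):
# the reward model is trained to score the human-preferred response higher than
# the rejected one. The tiny MLP "reward model" here is purely illustrative; in
# practice it is the LLM itself with a scalar head over (prompt, response) pairs.
import torch
import torch.nn.functional as F

reward_model = torch.nn.Sequential(       # toy stand-in for an LLM with a scalar head
    torch.nn.Linear(768, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1)
)

def preference_loss(chosen_features: torch.Tensor, rejected_features: torch.Tensor) -> torch.Tensor:
    r_chosen = reward_model(chosen_features)      # scalar reward for preferred response
    r_rejected = reward_model(rejected_features)  # scalar reward for rejected response
    # -log(sigmoid(r_chosen - r_rejected)): pushes the preferred reward above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy batch of feature vectors standing in for (prompt, response) representations.
loss = preference_loss(torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```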

2

visarga t1_j6n5mgc wrote

Oh, yes, gladly. This "open"AI paper says it:

> Larger models are significantly more sample efficient, such that optimally compute efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

https://arxiv.org/abs/2001.08361

You can improve outcomes from small datasets by making the model larger.

1

andreichiffa t1_j6n9lg6 wrote

A lot of the conclusions from that paper were called into question a little less than a year later by the discovery that GPT-2 was actually memorizing a lot of information from its training dataset: https://arxiv.org/abs/2012.07805

About a year after that, Anthropic came out with a paper suggesting there were scaling laws that meant undertrained larger models did not do that much better and actually did need more data: https://arxiv.org/pdf/2202.07785.pdf

Finally, more recent results from DeepMind took an additional pass at the topic and seem to suggest that the relationship between data and model size is much tighter than anticipated, and that a 4x smaller model trained for 4x as long would outperform the larger model: https://arxiv.org/pdf/2203.15556.pdf

Basically, the original OpenAI paper contradicted a lot of prior research on overfitting and generalization, and the discrepancy seems to be due to an instance of Simpson's paradox in some of the batching they were doing.

1