Submitted by 00001746 t3_1244q71 in MachineLearning
rshah4 t1_jdy0mjg wrote
I wouldn't get worried about training these models from scratch. Very few people are going to need those skills. My suggestion is to focus on learning how to use these models (prompting, chained prompting à la LangChain) and then maybe fine-tuning. Fine-tuning these models is going to be key, and people are just now starting to make those techniques widely usable. I just finished a video on using PEFT for fine-tuning an LLM with LoRA. So don't stress, it's very early and the tools are just starting to become easier to use.
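The core of that approach is only a few lines; here's a minimal sketch (model name and hyperparameters are just illustrative, roughly following the FLAN-T5 recipe I link further down):

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT; the model and
# hyperparameters here are illustrative, not a recommendation.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "google/flan-t5-base"  # any seq2seq checkpoint works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# LoRA adds small trainable low-rank matrices to the attention projections,
# so only a tiny fraction of the parameters is updated during fine-tuning.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                        # rank of the low-rank update
    lora_alpha=32,               # scaling factor
    target_modules=["q", "v"],   # T5 query/value projections
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# From here, train with the usual transformers Seq2SeqTrainer on your dataset.
```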
antonivs t1_jdyp1zw wrote
> I wouldn't get worried about training these models from scratch. Very few people are going to need those skills.
Not sure about that, unless you also mean that there are relatively few ML developers in general.
After the ChatGPT fuss began, one of our developers trained a GPT model on a couple of different subsets of our company's data, using one of the open-source GPT packages, which is obviously behind GPT-3, 3.5, or 4. He got very good results though, to the point that we're working on productizing it. Not every model needs to be trained on internet-sized corpora.
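I don't know his exact setup off-hand, but with today's open-source tooling the from-scratch part is surprisingly compact. Roughly something like this (Hugging Face transformers assumed; model size and data are purely illustrative):

```python
# Rough sketch of training a small GPT-style model from scratch; the package,
# model size, and data here are assumptions for illustration only.
from transformers import (GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast,
                          Trainer, TrainingArguments, DataCollatorForLanguageModeling)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Small architecture: domain-specific corpora rarely need GPT-3-scale models.
config = GPT2Config(vocab_size=tokenizer.vocab_size, n_layer=6, n_head=8,
                    n_embd=512, n_positions=512)
model = GPT2LMHeadModel(config)  # randomly initialized weights = from scratch

# Placeholder corpus; in practice this would be the internal company data.
texts = ["example internal document one", "example internal document two"]
train_dataset = [tokenizer(t, truncation=True, max_length=512) for t in texts]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
args = TrainingArguments(output_dir="gpt-internal",
                         per_device_train_batch_size=2, num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=train_dataset,
        data_collator=collator).train()
```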
Craksy t1_jdywiwi wrote
Well, that doesn't really contradict the previous comment. They did mention fine-tuning as an exception. GPT even stands for Generalized Pretrained Transformer. I'm sure some people like to draw hard lines between transfer learning/specialisation/fine-tuning (different task or just different data), but at any rate, what you're describing can hardly be considered "training from scratch".
Indeed, very few will need to be able to train models at that scale. In fact, that was the whole motivation behind GPT. Training LLMs from scratch consumes a tremendous amount of resources, and 99% of that work goes into building a foundation that happens to generalize very well across many different tasks.
kalakau t1_jdzb1jx wrote
> Generalized Pretrained Transformer
this is pedantic but it's actually Generative PT
Craksy t1_jdzbgzj wrote
Not at all.
While it doesn't mean the world for the point I was trying to make, it does change the meaning quite a bit.
Thank you for the correction
antonivs t1_je1cuw1 wrote
My description may have been misleading. They did the pretraining in this case. The training corpus wasn't natural language, it was a large set of executable definitions written in a company DSL, created by customers via a web UI.
Craksy t1_je3tzt3 wrote
Aah, got you. My bad. Well, I suppose most people mainly think of NLP in these kinds of contexts. That's where my mind went, anyway.
Training from scratch on a DSL is indeed an entirely different scale of problem (assuming it's not some enormous, complex DSL that relies heavily on context and thousands of years of culture to make sense of).
Sounds very interesting though. If you're allowed to share more information, I'd love to hear about it
antonivs t1_je82r3j wrote
Well, I do need to be a bit vague. The main DSL has about 50 instructions corresponding to actions to be performed. There's also another different sub-DSL, with about 25 instructions, to represent key features of the domain model, that allows particular scenarios to be defined and then recognized when executing.
Both DSLs are almost entirely linear and declarative, so there's no nested structure, and the only control flow is a conditional branch instruction in the top-level DSL, to support conditional execution and looping. The UI essentially acts as a wizard, so that users don't have to deal with low-level detail.
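To give a rough flavour of the shape, with completely made-up instruction names rather than anything from our actual DSL, a program is basically just a flat list of instructions:

```python
# Made-up opcodes, purely to illustrate the flat, declarative shape described
# above: a linear instruction list, with branch_if as the only control flow.
program = [
    {"op": "check_status",  "args": {"field": "payment"}},
    {"op": "branch_if",     "args": {"cond": "payment_failed", "target": 4}},
    {"op": "send_receipt",  "args": {"template": "standard"}},
    {"op": "halt",          "args": {}},
    {"op": "retry_payment", "args": {"max_attempts": 3}},
    {"op": "branch_if",     "args": {"cond": "payment_failed", "target": 4}},  # backward branch ~ looping
]

# Serialized one instruction per line before being handed to the model.
lines = [(ins["op"] + " " + " ".join(f"{k}={v}" for k, v in ins["args"].items())).strip()
         for ins in program]
print("\n".join(lines))
```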
There are various ideas for the GPT model, including suggesting instructions when creating a program, self-healing when something breaks, and finally generating programs from scratch based on data that we happen to already collect anyway.
NLP will probably end up being part of it as well - for that, we'd probably use the fine-tuning approach with an existing language model as you suggested.
visarga t1_jdzt9gd wrote
> Generalized 😓
abnormal_human t1_jdywyac wrote
I'm in the midst of a similar project. It also doesn't require massively expensive compute, because for domain-specific tasks you often don't need models with gajillions of parameters to achieve business-interesting results.
antonivs t1_jdz6vai wrote
Exactly what I was getting at, yes.
happycube t1_jdzq0v4 wrote
nanoGPT's good for this sort of from-scratch training; there's an updated version of the classic char-RNN Shakespeare model in the repo.
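The char-level prep behind that example boils down to a few lines; a sketch of the idea (file name and split ratio are assumptions here, not nanoGPT's exact script):

```python
# Char-level data prep in the spirit of nanoGPT's shakespeare_char example:
# build a character vocabulary, encode the text to ids, write train/val splits.
import numpy as np

text = open("input.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}       # char -> integer id

ids = np.array([stoi[ch] for ch in text], dtype=np.uint16)
split = int(0.9 * len(ids))                        # assumed 90/10 train/val split
ids[:split].tofile("train.bin")
ids[split:].tofile("val.bin")
print(f"vocab size: {len(chars)}, total tokens: {len(ids):,}")
```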
antonivs t1_je0pfza wrote
Thanks! I actually don't know exactly what this guy used, I'll have to check.
Qpylon t1_jdzmiaq wrote
I’m curious, is this for your company wiki or something? Was considering trying that with our documentation etc.
antonivs t1_je0pb85 wrote
Our product involves a domain-specific language, which customers typically interact with via a web UI, to control execution behavior. The first model this guy trained involved generating that DSL, so customers could enter a natural-language request and avoid having to go through a multi-step GUI flow.
They've tried using it for docs too; that worked well.
dancingnightly t1_je0o082 wrote
The benefit of fine-tuning or training your own text model (in the olden days on BERT, now through the OpenAI API) versus just using contextual semantic search is shrinking day by day... especially with the extended context window of GPT-4.
If you want something in-house, fine-tuning GPT-J or similar could be the way to go, but it's definitely not the career direction I'd take.
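For anyone unfamiliar, contextual semantic search is mostly just embeddings plus nearest-neighbour lookup, with the top hits pasted into the prompt as context. A bare-bones sketch (embedding model chosen only as an example):

```python
# Bare-bones contextual semantic search: embed the documents once, embed the
# query, rank by cosine similarity, and feed the best hits to the LLM as context.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # example embedding model
docs = [
    "How to reset your password",
    "Understanding your monthly invoice",
    "API rate limits and quotas",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

query_vec = model.encode("my invoice total looks wrong", normalize_embeddings=True)
scores = doc_vecs @ query_vec                      # cosine similarity (vectors normalized)
best = int(np.argmax(scores))
context = docs[best]                               # prepend this to the LLM prompt
print(context, float(scores[best]))
```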
antonivs t1_je1d8o0 wrote
The training corpus size here is in the multi-TB range, so it probably isn't going to work with the OpenAI API currently, from what I understand.
SlowThePath t1_je2buak wrote
No models are trained on internet-sized corpora. That would take an infinite amount of time, I would think.
antonivs t1_je7ws1v wrote
I was referring to what the OpenAI GPT models are trained on. For GPT-3, that involved about 45 TB of text data, part of which was Common Crawl, a multi-petabyte corpus obtained from 8 years of web crawling.
On top of that, 16% of its corpus was books, totaling about 67 billion tokens.
SlowThePath t1_je7xmaz wrote
Definitely not denying that it was trained on a massive amount of data, because it was, but calling it internet-sized is not accurate. I guess you were speaking in hyperbole and I just didn't read it that way. I know what you mean.
machineko t1_je05orp wrote
I agree. While these giant centralized models are all over the news, there are ways to make smaller models much more efficient (e.g. LoRA, mentioned above). And in the process of working with these techniques, we can perhaps discover new methods and architectures.
We are working on an open-source project focused on making fine-tuning for LLMs simple, fast, and efficient: https://github.com/stochasticai/xturing.
OP, we still have a ton of stuff we want to try out to make fine-tuning faster and more compute/memory efficient, if you are interested in contributing.
dimem16 t1_jdy7aja wrote
Thanks for your insight. Could you share the link to the video please?
rshah4 t1_jdy7o2h wrote
Here is my video: https://youtu.be/YKCtbIJC3kQ
Here is the blog post its based on: https://www.philschmid.de/fine-tune-flan-t5-peft
Efficient Large Language Model training with LoRA and Hugging Face
crazyvaclav3 t1_jdzx86v wrote
Is the video available? I'd love to see it
rshah4 t1_je00t24 wrote
Here is my video: https://youtu.be/YKCtbIJC3kQ
Here is the blog post its based on: https://www.philschmid.de/fine-tune-flan-t5-peft
Efficient Large Language Model training with LoRA and Hugging Face
I should also post in ML - I will do that later today
modernzen t1_je6xujz wrote
Totally agree with this. Something like ChatGPT is overkill for most use cases and comes at a cost in both money (using the API) and latency. Clever prompting and fine-tuning can let you build free, fast models tailored to the specific problem at hand.
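For a lot of narrow tasks, a small few-shot prompt against whatever model you already have gets you most of the way there (the task and labels below are invented for illustration):

```python
# Toy few-shot prompt for a narrow classification task; examples are made up.
FEW_SHOT = """Classify the support ticket as BILLING, BUG, or OTHER.

Ticket: "I was charged twice this month."
Label: BILLING

Ticket: "The export button crashes the app."
Label: BUG

Ticket: "{ticket}"
Label:"""

prompt = FEW_SHOT.format(ticket="How do I change my email address?")
print(prompt)  # send this to whichever local or hosted model you're using
```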