Submitted by 00001746 t3_1244q71 in MachineLearning
fiftyfourseventeen t1_jdz6eu7 wrote
Reply to comment by ---AI--- in [D] FOMO on the rapid pace of LLMs by 00001746
The only way you're training your own GPT-3-level model for $600 is by spending $300 on a gun and $300 renting a U-Haul and heisting a datacenter
Edit: maybe cheap out on the gun and truck; can't forget about the electricity costs of your newly acquired H100s
utopiah t1_jdzcevv wrote
[deleted] t1_jdzuwoq wrote
[deleted]
fiftyfourseventeen t1_je0u1oj wrote
You can't compare a LoRA to training a full model lol
utopiah t1_je0zqae wrote
Well, I just did, so please explain why not; I'm genuinely trying to learn. I'd also be curious if you have a list of trained models compared by cost. I've only seen CO2-equivalent order-of-magnitude figures, not rough price estimates, so that would help me build a better intuition, since you seem to know more about this.
That being said, the point was that you don't necessarily need to train anything from scratch or buy anything to get useful results; you can rent GPUs by the hour on the cloud and refine existing work, no?
fiftyfourseventeen t1_je1gprd wrote
If you just want to change the output of a model to look more like something else already in its training data, sure. LoRA trains low-rank updates to the attention layers (technically it trains a separate pair of small matrices per layer, which can later be merged into the attention weights), so it doesn't add anything NEW per se, but rather focuses on things the model has already learned. For example, if you were to try to make a model work well with a language not in its training data, LoRA is not going to work very well. However, if you want the model to respond in a dialogue format (as is the case with Alpaca), it can work, because the model has already seen dialogue before, so the LoRA makes it "focus" on producing dialogue.
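For intuition, here's a minimal PyTorch sketch of the low-rank idea; the shapes, rank, and scaling are illustrative assumptions, not the actual Alpaca setup:

```python
import torch

d_in, d_out, r = 768, 768, 8   # layer sizes and LoRA rank (illustrative)
alpha = 16                     # LoRA scaling hyperparameter

W = torch.randn(d_out, d_in)   # frozen pretrained attention weight

# LoRA trains two small matrices; their product is a low-rank update.
# Following the LoRA paper, A starts random and B starts at zero,
# so the update is exactly zero before training.
A = torch.randn(r, d_in) * 0.01
B = torch.zeros(d_out, r)

# Training-time forward pass: frozen path plus scaled low-rank path.
x = torch.randn(4, d_in)
y = x @ W.T + (alpha / r) * (x @ A.T @ B.T)

# After training, the adapter folds back into the original weight;
# this is the "merged into the attention weights" step mentioned above.
W_merged = W + (alpha / r) * (B @ A)
```

In practice the Hugging Face peft library handles this for you (a LoraConfig with target_modules pointed at the attention projections, then merge_and_unload() to fold the adapter back in).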
You can get useful results with just LoRA, which is nice. But if you want to experiment with architecture improvements or do large-scale finetunes / training from scratch, you're out of luck unless you have millions of dollars.
I'd say the biggest limitation of LoRA is that the model, for the most part, already has to "know" everything you're trying to get it to do. It's not a good way to add new information to the model (e.g., training it on post-2021 data to make it more up to date). That takes a full finetune, which is a lot more expensive.
As for the cost, I honestly don't know, because these companies don't like to make data like that public. We don't even know for sure what hardware GPT-3 was trained on, although it was likely V100s, and then A100s for GPT-3.5 and 4. I think people calculated the least they could have spent on training was around $4.5 million for GPT-3, and $1.6 million for LLaMA. That doesn't even include all the work that went into building an absolutely massive dataset, or paying employees to figure out how to do distributed training across thousands of nodes with multiple GPUs each.
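For a rough sense of where numbers like that come from, here's a back-of-envelope sketch; every input is an assumption (the standard 6 * params * tokens FLOPs rule of thumb, reported GPT-3 figures, and guessed GPU efficiency and cloud pricing):

```python
# Back-of-envelope GPT-3 training cost; every number here is an assumption.
params = 175e9                 # reported GPT-3 parameter count
tokens = 300e9                 # reported training tokens
flops = 6 * params * tokens    # common estimate of total training FLOPs

peak = 125e12                  # V100 tensor-core peak, FLOP/s
utilization = 0.30             # assumed real-world efficiency
dollars_per_gpu_hour = 2.00    # assumed cloud price

gpu_hours = flops / (peak * utilization) / 3600
cost = gpu_hours * dollars_per_gpu_hour
print(f"~{gpu_hours / 1e6:.1f}M GPU-hours, ~${cost / 1e6:.1f}M")
# prints roughly ~2.3M GPU-hours, ~$4.7M, in the same ballpark as the
# ~$4.5 million estimate above
```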