Submitted by [deleted] t3_zrx665 in MachineLearning
[deleted]
Yeah, I mean OPT. We are already using AWS for some tests, but it is expensive because we run a lot of experiments. My idea is to train models on custom PCs and then deploy them on AWS for inference, since it is a B2C SaaS.
[removed]
If you used Colossal-AI, you could do it locally on a reasonable budget of 2-3k. Without that, it depends on how much you are using it. If you use it a lot, you can do a home setup for anywhere from 32-64k. The cheapest 8-GPU NVIDIA home server is around 12k, and you need 4 of them to reach 500 GB of VRAM (32k new, with 16 GB GPUs at ~1k each). With everything used, maybe you could get down to 8k (this assumes you could do 16 servers with 4 GPUs each, all used). https://alpa.ai/tutorials/opt_serving.html
TL;DR: the largest OPT model is going to require some model pipelining to keep costs down.
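To make the GPU-count arithmetic in the comment above easier to check, here is a minimal sketch. It uses only the figures quoted there (500 GB VRAM target, 16 GB per GPU, 8 GPUs per server, ~1k per new GPU); the function names are hypothetical, and the cost covers GPUs only, not chassis, CPUs, or power.

```python
import math

def gpus_needed(target_vram_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest number of GPUs whose combined VRAM reaches the target."""
    return math.ceil(target_vram_gb / vram_per_gpu_gb)

def servers_needed(gpu_count: int, gpus_per_server: int) -> int:
    """Smallest number of servers that can hold gpu_count GPUs."""
    return math.ceil(gpu_count / gpus_per_server)

# Figures quoted in the comment; prices are rough and change quickly.
gpus = gpus_needed(target_vram_gb=500, vram_per_gpu_gb=16)
servers = servers_needed(gpus, gpus_per_server=8)
gpu_cost = gpus * 1_000  # ~1k per new 16 GB GPU, GPUs only

print(f"{gpus} GPUs in {servers} servers, about {gpu_cost} for the GPUs alone")
```

Swapping in 24 GB cards or a different VRAM target only changes the two arguments, which makes it easy to compare setups before buying anything.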
If slow training is acceptable, you can use DeepSpeed with the weights offloaded to an NVMe drive (DeepSpeed ZeRO-Infinity). It will take significantly longer to fine-tune, but it dramatically lowers the hardware investment.
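For reference, a rough sketch of what a ZeRO-Infinity style DeepSpeed config with NVMe offload might look like, built as a Python dict and dumped to JSON. The `/local_nvme` path, batch size, and fp16 setting are assumptions for illustration, not tuned values; check the DeepSpeed configuration docs for your version before relying on any of it.

```python
import json

# Hypothetical mount point of a fast local NVMe drive; adjust to your machine.
NVME_PATH = "/local_nvme"

ds_config = {
    # Assumed values for illustration only.
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        # Stage 3 partitions parameters, gradients, and optimizer states.
        "stage": 3,
        # Offload parameters and optimizer states to NVMe instead of GPU memory.
        "offload_param": {"device": "nvme", "nvme_path": NVME_PATH},
        "offload_optimizer": {"device": "nvme", "nvme_path": NVME_PATH},
    },
}

# Print the JSON you would save as a config file and pass to DeepSpeed.
print(json.dumps(ds_config, indent=2))
```

The trade-off is the one described above: every step streams state to and from disk, so throughput drops sharply, but the GPU only needs to hold a slice of the model at a time.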
It depends. https://vast.ai/ is great but has certain limitations. If you can run on 1-4 RTX 3090 cards (24 GB each), rolling your own is going to be the best value. 4090s are of course good too, but you'd need to find a good deal to make them worth it over the 3090s. You can always start with one card and go from there. Otherwise you'll be paying 10x more for some A100s. The first step is to get a really good handle on how much compute you're actually using and what the smallest GPU/VRAM size is that works efficiently for your data.
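As a first-pass way to size this, the sketch below estimates which OPT checkpoint could fit on 1-4 24 GB cards if you count only fp16 weights (2 bytes per parameter). That is an optimistic bound: activations, the KV cache, and especially optimizer states during training need considerably more, so treat the result as an upper limit rather than a recommendation. The function names are hypothetical.

```python
# Published OPT model sizes, in billions of parameters.
OPT_PARAMS_BILLIONS = [0.125, 0.35, 1.3, 2.7, 6.7, 13, 30, 66, 175]
BYTES_PER_PARAM_FP16 = 2

def fp16_weights_gb(params_billions: float) -> float:
    """Approximate fp16 weight size in GB (1 GB taken as 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM_FP16

def largest_fitting(num_cards: int, vram_per_card_gb: float = 24):
    """Largest OPT size whose fp16 weights alone fit in the combined VRAM."""
    budget_gb = num_cards * vram_per_card_gb
    fitting = [p for p in OPT_PARAMS_BILLIONS if fp16_weights_gb(p) <= budget_gb]
    return max(fitting) if fitting else None

for cards in (1, 2, 4):
    print(f"{cards} x 24 GB: up to OPT-{largest_fitting(cards)}B (weights only)")
```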
Very difficult to give a definite answer without knowing more about your situation. Things to consider:
P.S. I work for AWS, so I probably have a bias.
rhofour t1_j153vzt wrote
GPT is not a public model. You can't train or run it yourself.
I just checked and saw that OpenAI does have a fine-tuning API, so you can fine-tune and use the model through their API, but in that case your hardware doesn't matter.
You can look at open-source reproductions of GPT like OPT, but it will be very expensive to get the hardware to run the model, let alone train it. If you really want to use one of these huge models yourself (and not through an API), I'd advise starting with AWS before you consider buying any hardware.