Submitted by AngrEvv t3_11vb220 in deeplearning
Hi folks,
Our lab plans to purchase a server with some decent GPUs to run pretraining tasks on program code. We won't work on very large LLMs, and we may not even try the T5 model. For now, we want to start with the RoBERTa model. We have a $50K budget, and this is our first time purchasing a GPU server.
I did some preliminary research, and the suggested GPU is the A6000 Ada, which has 48 GB of GPU memory, according to https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/. Since our tasks require lots of GPU memory, we think cards with more than 32 GB each would suit us. So our alternative choices are the RTX A6000 and the A100 80GB HBM2.
Based on these, we got three server specs from Exxact (https://www.exxactcorp.com/TWS-115999024/configurator): (1) a $43K spec with 4 A6000 Ada cards, (2) a $32K spec with 4 RTX A6000 cards, and (3) a $41K spec with 2 A100 80GB cards. The other parts of the specs, e.g., CPU and RAM, are almost identical. I have attached the specs as screenshots.
Now, I have some questions.
- The A6000 Ada dropped NVLink (https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-nv-link-even-on-pro-gpus/230874), which is very important for performance and GPU memory pooling. Given that, is it still a good choice to put multiple A6000 Ada cards in one server?
- The A6000 Ada is a newer GPU that improves on the RTX A6000, with more Tensor and CUDA cores. But the older RTX A6000 does support NVLink, which means the server's GPU memory could reach 48 * 4 GB when connecting 4 RTX A6000 cards. However, we will use this server for several years, and for IT products it is usually better to buy the latest generation. Is that true for GPUs as well?
- For the A100 80GB spec, we can only afford 2 cards given the budget. For LLM pretraining, more cards usually mean more parallelism and faster training. Based on my reading, the A6000 Ada has comparable performance to the A100 on DL benchmarks. Is the A100 80GB spec still a good choice?
- Besides the specs above, what else would you recommend for our pretraining tasks, especially regarding GPUs?
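For context on the memory question, here is a back-of-the-envelope sketch. The per-replica figure rests on my own assumptions (RoBERTa-large at roughly 355M parameters, fp32 Adam training at about 16 bytes of static state per parameter); the aggregate totals just restate the card counts from the specs above. Activation memory, which grows with batch size and sequence length, comes on top of the static state.

```python
# Rough static-memory estimate for pretraining RoBERTa-large with Adam.
# Assumptions (mine, for illustration): ~355M parameters, fp32 weights
# (4 B) + fp32 gradients (4 B) + Adam's two moment buffers (8 B),
# i.e. ~16 bytes per parameter before activations.
PARAMS = 355e6
BYTES_PER_PARAM = 4 + 4 + 8

static_gb = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"static training state: ~{static_gb:.1f} GB per replica")

# Aggregate GPU memory of the three quoted specs (numbers from the post):
specs = {
    "4x A6000 Ada": 4 * 48,
    "4x RTX A6000": 4 * 48,
    "2x A100 80GB": 2 * 80,
}
for name, total_gb in specs.items():
    print(f"{name}: {total_gb} GB total")
```

The point of the sketch: a RoBERTa-scale model fits comfortably on any single 48 GB or 80 GB card, so the spec choice is more about data-parallel throughput than about squeezing the model into memory.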
Thanks for your time! We really appreciate any suggestions.
thesupernoodle t1_jcsj2iw wrote
For maybe a few hundred bucks, you can test out the exact configurations you want to buy:
https://lambdalabs.com/service/gpu-cloud
You may even decide that you'd rather just use cloud compute instead of spending all that money upfront. It would only cost you about $19K to run 2x A100s in the cloud 24/365 for a solid year, and that also effectively covers electricity costs.
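To sanity-check that figure: assuming an on-demand rate of about $1.10 per A100 per hour (my assumption for illustration; check Lambda's current pricing), two A100s running around the clock for a year land right around $19K:

```python
# Sanity check of the ~$19K/year figure for 2x A100 in the cloud.
# $1.10 per GPU-hour is an assumed on-demand rate, not a quoted price.
RATE_PER_GPU_HOUR = 1.10
GPUS = 2
HOURS_PER_YEAR = 24 * 365  # 8760 hours

annual_cost = RATE_PER_GPU_HOUR * GPUS * HOURS_PER_YEAR
print(f"~${annual_cost:,.0f} per year")
```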