Submitted by AngrEvv t3_11vb220 in deeplearning

Hi folks,

Our lab plans to purchase a server with some decent GPUs to run pretraining tasks on program code. We won't work on very large LLMs, and we may not even try the T5 model. Currently, we want to start with the RoBERTa model. We have a $50K budget, and it's our first time purchasing a GPU server.

I did some preliminary research and found that the suggested GPU is the A6000 Ada, which has 48 GB of GPU memory, according to https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/. Since our tasks require lots of GPU memory, we think a GPU with more than 32 GB will suit us. So our alternative choices are the RTX A6000 and the A100 80GB HBM2 cards.
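As a rough sanity check on the memory requirement, here is a back-of-envelope estimate assuming a RoBERTa-large-sized model (~355M parameters, an assumption) trained with Adam in mixed precision; the 16-bytes-per-parameter accounting (fp16 weights and grads, fp32 master weights plus two optimizer moments) is the usual heuristic:

```python
# Back-of-envelope GPU memory estimate for pretraining with Adam in mixed precision.
# The parameter count is an assumption (roughly RoBERTa-large); adjust as needed.
params = 355e6
bytes_per_param = 2 + 2 + 4 + 4 + 4   # fp16 weights, fp16 grads, fp32 master, Adam m, Adam v
model_and_optimizer_gb = params * bytes_per_param / 1024**3
print(f"weights + grads + optimizer states: ~{model_and_optimizer_gb:.1f} GB")  # ~5.3 GB
# Activations come on top and scale with batch size * sequence length;
# they often dominate unless activation checkpointing is used.
```

So a RoBERTa-sized model itself fits easily on a 32 GB card; the extra memory mostly buys larger batches and longer sequences.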

Based on these, we got three server specs from Exxact (https://www.exxactcorp.com/TWS-115999024/configurator): (1) a $43K spec with 4 A6000 Ada cards, (2) a $32K spec with 4 RTX A6000 cards, and (3) a $41K spec with 2 A100 80GB cards. The other parts of the specs, e.g., CPU and RAM, are almost the same. I have attached the specs as screenshots.

Now, I have some questions.

  1. The A6000 Ada dropped NVLink (https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-nv-link-even-on-pro-gpus/230874), which is important for performance and GPU memory pooling. Does this mean it's not a good choice to put multiple A6000 Ada cards in one server?
  2. The A6000 Ada is a very new GPU and an improvement over the RTX A6000. But the RTX A6000 still has NVLink, which means the pooled GPU memory could reach 48 * 4 GB when connecting 4 RTX A6000 cards. However, we are going to use the GPU server for several years, and for IT products it's usually better to buy the latest generation. Is that true for GPU cards? The A6000 Ada also has more Tensor and CUDA cores than the RTX A6000.
  3. For the A100 80GB spec, we can only afford 2 cards given the budget. For LLM pretraining, more cards usually mean more parallelism and faster training. Based on my study, the A6000 Ada has comparable performance to the A100 on DL benchmarks. Is this A100 80GB spec a good choice?
  4. Besides the above-mentioned specs, what else would you recommend for our pretraining tasks, especially for GPUs?

Thanks for your time! We really appreciate any suggestions.


Comments


thesupernoodle t1_jcsj2iw wrote

For maybe a few hundred bucks, you can test out the exact configurations you want to buy:

https://lambdalabs.com/service/gpu-cloud

You may even decide that you'd rather just use cloud compute instead of spending all that money upfront. It would only cost you about $19K to run 2x A100 in the cloud 24/365 for a solid year, and that also covers electricity costs.
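The arithmetic behind that figure, assuming an on-demand rate of roughly $1.10 per A100-hour (an assumed rate; check current pricing before relying on it):

```python
# Rough cloud-vs-upfront estimate; the hourly rate is an assumption, not quoted pricing.
rate_per_gpu_hour = 1.10        # assumed on-demand A100 80GB rate, USD
gpus = 2
hours_per_year = 24 * 365
print(f"2x A100 for a year: ~${rate_per_gpu_hour * gpus * hours_per_year:,.0f}")  # ~$19,272
```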


brainhack3r t1_jctctn2 wrote

This is the right answer. Don't guess, test (hey, that rhymed!)

Just make sure your testing mirrors what it would look like to scale up.


FirstOrderCat t1_jcsjgws wrote

They don't have the A6000 Ada yet.


thesupernoodle t1_jcsll6u wrote

Sure; but the broader point is that they can optimize for their needs with some cheap testing: is the model big enough that it wants the extra RAM of an 80 GB A100?


CKtalon t1_jctb1c0 wrote

Do not be tricked by memory pooling. NVLink might not really improve performance on the A6000s by much (it's a different story for the A100s).

I think it will be a tough choice between 2x A100 and 4x 6000 Ada.


bentheaeg t1_jcsvdy1 wrote

I'm not able to reply for sure right now (the A6000 Ada is still missing open tests), and I don't think many people can. I work at a scale-up though (PhotoRoom), and we're getting a 4x A6000 Ada server next week; we're planning to publish benchmarks vs. our other platforms (DGXs, custom servers, ... from A100 to A6000 and 3090). Stay tuned!

From a distance, semi educated guess:

- The A6000 Ada is really, really good at compute. So models that are compute bound (think Transformers with very big embeddings) should do well; models that are more IO bound (convnets, for instance) will not do as well, especially vs. the A100, which has much faster memory.

- The impact of NVLink is not super clear to me; its bandwidth was not that big to begin with anyway. My guess is that it may be more useful for latency-bound inter-GPU communication, like when using syncbatchnorm.

- There are a lot of training tweaks you can use (model or pipeline parallelism, FSDP, grad accumulation to cut down on the comms...), so the best training setup for each platform may differ; it's also a game of apples to oranges, and that's by design. (A rough sketch of what a couple of these tweaks look like in code follows this list.)

- I would take extra care around the cooling system: if you're not a cloud operator, a server going down will be a mess in your lab. This happened to us 3 times in the past 6 months, always because of the cooling. These machines can draw 2 kW+ around the clock, and all that heat has to be extracted; from our limited experience, some setups (even from really big names) are not up to the task and go belly up in the middle of a job. 80GB A100s are 400 to 450 W, A6000s (Ada or not) are 300 W, which is easier to cool if your setup isn't bulletproof. Not a point against the A100 per se, but a point against the A100 plus an unproven cooling system, let's say.
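For the training tweaks mentioned above, a minimal sketch (PyTorch; `build_model()` and `train_loader` are placeholders, and the HuggingFace-style `.loss` output is an assumption) of FSDP sharding combined with gradient accumulation that defers gradient communication to the update step:

```python
import contextlib
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# One process per GPU, e.g. launched with: torchrun --nproc_per_node=4 pretrain.py
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = FSDP(build_model(), device_id=local_rank)   # shards params, grads and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accum_steps = 8                                      # bigger effective batch, fewer sync points
for step, batch in enumerate(train_loader):
    is_update_step = (step + 1) % accum_steps == 0
    # no_sync() skips gradient communication on non-update micro-batches,
    # at the cost of keeping unsharded gradients in memory until the sync step.
    ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
    with ctx:
        loss = model(**batch).loss / accum_steps     # assumes a HF-style model output
        loss.backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()
```

Whether deferring the comms like this is a win depends on interconnect bandwidth, which is exactly where the NVLink-vs-no-NVLink question comes in.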


FirstOrderCat t1_jcsjdve wrote

> Based on my study, the A6000 Ada has comparable performance to the A100 on DL benchmarks. Is this A100 80GB spec a good choice?

It looks like you answered your question yourself: 4x A6000 Ada will give you the best performance.
