Submitted by PK_thundr t3_11rnppe in MachineLearning

We have a 20-40k budget at our lab, and we are interested in training LLMs on data protected by HIPAA, which restricts which cloud providers we can use. We'd need a compute environment with 256 GB of VRAM.

Would it be better to use AWS EC2 P3 instances or Google Cloud instead of trying to build our own server for this? We could spend the budget on a local server, but would it be obsolete within 2 years once the next-gen GPUs are released?

11

Comments


CKtalon t1_jc9hm91 wrote

I don't think a 40K budget can get you a machine with 256GB VRAM. It's barely enough to get 8x RTX 6000 Ada, and that's ignoring the high-end workstation or server-grade CPU and motherboard you would need to support 8 cards.

5

Disastrous_Elk_6375 t1_jc9ks2y wrote

There's a rent vs buy section in that article. It basically comes down to how much you will use the box versus how often and how fast you need to test things out. They go through energy costs and all that in the article. Just plug in your figures and see what the output is.
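For a rough idea of what "plug in your figures" looks like, here is a minimal break-even sketch. It is not the article's calculator: the function name, the parameters, and every number in the example call are hypothetical placeholders, and it ignores resale value, staff time, and cooling.

```python
def break_even_hours(purchase_cost, cloud_rate_per_hour, power_kw, electricity_per_kwh):
    """Hours of use after which owning the box is cheaper than renting.

    All inputs are assumptions you supply yourself:
    purchase_cost        up-front price of the local server
    cloud_rate_per_hour  hourly price of a comparable cloud instance
    power_kw             average power draw of the local server in kW
    electricity_per_kwh  your electricity price per kWh
    """
    # Running the local box still costs electricity for every hour of use.
    local_cost_per_hour = power_kw * electricity_per_kwh
    saving_per_hour = cloud_rate_per_hour - local_cost_per_hour
    if saving_per_hour <= 0:
        # Renting is never more expensive per hour, so buying never pays off.
        return float("inf")
    return purchase_cost / saving_per_hour


# Placeholder figures only; replace them with real quotes.
hours = break_even_hours(
    purchase_cost=40_000,
    cloud_rate_per_hour=13.0,
    power_kw=3.0,
    electricity_per_kwh=0.15,
)
print(f"Break-even after roughly {hours:.0f} hours of use")
```

If you expect to use the machine for fewer hours than the break-even figure over its useful life, renting likely comes out ahead under these assumptions.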

7

sanderbaduk t1_jc9o4hm wrote

Training for what? Classification, embedding, generation?

0

ggf31416 t1_jca7zwz wrote

https://fullstackdeeplearning.com/cloud-gpus/

Your best bet to reach 256 GB in the cloud would be Azure with 4x80GB A100 instances. However, your 40k budget will only buy you 3000 hours of on-demand compute at best, with spot instances stretching that a bit further.
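To put the 3000-hour figure in context, the sketch below backs out the hourly rate it implies and the share of a two-year window it covers. The function names are made up for illustration, and the only inputs are the 40k budget and 3000 hours quoted above; actual Azure pricing is not assumed.

```python
HOURS_PER_YEAR = 24 * 365


def implied_hourly_rate(budget, hours):
    """Hourly price implied by spending the whole budget over the given hours."""
    return budget / hours


def utilization(hours, years):
    """Fraction of wall-clock time the given hours cover over a span of years."""
    return hours / (HOURS_PER_YEAR * years)


rate = implied_hourly_rate(40_000, 3_000)
share = utilization(3_000, 2)
print(f"Implied on-demand rate: about {rate:.2f} per hour")
print(f"3000 hours is about {share:.0%} of two years of wall-clock time")
```

So the cloud budget only covers the need if the lab expects to train for a fairly small fraction of the two-year window the question asks about.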

If that's not enough for you, then you will have to figure out how to build a server with RTX 6000 Ada cards, which have 48GB each. RTX 4090s would be cheaper, but there may be legal issues due to the gaming driver license, you would need to use multiple servers or strongly cap the power limit because of their power draw, and Nvidia dropped P2P, which may or may not matter depending on how much communication you need between the GPUs (https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366)
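As a quick sanity check on card counts, this sketch computes how many GPUs it takes to reach 256 GB of raw VRAM. The helper name is hypothetical. The 48GB and 80GB figures come from the thread; the 24GB figure for the RTX 4090 is added here as an assumption. It counts capacity only and says nothing about whether a given model shards cleanly across that many cards.

```python
import math


def cards_needed(target_vram_gb, vram_per_card_gb):
    """Smallest number of cards whose combined VRAM meets the target."""
    return math.ceil(target_vram_gb / vram_per_card_gb)


TARGET_GB = 256

# VRAM per card in GB; the 24 GB entry for the RTX 4090 is an assumption
# not stated in the thread.
options = {
    "48GB workstation card": 48,
    "80GB A100": 80,
    "RTX 4090 (24GB)": 24,
}

for name, vram in options.items():
    print(f"{name}: {cards_needed(TARGET_GB, vram)} cards")
```

The card count matters for more than price: it drives the power budget and the number of servers, which is where the 4090 option gets complicated.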

3

I_draw_boxes t1_jcia41b wrote

A fix for the Nvidia driver is forthcoming for the P2P-related issue with PyTorch DDP training. The 3090 didn't support P2P either, and the bug fix won't enable P2P for the 4090, but it will correct the issue, and training should be much faster once it is fixed.

1

londons_explorer t1_jcpzan9 wrote

I would make 'fake' data that isn't HIPAA-protected and do most of your work on that.

Then do a final fine-tuning on the HIPAA data on some rented servers. Your HIPAA data probably isn't more than a few hundred billion words anyway, so a few full passes over the dataset for fine-tuning should be quick and cheap.

1