
one_eyed_sphinx OP t1_j7tzoiq wrote

>eco

so this is the fine point that I want to understand. what I am trying to optimize with the build is the data transfer time: how much time it takes to load a model from RAM to VRAM. if I have 10 models that need 16 GB of VRAM to run, they need to share resources. so I want to "memory hot swap" (I don't know if there is a proper term for it, I found "Bin packing") the models on an incoming request. so the data transfer is somewhat critical from my point of view, and as I understand it, only the PCI speed is the bottleneck here. correct me if I'm wrong.
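One common way to approach this kind of "hot swap" is a least-recently-used (LRU) cache over the models resident in VRAM. The sketch below is only an illustration of that idea, not something proposed in this thread: `VramModelCache`, `load_fn`, and `unload_fn` are hypothetical names, and the per-model sizes and VRAM capacity are assumed to be known up front.

```python
from collections import OrderedDict


class VramModelCache:
    """Keep recently used models resident; evict the least recently used one."""

    def __init__(self, capacity_gb, load_fn, unload_fn):
        self.capacity_gb = capacity_gb
        self.load_fn = load_fn          # hypothetical: copies a model from RAM to VRAM
        self.unload_fn = unload_fn      # hypothetical: frees that model's VRAM
        self.resident = OrderedDict()   # name -> (handle, size_gb), oldest first
        self.used_gb = 0.0

    def get(self, name, size_gb):
        # Cache hit: mark the model as most recently used, no transfer needed.
        if name in self.resident:
            self.resident.move_to_end(name)
            return self.resident[name][0]

        if size_gb > self.capacity_gb:
            raise ValueError(f"{name} ({size_gb} GB) can never fit in VRAM")

        # Cache miss: evict least recently used models until the new one fits.
        while self.used_gb + size_gb > self.capacity_gb:
            _, (old_handle, old_size) = self.resident.popitem(last=False)
            self.unload_fn(old_handle)
            self.used_gb -= old_size

        handle = self.load_fn(name)
        self.resident[name] = (handle, size_gb)
        self.used_gb += size_gb
        return handle
```

With a cache like this, only requests for a model that is not already resident pay the RAM to VRAM transfer cost, so how often the swap happens depends on the request pattern as much as on the hardware.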

1

suflaj t1_j7u2qyt wrote

You want eco mode so the build runs cooler and more efficiently. As I said, the bottleneck is in the GPU, specifically its memory bandwidth, not in whatever the CPU can transfer. Modern CPUs can easily handle 3 high end GPUs at the same time, not just 2.

PCI speed has not been a bottleneck for several years, and will probably never be a bottleneck again with this form factor of GPUs. The GPU MEMORY is the bottleneck nowadays.

EDIT: And as someone else has said, yeah, you can use fast NVMe drives as swap to avoid loading from disk. There used to be Optane for this kind of stuff, but well, that's dead.

2

one_eyed_sphinx OP t1_j7yqh5v wrote

>NVME

yeah, the GPU memory is a horrible bottleneck. I am trying to find ways to get around it, but it doesn't seem like there are many best practices for it. is there a way to use pinned memory for faster model data transfer?

1

suflaj t1_j7yr906 wrote

If GPU memory is the bottleneck then there is nothing you can viably do about that. If your GPU can't load the memory faster then you will need to get more rigs and GPUs if you want to speed up the loading in parallel.

Or you could try to quantize your models into something smaller that can fit in the memory, but then we're talking model surgery, not hardware.

2

allanmeter t1_j7ytp7i wrote

This is really good advice! Preprocessing input data for both training and inference is the best route to get efficiencies. Don’t feed it a crazy large multidimensional dataset; try to break it up and see whether you can use old-fashioned methods like windowing and downsampling.

The model parameter type is important too. If you’re running fp64 then you will struggle vs a model that’s just int8. If you have mixed precision weights then you really need to think about looking at AWS Sage and get a pipeline going.
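To put rough numbers on the fp64 vs int8 point, here is a small sketch that estimates the memory taken by the weights alone. The function name and the 2-billion-parameter example are made up for illustration, and the estimate ignores activations, optimizer state, and framework overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp64": 8, "fp32": 4, "fp16": 2, "int8": 1}


def weights_size_gb(num_params, dtype):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical model with 2 billion parameters.
for dtype in ("fp64", "fp32", "fp16", "int8"):
    print(dtype, weights_size_gb(2_000_000_000, dtype), "GB")
```

Under these assumptions the same model is 8x smaller in int8 than in fp64, which shrinks both the VRAM it occupies and the amount of data that has to be moved on each load.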

To OP, maybe you can share a little context on what models you’re looking to run? Or input data context.

1

allanmeter t1_j7u6v50 wrote

Yes, the RAM to VRAM transfer is not as crazy important as you think. Previously we hit this issue on the 3000 series as well, and as a result we supplemented with a full TB of RAM, but it still was not enough. Some models are incredibly greedy.
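For a sense of scale, a back-of-the-envelope estimate of the RAM to VRAM copy time can be sketched as below. The bandwidth figures are the nominal per-direction rates for a x16 PCIe link; the `efficiency` factor of 0.8 is an assumed fudge factor, not a measurement, and real transfers depend on pinned memory, drivers, and the rest of the system.

```python
# Nominal per-direction bandwidth of a x16 link, in GB/s.
PCIE_X16_GBPS = {"3.0": 15.75, "4.0": 31.5}


def min_transfer_seconds(size_gb, pcie_gen="4.0", efficiency=0.8):
    """Rough lower bound on the time to copy size_gb from RAM to VRAM.

    efficiency is an assumed fraction of the nominal bandwidth actually achieved.
    """
    return size_gb / (PCIE_X16_GBPS[pcie_gen] * efficiency)


# A 16 GB model, as in the original question.
for gen in ("3.0", "4.0"):
    print(f"PCIe {gen} x16: ~{min_transfer_seconds(16, gen):.2f} s")
```

Under these assumptions a 16 GB model takes on the order of a second or less to copy, so whether that matters depends on how often a request forces a swap.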

If you are on Linux, which is highly encouraged, also look to optimise your storage tier for swap memory, which is similar to page files in Windows. You can define and mount extended swap disks, which you can trick out with multi-TB NVMe drives. It's not the same performance as RAM, but it's a last-step optimisation before you need to consider going to Quadro.

2