Submitted by _learn_faster_ in r/MachineLearning

_learn_faster_ (OP) wrote, replying to machineko in [D] Faster Flan-T5 inference:
We have GPUs (e.g., A100s), but we can only use one GPU per request (no multi-GPU inference). We're also willing to take a small accuracy hit.
What do you think would be best for us?
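For concreteness, our single-GPU baseline looks roughly like the sketch below, assuming a transformers + bitsandbytes stack; the checkpoint name, prompt, and the int8 flag are illustrative assumptions, not our exact setup:

```python
# Minimal sketch: Flan-T5 inference on one GPU with int8 weight quantization
# (roughly half the memory, usually a small accuracy hit).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-xl"  # illustrative; any Flan-T5 size works

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",   # places the whole model on the single visible GPU
    load_in_8bit=True,   # bitsandbytes int8; trades a little accuracy for memory
)

inputs = tokenizer(
    "Translate English to German: How old are you?",
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```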
When you say compression, do you mean things like pruning and distillation?
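By pruning, I'm picturing something like the hedged sketch below using `torch.nn.utils.prune`; the 30% sparsity and the Linear-only filter are arbitrary illustrative choices, and unstructured zeros alone won't speed up inference without sparse kernels or structured pruning:

```python
# Sketch of unstructured magnitude pruning via torch.nn.utils.prune:
# zeroes the smallest-magnitude weights in each linear layer.
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model: nn.Module, amount: float = 0.3) -> None:
    """Zero out the `amount` fraction of smallest-magnitude weights per Linear."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the pruning mask into the weights

# e.g. magnitude_prune(model, amount=0.3) after loading the model as above
```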