Submitted by _learn_faster_ t3_1194vcc in MachineLearning
What's the best way to improve the inference speed of a Flan-T5 model?
ONNX Runtime doesn't seem to work for T5 models & TorchScript also doesn't seem to speed it up (not sure why!)
LetterRip t1_j9ker51 wrote
See this tutorial: it converts the model to ONNX on CPU, then to TensorRT, for a 3-6x speedup.
https://developer.nvidia.com/blog/optimizing-t5-and-gpt-2-for-real-time-inference-with-tensorrt/
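If the full TensorRT pipeline is too involved, a lighter-weight option is Hugging Face Optimum, which wraps ONNX Runtime and handles the separate encoder/decoder export that makes naive T5-to-ONNX conversion fail. A minimal sketch, assuming a recent `optimum` install (the model name and prompt are just examples; on older optimum versions the export flag is `from_transformers=True` instead of `export=True`):

```python
# pip install optimum[onnxruntime] transformers
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_id = "google/flan-t5-small"  # example checkpoint; swap in your Flan-T5 size

# Export the encoder/decoder to ONNX and load them under ONNX Runtime
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate as you would with a regular transformers model
inputs = tokenizer("Translate English to German: How old are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The exported model keeps the usual `generate()` interface, so it drops into existing code; any further gains from TensorRT come on top of this, per the tutorial above.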