
LetterRip t1_j85b07d wrote

Why not int4? Why not pruning? Why not other model compression tricks? Int4 roughly halves latency; at a minimum they would use mixed int4/int8 quantization.

https://arxiv.org/abs/2206.01861
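
For context, post-training int8/int4 quantization mostly comes down to storing weights at low precision with a scale factor and dequantizing (or using integer kernels) at inference time. A minimal per-tensor int8 sketch, purely illustrative and not the ZeroQuant recipe from the paper above:

```python
# Minimal sketch of symmetric per-tensor int8 weight quantization.
# Real deployments use optimized kernels (e.g. ZeroQuant, bitsandbytes).
import numpy as np

def quantize_int8(w: np.ndarray):
    """Quantize a float32 weight tensor to int8 with one scale per tensor."""
    scale = np.abs(w).max() / 127.0                      # map max magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor for use at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                # small quantization error
```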

Why not distillation?

https://transformer.huggingface.co/model/distil-gpt2
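
Distillation trains a small student to match a large teacher's output distribution. A rough sketch of the standard loss; the temperature and weighting here are illustrative assumptions, not DistilGPT-2's exact training recipe:

```python
# Sketch of a knowledge-distillation loss: soft targets from the teacher's
# temperature-scaled distribution plus hard targets from the ground truth.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL divergence between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy on the ground-truth next tokens.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard

# Toy example: batch of 2 sequences, 8 tokens, vocab of 100.
s = torch.randn(2, 8, 100)
t = torch.randn(2, 8, 100)
y = torch.randint(0, 100, (2, 8))
print(distillation_loss(s, t, y))
```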

NVIDIA, using FasterTransformer and the Triton Inference Server, reports a 32x speedup over baseline GPT-J:

https://developer.nvidia.com/blog/deploying-gpt-j-and-t5-with-fastertransformer-and-triton-inference-server/

I think their assumptions are at least an order of magnitude too pessimistic.

As someone else noted, the vast majority of queries can be cached. There would also likely be mixture-of-experts-style routing: no need for the heavy-duty model when a trivial model can answer the question.
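
Roughly, caching plus routing could look like the sketch below. The model stand-ins and the "easy query" heuristic are hypothetical; a real system would use a learned router and a proper cache key normalization.

```python
# Sketch of "cache + route": serve repeated queries from a cache and send easy
# ones to a small model, reserving the large model for the rest.
from functools import lru_cache

def small_model(prompt: str) -> str:
    # Hypothetical stand-in for a distilled / lightweight model.
    return f"[small-model answer to: {prompt}]"

def large_model(prompt: str) -> str:
    # Hypothetical stand-in for the full-size model.
    return f"[large-model answer to: {prompt}]"

def is_easy(prompt: str) -> bool:
    # Hypothetical routing heuristic; in practice this might be a classifier.
    return len(prompt.split()) < 10

@lru_cache(maxsize=100_000)          # repeated prompts are served from cache
def answer(prompt: str) -> str:
    return small_model(prompt) if is_easy(prompt) else large_model(prompt)

print(answer("capital of France?"))  # routed to the small model
print(answer("capital of France?"))  # second call is a cache hit
```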
