LetterRip t1_j85b07d wrote
Reply to comment by norcalnatv in The Inference Cost Of Search Disruption – Large Language Model Cost Analysis [D] by norcalnatv
Why not int4? Why not pruning? Why not other model compression tricks? int4 halves latency; at a minimum they would run mixed int4/int8 quantization, e.g. ZeroQuant:
https://arxiv.org/abs/2206.01861
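For a sense of how cheap that lever is to pull today, here's a minimal sketch using the Hugging Face transformers + bitsandbytes integration (requires bitsandbytes and accelerate; the model choice and settings are illustrative, not a claim about what any search deployment actually runs):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative only: load a causal LM with 4-bit weights.
# Per-layer mixed int4/int8 (as in the ZeroQuant paper) goes further,
# but this shows the basic latency/memory lever.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # int4 weight storage, fp16 compute
)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    quantization_config=quant_config,
    device_map="auto",
)
```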
Why not distillation?
https://transformer.huggingface.co/model/distil-gpt2
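distilgpt2 is already on the Hub and loads like any other causal LM; it has roughly 82M parameters vs ~124M for GPT-2 small, trading a little quality for faster, cheaper decoding:

```python
from transformers import pipeline

# Distilled GPT-2: smaller and faster than its teacher at a modest quality cost.
generator = pipeline("text-generation", model="distilgpt2")
print(generator("The inference cost of search is", max_new_tokens=20)[0]["generated_text"])
```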
NVIDIA, using FasterTransformer with the Triton Inference Server, reports a 32x speedup over baseline GPT-J.
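Querying a FasterTransformer model behind Triton looks roughly like this; the tritonclient calls are the real API, but the tensor names, shapes, and dtypes depend on the deployed model's config.pbtxt, so treat them as placeholders:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder tensor names/dtypes; match them to your model's config.pbtxt.
input_ids = np.array([[818, 262, 3726]], dtype=np.uint32)   # tokenized prompt
ids = httpclient.InferInput("input_ids", input_ids.shape, "UINT32")
ids.set_data_from_numpy(input_ids)

out_len = np.array([[32]], dtype=np.uint32)                 # tokens to generate
req_len = httpclient.InferInput("request_output_len", out_len.shape, "UINT32")
req_len.set_data_from_numpy(out_len)

result = client.infer(model_name="fastertransformer", inputs=[ids, req_len])
print(result.as_numpy("output_ids"))
```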
I think their assumptions are pessimistic by at least an order of magnitude.
As someone else notes, the vast majority of queries can be cached. There would also likely be a mixture-of-experts-style setup: no need for the heavy-duty model when a trivial model can answer the question. Something like the sketch below.
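A back-of-envelope sketch of the cache-plus-router idea; everything here (the difficulty heuristic, the stub models, the cache size) is hypothetical plumbing, not anything from the article:

```python
from functools import lru_cache

def is_trivial(query: str) -> bool:
    # Stand-in for a real difficulty classifier.
    return len(query.split()) < 8

def small_model(query: str) -> str:
    return f"[small model] answer to: {query}"   # e.g. distilled/quantized model

def large_model(query: str) -> str:
    return f"[large model] answer to: {query}"   # full-size LLM, used rarely

def normalize(query: str) -> str:
    # Collapse trivial variation so near-duplicate queries share a cache entry.
    return " ".join(query.lower().split())

@lru_cache(maxsize=1_000_000)
def answer(query: str) -> str:
    # Cached queries cost nothing to re-serve; uncached ones are routed by difficulty.
    if is_trivial(query):
        return small_model(query)
    return large_model(query)

print(answer(normalize("What is the capital of France?")))
```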