kkchangisin t1_j5if8hc wrote
Reply to comment by op_prabhuomkar in [P] Benchmarking some PyTorch Inference Servers by op_prabhuomkar
Depending on how much time I have, there just might be a PR coming your way 😀…
Triton really is somewhat of a hidden gem - the implementation and the toolkit surrounding it are pretty impressive!
kkchangisin t1_j5gcgbe wrote
Nice work! Triton already looks good but have you tried optimizing with the Triton Model Analyzer?
https://github.com/triton-inference-server/model_analyzer
With various models I run on Triton, I've found that the model formats and configurations it outputs can provide drastically better performance, whether that's throughput, latency, etc.
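For reference, a typical run looks roughly like this - a sketch from memory, so the exact flags may differ by Model Analyzer version, and the paths/model name are just placeholders:

```
# Sketch only - flags may vary between Model Analyzer versions.
# Profiles the model under different configs (batch sizes, instance counts, etc.)
# and writes the best-performing config variants to the output repository.
model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models my_model \
    --output-model-repository-path /path/to/output_repo
```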
Hopefully I get some time soon to try it out myself!
Again, nice work!
kkchangisin t1_j1bjz3f wrote
Reply to comment by Soc13In in [D] When chatGPT stops being free: Run SOTA LLM in cloud by _underlines_
CUDA only
kkchangisin t1_j5ijvdy wrote
Reply to comment by NovaBom8 in [P] Benchmarking some PyTorch Inference Servers by op_prabhuomkar
Looking at the model configs in the repo, there's definitely dynamic batching going on.
What's really interesting is that even with the default dynamic batching parameters, the response times are superior and very consistent.
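For anyone curious, turning on dynamic batching with defaults in a Triton config.pbtxt is basically one extra line - something like the sketch below (model name, backend, and batch size are illustrative, not taken from the repo):

```
# Hypothetical config.pbtxt - values are illustrative, not from the benchmark repo.
name: "my_model"
backend: "pytorch"
max_batch_size: 8

# An empty block enables dynamic batching with default queue delay and
# preferred batch sizes; tuning these is where Model Analyzer helps.
dynamic_batching { }
```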