Submitted by johnhopiler t3_11a8tru in MachineLearning
Let's assume for a minute one has:
- the necessary compute instances
- enough $ to cough up to rent those instances somewhere
What are the latest "easy" solutions to get `opt`, `bloomz`, and `flan-t5` hosted as API endpoints?
I spent about 2 weeks trying to get seldon-core and MLServer to work with MLServer's HuggingFace wrapper, but I've lost hope at this point. There are so many parameters and tweaks to be mindful of, and I feel like I'm acting as a very crude operating system replacement when I pass a `device_map` to a Python function to tell it how much RAM to use on which device. In what world could Windows 95 manage 4 DIMMs of DDR RAM, yet in 2023 we can't auto-assign model data to the right GPUs?
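For context, this is roughly the kind of manual placement I mean (minimal sketch with transformers + accelerate; the model name and memory limits are placeholders, not my actual setup):

```python
# Minimal sketch of "manual" placement via device_map; needs transformers,
# accelerate, and torch installed. Model name and memory limits are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets accelerate shard layers across the visible GPUs
# (and spill to CPU), optionally constrained by an explicit max_memory map,
# which is exactly the bookkeeping I'd rather not hand-tune.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "64GiB"},
)

inputs = tokenizer("Translate English to German: Hello, world!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```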
So. What's the "right way" to do this? I am aware of
- This repo that has some "demos": https://github.com/huggingface/transformers-bloom-inference
- accelerate library: https://huggingface.co/docs/accelerate/index
- FlexGen: https://github.com/FMInference/FlexGen, but it only supports OPT models and is more of an academic proof of concept than a model-hosting solution
- DeepSpeed: haven't looked deeply into this yet (rough inference sketch below)
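In case DeepSpeed turns out to be the answer, my current (untested) understanding is that inference goes through `deepspeed.init_inference`; the model name, GPU count, and dtype below are assumptions:

```python
# Untested sketch of DeepSpeed tensor-parallel inference (API as of early 2023).
# Launch with: deepspeed --num_gpus 2 serve_opt.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model over 2 GPUs and swap in DeepSpeed's fused inference kernels.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = ds_engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```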
Any pointers would be appreciated. The goal is to get 2-3 models up and running as API endpoints within 2 weeks, and I have a lot of people waiting on me to get this done...
Edit:
I am talking about self-hosted solutions, where the inference input and output are "under your control".
Edit:
What about K8s + a Ray cluster + alpa.ai? After reading up on Ray (which feels like a Spark cluster for ML), it looks like the most industrialised option of everything I've seen so far. Rough Ray Serve sketch below.
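To make that concrete, here is roughly what a single-replica Ray Serve deployment of flan-t5 could look like (untested; the model name, replica count, and GPU count are placeholders, and I haven't wired in alpa):

```python
# Untested Ray Serve (Ray 2.x) sketch; run on a Ray/K8s cluster with GPUs.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class FlanT5:
    def __init__(self):
        from transformers import pipeline
        # One pipeline per replica, pinned to that replica's GPU.
        self.pipe = pipeline("text2text-generation", model="google/flan-t5-xl", device=0)

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        out = self.pipe(payload["prompt"], max_new_tokens=payload.get("max_new_tokens", 64))
        return {"generated_text": out[0]["generated_text"]}

app = FlanT5.bind()
# serve.run(app)  # then POST {"prompt": "..."} to http://127.0.0.1:8000/
```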
Desticheq t1_j9qo0mu wrote
Hugging Face actually allows a fairly easy deployment process for models trained with their framework
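It's not specified whether that means their managed Inference Endpoints or just wrapping a pipeline yourself, but a bare-bones self-hosted endpoint around a transformers pipeline could look like this; FastAPI and the model name are assumptions, not something the commenter mentioned:

```python
# server.py -- minimal self-hosted endpoint sketch (FastAPI + transformers).
# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text2text-generation", model="google/flan-t5-large", device=0)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"generated_text": out[0]["generated_text"]}
```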