Submitted by johnhopiler t3_11a8tru in MachineLearning
Let's assume for a minute one has:
- the necessary compute instances
- enough $ to cough up to rent those instances somewhere
What are the latest "easy" solutions to get `opt`, `bloomz`, and `flan-t5` hosted as API endpoints?
I spent about 2 weeks trying to get seldon-core and MLServer to work with MLServer's HuggingFace wrapper, but I've lost hope at this point. There are so many parameters and tweaks to be mindful of, and I feel like I'm acting as a very crude operating system replacement when I pass a `device_map` to a Python function to tell it how much RAM to use on which device. In what world could Windows 95 manage 4 DIMMs of DDR RAM, yet in 2023 we can't auto-assign model data to the right GPUs?
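For context, this is roughly the kind of manual placement I mean (minimal sketch with transformers + accelerate; the model name and memory limits are placeholders, not my actual setup):

```python
# Minimal sketch of "manual" placement via device_map; needs transformers,
# accelerate, and torch installed. Model name and memory limits are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets accelerate shard layers across the visible GPUs
# (and spill to CPU), optionally constrained by an explicit max_memory map,
# which is exactly the bookkeeping I'd rather not hand-tune.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "64GiB"},
)

inputs = tokenizer("Translate English to German: Hello, world!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```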
So. What's the "right way" to do this? I am aware of
- This repo that has some "demos": https://github.com/huggingface/transformers-bloom-inference
- accelerate library: https://huggingface.co/docs/accelerate/index
- FlexGen: https://github.com/FMInference/FlexGen, but it only supports OPT models and is more of an academic proof of concept than a model-hosting solution
- DeepSpeed: haven't looked deeply into this yet (rough inference sketch below)
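In case DeepSpeed turns out to be the answer, my current (untested) understanding is that inference goes through `deepspeed.init_inference`; the model name, GPU count, and dtype below are assumptions:

```python
# Untested sketch of DeepSpeed tensor-parallel inference (API as of early 2023).
# Launch with: deepspeed --num_gpus 2 serve_opt.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model over 2 GPUs and swap in DeepSpeed's fused inference kernels.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = ds_engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```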
Any pointers would be appreciated. The goal is to get 2-3 models up and running as API endpoints within 2 weeks, and I have a lot of people waiting on me to get this done...
Edit:
I am talking about self-hosted solutions, where the inference input and output are "under your control".
Edit:
What about K8s + a Ray cluster + alpa.ai? After reading up on Ray (which feels like a Spark cluster for ML), it looks like the most industrialised option of everything I've seen so far. Rough Ray Serve sketch below.
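To make that concrete, here is roughly what a single-replica Ray Serve deployment of flan-t5 could look like (untested; the model name, replica count, and GPU count are placeholders, and I haven't wired in alpa):

```python
# Untested Ray Serve (Ray 2.x) sketch; run on a Ray/K8s cluster with GPUs.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class FlanT5:
    def __init__(self):
        from transformers import pipeline
        # One pipeline per replica, pinned to that replica's GPU.
        self.pipe = pipeline("text2text-generation", model="google/flan-t5-xl", device=0)

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        out = self.pipe(payload["prompt"], max_new_tokens=payload.get("max_new_tokens", 64))
        return {"generated_text": out[0]["generated_text"]}

app = FlanT5.bind()
# serve.run(app)  # then POST {"prompt": "..."} to http://127.0.0.1:8000/
```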
Desticheq t1_j9qo0mu wrote
Hugging Face actually allows a fairly easy deployment process for models trained with their framework
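It's not specified whether that means their managed Inference Endpoints or just wrapping a pipeline yourself, but a bare-bones self-hosted endpoint around a transformers pipeline could look like this; FastAPI and the model name are assumptions, not something the commenter mentioned:

```python
# server.py -- minimal self-hosted endpoint sketch (FastAPI + transformers).
# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text2text-generation", model="google/flan-t5-large", device=0)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"generated_text": out[0]["generated_text"]}
```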