Submitted by hx-zero t3_zl03b0 in MachineLearning
We made a library for inference/fine-tuning of open 175B+ language models (like BLOOM) using Colab or a desktop GPU. You join forces with other people over the Internet (BitTorrent-style), each peer running a small fraction of the model's layers. Check out our Colab example!
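For a taste, inference looks roughly like this (a minimal sketch in the spirit of the Colab notebook; the exact class and checkpoint names are from the initial release and may have changed since):

```python
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

# Connect to the public swarm: only a few layers run locally,
# the rest are served by other peers over the Internet.
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-petals")
model = DistributedBloomForCausalLM.from_pretrained("bigscience/bloom-petals")

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```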
Thing is, even though the BLOOM weights were publicly released, it was extremely difficult to run inference efficiently unless you had enough hardware to load the entire model into GPU memory (at least 3x A100 or 8x 3090 GPUs). E.g., with offloading, you can only reach ~10 sec/step for sequential (non-parallel) generation.
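To make that requirement concrete, here's the back-of-the-envelope math behind those numbers (weights only, ignoring activations and attention caches):

```python
params = 176e9  # BLOOM has ~176B parameters

fp16_gb = params * 2 / 1e9  # 2 bytes/param in 16-bit -> ~352 GB
int8_gb = params * 1 / 1e9  # 1 byte/param in 8-bit  -> ~176 GB

print(f"fp16 weights: ~{fp16_gb:.0f} GB, int8 weights: ~{int8_gb:.0f} GB")
print(f"3x A100-80GB = {3 * 80} GB, 8x RTX 3090 (24 GB) = {8 * 24} GB")
# With 8-bit weights (~176 GB), 3x A100 (240 GB) fits comfortably and
# 8x 3090 (192 GB) just barely does -- hence the hardware floor above.
```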
A possible alternative is to use APIs, but they are paid and not always flexible (you can't adopt new fine-tuning/sampling methods or inspect hidden states). So, Petals comes to the rescue!
More details:
- Paper (with speed measurements): https://arxiv.org/abs/2209.01188
- GitHub repo: https://github.com/bigscience-workshop/petals
What do you think of it?
ReginaldIII t1_j02utp9 wrote
I've been looking at heterogeneous compute a lot lately for some tasks related to this sort of problem.
Are you assuming that all of your workers are trustworthy all of the time? Do you have any consideration for bad actors poisoning the training, encoding hidden/malicious data, or leaking training data out of your computation? I'd be interested to hear what you're doing to mitigate these threats, if you've looked into them.
Also related to trustworthiness is the question of fault tolerance. What mechanism are you using to pass and buffer chunks of inputs/outputs between workers? Do you ensure every chunk of data eventually gets processed by exactly one worker and that the results definitely make it to their destination, or is it a bit lossy for the sake of throughput?
I had been looking at chaining workers together using a mixture of local (on-worker) and global (centralized in the cloud) Kafka clusters to ensure every chunk of data eventually makes it through properly and nothing gets lost. A nice benefit of building on Kafka is that multiple consumers watching a topic can consume the same messages, so you can get voting by consensus on which results should be passed on.
Kafka also really helps with buffering and with workers' availability to receive work, without worrying that they'll drop incoming packets because they were busy at the time.
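Roughly the shape I had in mind (a minimal sketch using the kafka-python package; the broker address, topic names, and processing logic are placeholders, not anything from your setup):

```python
from kafka import KafkaConsumer, KafkaProducer

def process_chunk(payload: bytes) -> bytes:
    return payload[::-1]  # placeholder for the real worker computation

def run_voter(voter_id: str) -> None:
    # Each voter joins a *distinct* consumer group, so Kafka delivers
    # every message on the topic to every voter (fan-out per group),
    # which is what lets a downstream tally take the majority answer.
    consumer = KafkaConsumer(
        "chunks",
        bootstrap_servers="localhost:9092",
        group_id=f"voter-{voter_id}",
        enable_auto_commit=False,  # advance offsets only after the work is done
    )
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for msg in consumer:
        result = process_chunk(msg.value)
        # Publish the vote; a separate consumer on "votes" would tally
        # results per chunk key and forward the majority answer.
        producer.send("votes", key=msg.key, value=result)
        consumer.commit()  # commit-after-process => at-least-once delivery
```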
Interested to hear if you've hit any of these issues! :)