Submitted by hx-zero t3_zl03b0 in MachineLearning

We made a library for inference/fine-tuning of open 175B+ language models (like BLOOM) using Colab or a desktop GPU. You join forces with other people over the Internet (BitTorrent-style), each running a small part of the model's layers. Check out our Colab example!

The thing is, even though the BLOOM weights were publicly released, it was extremely difficult to run inference efficiently unless you had enough hardware to load the entire model into GPU memory (at least 3x A100 or 8x 3090 GPUs). E.g., with offloading, you can only reach ~10 sec/step for sequential (non-parallel) generation.

A possible alternative is to use APIs, but they are paid and not always flexible (you can't adopt new fine-tuning/sampling methods or take a look at hidden states). So, Petals comes to the rescue!

This is how Petals works: some peers want to use a pretrained LM to solve various tasks involving text in natural or programming languages. They do it with the help of other peers, who hold subsets of the model's layers on their GPUs.

More details:

What do you think of it?

273

Comments


ReginaldIII t1_j02utp9 wrote

I've been looking at heterogeneous compute a lot lately for some tasks related to this sort of problem.

Are you assuming that all of your workers are trustworthy all of the time? Do you have any consideration for bad actors poisoning the training? Or potentially encoding hidden/malicious data or leaking training data out of your computation? I'd be interested to hear what you are doing to mitigate these threats if you are looking at them.

Also, related to trustworthiness, is the question of fault tolerance. What mechanism are you using to pass and buffer chunks of inputs/outputs between workers? Do you ensure every chunk of data eventually gets processed by exactly one worker and the results definitely make it to their destination or is it a bit lossy for the sake of throughput?

I had been looking at chaining workers together using a mixture of local (on-worker) and global (centralized in the cloud) Kafka clusters to ensure every chunk of data does eventually make it through properly and nothing gets lost. A nice benefit of building on Kafka is that multiple consumers looking at a queue can consume the same messages, so you can get voting by consensus on what the results to be passed on should be.

Kafka also really helps deal with buffering and the availability of your workers to receive work, without worrying whether they are going to drop incoming packets because they were busy at the time.
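
For concreteness, here's a rough sketch of the consumer-group trick I mean (the topic name and message format are hypothetical, and it assumes the kafka-python package): validators in different consumer groups all see every chunk, so their outputs can be compared downstream.

```python
import json
from kafka import KafkaConsumer

def run_validator(group_id: str) -> None:
    consumer = KafkaConsumer(
        "activation-chunks",                 # hypothetical topic of work chunks
        group_id=group_id,                   # distinct group => this validator sees every message
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    for msg in consumer:
        chunk = msg.value
        # Placeholder for the real forward/backward computation on this chunk.
        result = sum(chunk.get("activations", []))
        print(f"[{group_id}] chunk {chunk.get('id')} -> {result}")

# Running run_validator("validator-a") and run_validator("validator-b") in two
# separate processes duplicates the work, which is what enables vote-by-consensus.
```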

Interested to hear if you've hit any of these issues! :)

27

hx-zero OP t1_j03tsul wrote

Regarding fault tolerance:

  • No chunk losses involved — if a client has trouble sending/receiving chunks from a certain server, it will try other servers holding the necessary blocks until it gets a valid response.
  • We don't use any centralized queues like Kafka; instead, the client code chooses and traverses servers by itself until it completes a full forward/backward pass. In this architecture, you can still make the client send the same request to multiple servers (if you want to validate servers' responses against each other or just get the response as soon as possible). A rough sketch of the retry/traversal logic is below.
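
To illustrate the idea (this is a simplified sketch, not the actual Petals client code; the server objects and their `holds`/`forward` methods are hypothetical):

```python
def forward_through_blocks(activations, block_range, known_servers):
    """Try every server holding `block_range` until one returns a valid response."""
    candidates = [s for s in known_servers if s.holds(block_range)]   # hypothetical method
    for server in candidates:
        try:
            return server.forward(block_range, activations)           # hypothetical remote call
        except (ConnectionError, TimeoutError):
            continue  # this server failed or timed out; move on to the next candidate
    raise RuntimeError(f"no server could process blocks {block_range}")
```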

Regarding security & privacy:

  • Peers only exchange tensors (activations, gradients) serialized with safe protocols and ask each other to run pre-defined BLOOM blocks on them. They never send code to each other, so no one can execute their own code on your computer.
  • It may be possible for peers serving model layers to recover input data and model outputs, or to modify the outputs in a malicious way. That's why the repo & notebook currently ask you to never use the public swarm for sensitive data (i.e., for anything beyond pet projects/research). Instead, you can set up a private Petals swarm hosted by people/orgs you trust. For example, several small companies/labs may collaborate and set up a private swarm to protect their data from others, while still getting the benefits of Petals.
  • Still, we have plans to improve the security of the public swarm in the future:
    • (a) We plan to add an option for the client to send the same request to several servers and identify discrepancies, if any (a rough sketch of this is below).
    • (b) We're working on a reputation system, so a server that returns invalid outputs loses its reputation and won't be chosen by clients again. Invalid outputs can be reported by clients or detected by special "anti-fraud" nodes that periodically validate the various servers' outputs.
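
A rough sketch of what (a) could look like on the client side (hypothetical helpers, not the actual Petals implementation; `forward` is again a stand-in for the remote call):

```python
import torch

def report_discrepancy(server):
    # Placeholder: a real reputation system would lower this server's score here.
    print(f"output mismatch from {server}")

def forward_with_validation(activations, block_range, servers, atol=1e-4):
    """Send the same request to several servers and flag any that disagree."""
    outputs = [s.forward(block_range, activations) for s in servers]   # hypothetical remote call
    reference = outputs[0]
    for server, out in zip(servers[1:], outputs[1:]):
        if not torch.allclose(reference, out, atol=atol):
            report_discrepancy(server)
    return reference
```
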
20

ReginaldIII t1_j03wlpe wrote

Awesome, thanks for the details!

I like your reputation-scaling idea, although dynamic reputation/trust scaling can be tricky to implement nicely in practice, so I don't envy the task.

I think vote-by-consensus helps solve the problem, especially when your worker population is high enough that you can duplicate a lot of the work. But that does ultimately limit scaling efficiency as you add more worker nodes.

Can I ask, have you done any scaling experiments for large models on samples per second or training steps per second with an increasing number of workers, compared to the gold standard environment of a proper HPC cluster running MPI for communication? And also against existing Federated and Split Learning systems?

I realize a crowd-sourced compute environment is not aiming to hit the raw performance of these environments, but I think these scalability comparisons would give a strong baseline to compare against, and a way to measure future improvements.

3

hx-zero OP t1_j03yfov wrote

Yeah, we compared Petals to a server with 3x A100 running tensor-parallel code based on Megatron-DeepSpeed; see the green row in this table. The table also shows how Petals' performance degrades if we have concurrent clients and how it compares to offloading.

Adding more servers usually doesn't make the inference speed significantly faster. New servers mostly help with the swarm capacity, so it can provide the speed of ~1 step/sec to a larger number of clients.

I don't think we've done any comparisons with federated/split learning systems since, as far as I understand, they mostly don't work well on models of this size (100B+ parameters). But let us know if there are such systems; maybe we will compare Petals to some of them.

4

ReginaldIII t1_j04del0 wrote

Could you help me understand the split labels?

What specifically do you mean by "Offloading on 1x A100"? Do you mean each chunk of work for a forward pass is dispatched locally to a single GPU in sequence, but without the overhead of going through full-on Petals?

Is there a difference between "Petals on 3 physical servers" and "Petals on 14 real servers" other than the number?

What do you mean by "Petals on 12 virtual servers, simulated on 3x A100", and also by "Same, but with 8 clients running simultaneously"?

Many thanks :)

1

hx-zero OP t1_j04g7yj wrote

Sure!

Regarding offloading:

  • Offloading is another method for running large LMs when you don't have the GPU memory to fit the entire model. Imagine you have an A100 GPU with 80 GB memory and want to generate text with BLOOM, a 70-block transformer model with ~2.5 GB of weights per block. For each token, offloading will load the first 1/3 of the model (~27 blocks) from RAM/SSD into your GPU memory, run a forward pass through them, then free the memory and load the next 1/3, and so on.

  • The table shows that inference with offloading is very slow compared to Petals. That's because it involves copying hundreds of GB of block weights to your GPU memory just to generate every new token in a sequence.

  • Even though Petals may send data to a server on a different continent over the Internet, it turns out to be much faster since it just doesn't send much. It only sends activations, which are thousands of times smaller than the weights of one BLOOM block (and the weights are already loaded on the server's GPU). Some back-of-the-envelope numbers are below.
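
Rough numbers behind this (approximate and purely illustrative; the CPU-to-GPU bandwidth and hidden size here are my assumptions, not measurements from the paper):

```python
blocks = 70
gb_per_block = 2.5
model_gb = blocks * gb_per_block                  # ~175 GB of weights in total

# Offloading: every generated token streams all block weights through the GPU.
# Assuming ~16 GB/s of effective CPU-to-GPU bandwidth (an optimistic guess):
pcie_gb_per_s = 16
print(model_gb / pcie_gb_per_s)                   # ~11 sec per token, matching the table

# Petals: only activations travel between servers. With a hidden size of 14336
# and fp16 values, a single token's activation per hop is tiny:
hidden_size = 14336
print(hidden_size * 2 / 1e6)                      # ~0.03 MB per token per hop
```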

Regarding "Petals on 3 physical servers" vs. "14 real servers":

  • The first setup is artificial: we use 3 high-end servers located in one room and simulate different latency/bandwidth restrictions for research purposes.
  • The second setup is realistic: we use 14 different servers with consumer-grade GPUs, spread across Europe and North America. So the GPUs are heterogeneous, latency may vary, we may have packet loss, etc.

Regarding "8 clients running simultaneously":

  • Other rows measure the performance of a client if it uses a Petals swarm alone. This row shows how the performance degrades if we have 8 concurrent clients.

You can find these and other details of the experiments in our paper (the table I've sent is from an updated version that we haven't published yet).

3

w00t_loves_you t1_j03qps0 wrote

Would it be possible to repeat the same training tasks on multiple workers and verify the workers against each other?

OTOH, it's more work to create a malicious worker than to create a malicious free LM, no?

5

ReginaldIII t1_j03sbkj wrote

> Would it be possible to repeat the same training tasks on multiple workers and verify the workers against each other?

That's what I meant here.

>> A nice benefit of building on kafka is that multiple consumers looking at a queue can consume the same messages such that you can get voting by consensus for what the results to be passed on should be.


> OTOH it's more work to create a malicious worker than creating a malicious free LM, no?

Different types of malicious. A malicious worker could leak the data it's passed to someone else, or it could work to destabilize the training, limiting final accuracy or causing overfitting.

If you are a company brokering access to privately trained LLMs and you have the opportunity to prevent a crowd-sourced LLM from reaching the same quality as your own, there could exist an incentive to harm that effort. Corporate espionage is a thing.

There are plenty of ways in which a crowd-computing effort could be misused or attacked.

3

bacteriarealite t1_j04qfpv wrote

Interesting that a lot of those problems could be solved by an implementation run on a blockchain. Wondering if anyone knows of efforts that have tried that?

−3

ReginaldIII t1_j04vb36 wrote

None of the issues I have raised can be solved by using blockchains.

We don't need to prove immutably that every chunk of work was processed for later auditing. We need to make sure every chunk gets processed "right now" as it is happening.

Blockchains do not present a solution to fault tolerance, they present a solution to auditing.

Blockchains also don't present a solution to trustworthiness here. In the same way that a wallet being present in a transaction on the blockchain says nothing about the real identity of the parties, nor does it say anything about whether the goods or services the transaction was for were carried out honestly.

Chunks of work encoded on the blockchain would tell you nothing about whether the activations and gradients computed were correct or manipulated, it would only tell you that they had in fact happened.

7

race2tb t1_j07gm1g wrote

Directly, no, but indirectly it would incentivise actors to come up with better solutions to these problems. Depending on altruism is a surefire way to make sure a project inevitably dies.

1

bacteriarealite t1_j04xqbi wrote

Blockchain technology would absolutely solve the issue of trusting your workers. Why else do people invest millions in mining rigs? Because of a system of decentralized trust built on the blockchain, where they won't gain any benefit from trying to create fake/malicious blocks. It would both incentivize people to donate their resources and create a cryptographically secured system, so you can trust the results you are getting. You don't need everyone on chain to rerun the analysis; that's just one form of validation. All you need is a system of nodes that trust other local nodes through periodic validation. It may require more resources, but you've solved both the issue of trust and of incentivizing work, which negates any increased burden from the periodic repeat validation.

−2

ReginaldIII t1_j050qmd wrote

Explain to me the mechanism by which you would encode the "correctness" of a result as a transaction or even smart contract on an idealized blockchain.

> Blockchain technology would absolutely accomplish the issue of trusting your workers. Why else do people invest millions in mining rigs? Because of a system of decentralized trust built on the blockchain, where they won’t gain any benefit from trying to create fake/malicious blocks.

We are talking about "trusting" fundamentally different things. A blockchain would be able to encode that at a specific point in time a worker going by some name returned something. It would be immutably stored in the blockchain, such that in the future we can look back and say "Yes, at that specific point in time a worker going by that name returned something".

And that tells us nothing about whether that worker returned the "correct" result, or a manipulated one.

I am talking about the case where the worker has returned the value it proposes as the result, and we care about having a mechanism to trust that the value itself is "correct" and that, therefore, the worker has, at least this time, acted in a trustworthy fashion.

So if I am missing something, please, explain to me the mechanism by which you would encode the "correctness" of a set of activations and gradients for a chunk of work on a blockchain?

5

bacteriarealite t1_j052zlu wrote

> And that tells us nothing about whether that worker returned the "correct" result, or a manipulated one.

It actually does. A worker that creates a fake block would need a consensus of the nodes on the chain to verify that block to get it added. That's precisely how blockchain technology creates trust - it has nothing to do with the ledger being public; it's about having a consensus of nodes verifying the cryptographic signature on a block before adding it to the chain, and then growing along that consensus chain so that it eventually becomes computationally insurmountable to rewind the chain back to the fake block you are trying to create.

The most obvious solution with respect to our discussion here is requiring that every node validates the finding. It's easy to understand how that could create a trusted system, but it's obviously useless. So, alternatively, you could make validators just verify that you're in a local minimum while the original validation evaluates a more global feature space. Or, as I said before, another option is to have more local validation sectors where you trust people in your local network because of confirmed results.

And I'm not trying to say I have the solution here, but I think it's pretty obvious that blockchain technology solves these problems with just some tinkering with the mechanisms of consensus and chain building.

−2

ReginaldIII t1_j053c4i wrote

I earnestly believe it solves problems that contain similar words. But it just does not present a practical solution to this problem.

We can't put the returned values on the blockchain. It just isn't possible to store them; they are too big and too many. And there's no reason to store them: we only want to pass them on to the next worker or workers that immediately need them. We do care about fault tolerance, to make sure they get to their destination.

So there's no way for this pool of blockchain nodes to form a consensus over the returned values being "correct" like this. We can't put the relevant information on the blockchain to allow it to be compared.

What you end up with is just a classic non-blockchain vote-by-agreement system between workers of unknown trustworthiness. No blockchain needed.

You are correct that voting by consensus is needed; you just don't need all the rest of the things that turn that into a blockchain.

6

[deleted] t1_j054kbq wrote

[deleted]

1

ReginaldIII t1_j054n38 wrote

Please read my updated comment.

> I think the use case here is pretty obvious

With the greatest of respect, I don't.

> and I tried to just give some basic examples but I’m certainly not an expert and have not been involved in the types of troubleshooting required to get something like this working.

Also with the greatest of respect, I am an expert in this area, and have also worked with blockchains extensively.

I do not think blockchain is a "stream of buzzwords". I think it is the wrong tool to solve "this" problem.

2

kaibee t1_j05omff wrote

> I earnestly believe it solves problems that contain similar words. But it just does not present a practical solution to this problem.
>
> We can't put the returned values on the blockchain. It just isn't possible to store them; they are too big and too many. And there's no reason to store them: we only want to pass them on to the next worker or workers that immediately need them. We do care about fault tolerance, to make sure they get to their destination.
>
> So there's no way for this pool of blockchain nodes to form a consensus over the returned values being "correct" like this. We can't put the relevant information on the blockchain to allow it to be compared.
>
> What you end up with is just a classic non-blockchain vote-by-agreement system between workers of unknown trustworthiness. No blockchain needed.
>
> You are correct that voting by consensus is needed; you just don't need all the rest of the things that turn that into a blockchain.

This is basically solving a very similar problem. https://rendertoken.com/#intro

1

ReginaldIII t1_j06mqeo wrote

In rendertoken's scenario, we don't have a requirement for high throughput of one job feeding into another.

The individual units of work are expensive and long lived. Rendering a frame of a film takes roughly the same amount of time it did a few years ago, we just get higher fidelity output for that same render budget. All the frames can be processed lazily by the compute farm, and the results just go into a pool for later collection.

Because the collation of the results happens in a more offline fashion from the actual computation, you have time and resources to encode the results on a blockchain. Auditing that your requested work was processed is a desirable quality, and so a blockchain does provide a benefit.

In the case of distributed model training, the scenario is different. We have high throughput of comparatively small chunks of work. Other than passing the results to the next immediate worker to do the next part of the computation, we have no desire (or storage capacity) to keep any of the intermediate results. And because we have high throughput of many small chunks, a blockchain encoding these chunks would need a small proof of work, and so would not be a reliable source of truth anyway.

Then consider that we don't even care about having an audit trail to prove historical chunks really were processed when we think they were. We only care about checking results are valid on the fly as we are doing the compute.

We just need a vote by agreement on the immediate results so they can be handed off to the next workers. Yes, blockchains often have a vote-by-agreement component in how they decide what the actual state of the blockchain is, but we need just that part. We don't actually need the blockchain itself.

2

bacteriarealite t1_j0561xo wrote

You don't need to put any model information on the blockchain. The point of the blockchain is to create a network of trust tied to computational work. All we want from the blockchain is to be able to say that we trust this node and that it's providing work. There are many ways you could then go about setting that up and debating the nuanced details of what would work best. But the utility of blockchain is pretty simple: we want decentralized work, and we want decentralized trust. Blockchain is the only technology that does that. In fact, blockchain is really just a synonym for those two things. So when, in your first post, you questioned how we would be able to trust this decentralized work, the answer is simple: blockchain. The details past that have million-dollar answers, but the underlying principle is pretty straightforward.

−1

ReginaldIII t1_j056gyk wrote

Okay. We are going in circles now, and I've responded to these points at length.

The burden on you is to now flesh this idea out and show it can work in practice for this problem.

I will not be pursuing this avenue.

Good evening.

2

bacteriarealite t1_j057m7a wrote

The burden on me was just to point out that when someone is looking for a way to create decentralized trust in a system of decentralized work, the solution is blockchain. My point in the first comment wasn't to hash out all the nuanced details of how that would work in practice; I was just pointing out that if you're looking to create trust in a decentralized system, then the best (and honestly only) way to do that is blockchain.

−2

JohnnyTangCapital t1_j03a03h wrote

This is an awesome idea, excited to see where this goes!

6

_Arsenie_Boca_ t1_j03flw7 wrote

Fascinating! I wonder what Google has to say about Colab being used that way.

3

ReginaldIII t1_j04yp5d wrote

They banned a lot of people that were running BOINC nodes in their colab sessions.

That sort of work sharing is pretty much in the same spirit as this.

5

hx-zero OP t1_j05i3j5 wrote

A Petals client does not allow others to use your GPU by default; you need to explicitly run a Petals server (a separate program) for this.

In the Colab example, we only run the client, so its GPU can't be used by anyone besides the user directly running the notebook.

3

ReginaldIII t1_j06l171 wrote

Sorry, this is what I meant. I assumed OP was asking about running workers on Colab.

Running the client would of course be no different than running anything that needs to draw in assets from the network.

2

race2tb t1_j032p2z wrote

This is a great idea. I really hope it is a big success and gains lots of traction. I can see crypto chains popping up around large model creation eventually.

2

w00t_loves_you t1_j03r3z7 wrote

I don't see how bolting crypto to the side will improve anything? It won't be able to directly help with training, right? And if it's just for paying, any cheap and fast payment system will work?

8

thecodethinker t1_j052uye wrote

Not a defense of decentralized finance, but the point of the whole blockchain system is to ensure that a group of arbitrary nodes can agree on the state of some data.

You can use blockchain technology without having a coin, token, or any of that nonsense. Just not in the way people have been talking about it.

Though blockchains would probably be too slow for something like this.

5

ReginaldIII t1_j06nan5 wrote

> Though blockchains would probably be too slow for something like this.

This is the key point. Blockchains give a confidence bound on trustworthiness by being too slow-moving and computationally expensive to manipulate. This is vital when proving that a historical audit trail is correct and immutable.

It just isn't important or applicable for high throughput applications where you just care about local immediate correctness of intermediate results.

To quote one of my other comments in this thread

> Blockchains also don't present a solution to trustworthiness here. In the same way that a wallet being present in a transaction on the blockchain says nothing about the real identity of the parties, nor does it say anything about whether the goods or services the transaction was for were carried out honestly.

We care about whether or not you got ripped off by the guy you gave money to (the GPU you gave data to). We don't care about proving you did actually give them the money at a specific point in time.

2

Acceptable-Cress-374 t1_j03yizm wrote

Could you use this to run inference on GPT-NeoX using 2-4 computers w/ 3090s? IIRC it requires ~40 GB of VRAM at inference, and multiples of that for fine-tuning...

2

hx-zero OP t1_j03zy85 wrote

Yes, it's technically possible to integrate GPT-NeoX in our code instead of BLOOM (requires some work, but it's not too hard).

Also, it may be possible to fit GPT-NeoX into 20 GB of VRAM (i.e., one 3090) using the recent LLM.int8() work: https://huggingface.co/blog/hf-bitsandbytes-integration

We use this approach to make BLOOM consume as little memory as possible in Petals.
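
For reference, 8-bit loading with transformers + bitsandbytes looks roughly like this (the model name is just an example, and exact flags may differ between library versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"   # example model; any causal LM with 8-bit support works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",     # place layers on the available GPU(s) automatically
    load_in_8bit=True,     # LLM.int8() quantization via bitsandbytes
)

inputs = tokenizer("Distributed inference is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```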

6

londons_explorer t1_j03uyis wrote

People are talking about malicious clients...

But I wonder more about accidentally evil clients. For example, someone modding the code accidentally connects to the public swarm and returns NaN for every request.

Can we have an API which returns the git hash that it's running from and detects if the working directory is clean?
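
Something like this would be enough on the honest-client side (a rough sketch using GitPython, which is an assumed dependency here; this isn't an existing Petals API):

```python
import git  # GitPython

def report_code_version(repo_path: str = ".") -> dict:
    repo = git.Repo(repo_path, search_parent_directories=True)
    return {
        "commit": repo.head.commit.hexsha,
        "dirty": repo.is_dirty(untracked_files=True),
    }

print(report_code_version())
```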

1

ReginaldIII t1_j04y036 wrote

That's what I mean by malicious clients.

You'd be relying on the malicious client to self-report the git hash of the code it is "running". It can just lie to you.

The only defence is duplicating each computation across multiple workers in the pool and having them compare results: the most common result wins.
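
In sketch form (illustrative only; real activations would need an approximate comparison rather than exact equality):

```python
from collections import Counter

def majority_vote(results: list):
    """Return the most common result among duplicated computations."""
    (winner, _count), = Counter(results).most_common(1)
    return winner

outputs = ["0xabc", "0xabc", "0xdef"]   # three workers, one disagrees
print(majority_vote(outputs))           # "0xabc"
```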

0

londons_explorer t1_j0642cz wrote

I suspect there will be far more accidentally malicious clients (because someone is experimenting with the code) than deliberately evil clients.

In fact, as long as the percentage of deliberately evil clients is small, and the weight updates they send have capped magnitudes, I suspect you can just ignore the problem - they won't have a substantial effect on the results.

1

randyzmzzzz t1_j04c6jk wrote

Is this federated learning?

1

hx-zero OP t1_j04dewf wrote

Not really: federated learning focuses on data privacy (and doesn't usually involve huge models), while Petals focuses on making it possible to run a huge model without having many resources yourself (and doesn't give data privacy guarantees).

3

TrueBirch t1_j078kbj wrote

I have a tangential question. I work in a 1,000-person corporation. Lots of employees keep their computers running 24/7. Do you think there's potential in running models in a distributed manner across corporate LANs?

1

hx-zero OP t1_j07df6n wrote

I think this is reasonable if these computers have GPUs.

1

SleekEagle t1_j07bxyi wrote

I thought distributed training over the internet was prohibitively slow due to communication overhead - wouldn't you run into the same issue when fine-tuning? If anyone could ELI5 why/how this works that would be awesome!

1

hx-zero OP t1_j07d431 wrote

Training from scratch is slow because you need to synchronize all model weights/gradients on each step (though it's possible for somewhat smaller models with some optimizations).

In the case of fine-tuning (especially prompt tuning), you train only a small percentage of the weights, so the communication overhead is no longer that large. Still, this is enough to adapt the LM to most downstream tasks.
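
To give a feel for the numbers (an illustrative sketch, not the Petals API; the soft-prompt length is just an example value):

```python
import torch

hidden_size = 14336           # BLOOM's hidden size
num_prompt_tokens = 16        # trainable soft-prompt length; an example value

# Only this small tensor is trained, so only its activations/gradients cross
# the network; the 176B frozen weights stay on the servers.
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(num_prompt_tokens, hidden_size))
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

trainable = soft_prompt.numel()           # ~230K trainable parameters
total = 176_000_000_000                   # total BLOOM parameters
print(f"training {trainable / total:.8%} of the model's parameters")
```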

2