
hx-zero OP t1_j03tsul wrote

Regarding fault tolerance:

  • No chunks are lost: if a client has trouble sending chunks to or receiving them from a certain server, it will try other servers holding the necessary blocks until it gets a valid response.
  • We don't use any centralized queues like Kafka. Instead, the client code chooses and traverses servers by itself until it completes a full forward/backward pass. In this architecture, you can still make the client send the same request to multiple servers (if you want to validate servers' responses against each other or just get the response as soon as possible).
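
To make the two points above concrete, here is a minimal sketch of client-driven traversal with failover. It is not Petals' actual client code: `call_with_failover`, `forward_pass`, `fake_send`, the server names, and the shape of `route` are all hypothetical, and the "activation" is just an integer so the example stays runnable.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def call_with_failover(
    holders: Sequence[str],
    send: Callable[[str], T],
    is_valid: Callable[[T], bool],
) -> T:
    """Try each server holding the needed blocks until one answers validly."""
    failures: List[str] = []
    for server in holders:
        try:
            response = send(server)
        except (ConnectionError, TimeoutError) as exc:
            failures.append(f"{server}: {exc}")
            continue  # network trouble: move on to the next holder
        if is_valid(response):
            return response
        failures.append(f"{server}: invalid response")
    raise RuntimeError("no valid response; tried " + "; ".join(failures))


def forward_pass(route, hidden, send_to, is_valid):
    """Client-driven traversal: no central queue, the client walks the route.

    `route` is a list of (block_range, servers_holding_that_range) pairs.
    """
    for block_range, holders in route:
        hidden = call_with_failover(
            holders,
            lambda server, b=block_range, h=hidden: send_to(server, b, h),
            is_valid,
        )
    return hidden


# Toy stand-in for a remote call (hypothetical): "server-a" is unreachable,
# every other server "runs" a block range by adding its length to an integer.
def fake_send(server: str, block_range: Tuple[int, int], hidden: int) -> int:
    if server == "server-a":
        raise ConnectionError("unreachable")
    start, end = block_range
    return hidden + (end - start)


route = [((0, 35), ["server-a", "server-b"]), ((35, 70), ["server-c"])]
result = forward_pass(route, 0, fake_send, lambda r: isinstance(r, int))
print(result)
```

The client skips the unreachable server and still completes the pass through all 70 blocks. If every holder of some block range fails, the sketch raises an error rather than silently dropping the chunk.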

Regarding security & privacy:

  • Peers only exchange tensors (activations, gradients) serialized with safe protocols and ask each other to run pre-defined BLOOM blocks on them. They never send code to each other, so no one can execute their own code on your computer.
  • It may be possible for peers serving model layers to recover input data and model outputs, or to modify the outputs maliciously. That's why, in the repo & notebook, we currently ask people never to use the public swarm for sensitive data, or for anything beyond pet projects/research. Instead, you can set up a private Petals swarm hosted by people/orgs you trust. For example, several small companies/labs may collaborate and set up a private swarm to protect their data from others, while still getting the benefits of Petals.
  • Still, we plan to improve the security of the public swarm in the future:
    • (a) We plan to add an option for the client to send the same request to several servers and identify discrepancies (if any).
    • (b) We're working on a reputation system, so that a server that returns invalid outputs loses its reputation and won't be chosen by clients again. Invalid outputs can be reported by clients or detected by special "anti-fraud" nodes that periodically validate the servers' outputs.
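
As a rough illustration of plans (a) and (b), the sketch below compares the outputs that several servers return for the same request and penalizes the outliers. Everything here is an assumption made for the example, not the planned Petals design: the majority-agreement rule, the tolerance, the `Reputation` class, and its `penalty`/`threshold` values are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def close(a: Sequence[float], b: Sequence[float], tol: float) -> bool:
    """Element-wise comparison with an absolute tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def find_discrepancies(
    outputs: Dict[str, Sequence[float]], tol: float = 1e-3
) -> Tuple[List[str], List[str]]:
    """Split servers into (trusted, suspects) by strict-majority agreement."""
    servers = list(outputs)
    # For each server, count how many servers (itself included) it matches.
    agree = {
        s: sum(close(outputs[s], outputs[t], tol) for t in servers)
        for s in servers
    }
    if max(agree.values()) * 2 <= len(servers):
        return [], []  # no strict majority: inconclusive, blame nobody
    reference = max(servers, key=agree.get)
    trusted = [s for s in servers if close(outputs[s], outputs[reference], tol)]
    suspects = [s for s in servers if s not in trusted]
    return trusted, suspects


class Reputation:
    """Toy reputation table: reported servers drop below the selection bar."""

    def __init__(self, initial=1.0, penalty=0.5, threshold=0.6):
        self.scores = defaultdict(lambda: initial)
        self.penalty = penalty
        self.threshold = threshold

    def report_invalid(self, server: str) -> None:
        self.scores[server] *= self.penalty

    def eligible(self, servers: Sequence[str]) -> List[str]:
        return [s for s in servers if self.scores[s] >= self.threshold]


# Same request sent to three servers; "s3" returns something different.
outputs = {"s1": [0.10, 0.20], "s2": [0.10, 0.2001], "s3": [9.0, 9.0]}
trusted, suspects = find_discrepancies(outputs)

reputation = Reputation()
for server in suspects:
    reputation.report_invalid(server)
eligible = reputation.eligible(list(outputs))
print(trusted, suspects, eligible)
```

A real system would also need to handle non-determinism across GPU types and protect the reporting channel itself from abuse, which this sketch ignores.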

ReginaldIII t1_j03wlpe wrote

Awesome, thanks for the details!

I like your reputation scaling idea, although dynamic reputation/trust scaling can be tricky to implement nicely in practice so I don't envy the task.

I think vote by consensus helps solve the problem, especially when your worker population is large enough that you can duplicate a lot of the work. But that does ultimately limit how efficiently you scale as you add worker nodes.

Can I ask, have you done any scaling experiments for large models on samples per second or training steps per second with an increasing number of workers, compared to the gold standard environment of a proper HPC cluster running MPI for communication? And also against existing Federated and Split Learning systems?

I realize a crowd structured compute environment is not aiming to hit the raw performance of these environments but I think these scalability comparisons would give a strong baseline to compare off of, and also to see future improvements against.


hx-zero OP t1_j03yfov wrote

Yeah, we compared Petals to a server with 3x A100 running tensor-parallel code based on Megatron-DeepSpeed, see the green row in this table. The table also shows how Petals performance degrades if we have concurrent clients and how it compares to offloading.

Adding more servers usually doesn't make the inference speed significantly faster. New servers mostly help with the swarm capacity, so it can provide the speed of ~1 step/sec to a larger number of clients.

I don't think we've done any comparisons with Federated/Split learning systems since, as far as I understand, they mostly don't work well on models of that size (100B+ parameters). But let us know if there are such systems; maybe we will compare Petals to some of them.


ReginaldIII t1_j04del0 wrote

Could you help me understand the split labels?

What specifically do you mean by "Offloading on 1x A100" ? Do you mean each chunk of work to do a forward pass is dispatched locally to a single GPU in sequence, but without the overheads of going through full on Petals?

Is there a difference between "Petals on 3 physical servers" and "Petals on 14 real servers" other than the number?

What do you mean by "Petals on 12 virtual servers, simulated on 3x A100" and also by "Same, but with 8 clients running simultaneously"?

Many thanks :)


hx-zero OP t1_j04g7yj wrote

Sure!

Regarding offloading:

  • Offloading is another method for running large LMs when you don't have the GPU memory to fit the entire model. Imagine you have an A100 GPU with 80 GB memory and want to generate text with BLOOM, a 70-block transformer model with ~2.5 GB of weights per block. For each token, offloading will load the first 1/3 of the model (~27 blocks) from RAM/SSD to your GPU memory, run a forward pass through them, then free the memory and load the next 1/3, and so on.

  • The table shows that inference with offloading is very slow compared to Petals. That's because it involves copying hundreds of GB of block weights to your GPU memory to generate every new token in a sequence.

  • Even though Petals may send data to a server on a different continent over the Internet, it turns out that Petals is much faster because it sends very little data. It only sends activations, which are thousands of times smaller than the weights of one BLOOM block (and the weights are already loaded onto the server's GPU).
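
A back-of-the-envelope calculation shows the scale of the difference. The block count and per-block weight size come from the explanation above; the hidden size (14336, which I believe matches BLOOM-176B), the 16-bit activation width, and the 16 GB/s host-to-GPU bandwidth are assumptions added for illustration only.

```python
NUM_BLOCKS = 70              # from the comment above
BLOCK_WEIGHTS_GB = 2.5       # from the comment above (approximate)

HIDDEN_SIZE = 14336          # assumption: BLOOM-176B hidden dimension
BYTES_PER_VALUE = 2          # assumption: 16-bit activations
HOST_TO_GPU_GB_PER_S = 16.0  # assumption: illustrative RAM-to-GPU bandwidth

# Offloading: every block's weights pass through the GPU for each new token.
weights_per_token_gb = NUM_BLOCKS * BLOCK_WEIGHTS_GB
offload_copy_seconds = weights_per_token_gb / HOST_TO_GPU_GB_PER_S

# Petals: only one token's activation vector crosses the network per hop.
activation_bytes = HIDDEN_SIZE * BYTES_PER_VALUE
block_to_activation_ratio = BLOCK_WEIGHTS_GB * 1e9 / activation_bytes

print(f"weights copied per token (offloading): {weights_per_token_gb:.0f} GB")
print(f"copy time per token at assumed bandwidth: {offload_copy_seconds:.1f} s")
print(f"activation per token per hop: {activation_bytes / 1024:.0f} KiB")
print(f"one block's weights / one activation: {block_to_activation_ratio:,.0f}x")
```

Under these assumptions, weight copying alone costs several seconds per token, while a single-token activation is a few tens of kilobytes. Network latency per hop, not bandwidth, then becomes the main cost for Petals.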

Regarding "Petals on 3 physical servers" vs. "14 real servers":

  • The first setup is artificial: we use 3 high-end servers located in one room and simulate different latency/bandwidth restrictions for research purposes.
  • The second setup is realistic: we use 14 different servers with consumer-grade GPUs, spread across Europe and North America. So the GPUs are heterogeneous, latency may vary, we may have packet loss, etc.

Regarding "8 clients running simultaneously":

  • Other rows measure the performance of a client if it uses a Petals swarm alone. This row shows how the performance degrades if we have 8 concurrent clients.

You can find these and other details of the experiments in our paper (the table I've sent is from an updated version that we haven't published yet).
