ReginaldIII t1_j03wlpe wrote

Awesome, thanks for the details!

I like your reputation scaling idea, although dynamic reputation/trust scaling can be tricky to implement nicely in practice so I don't envy the task.

I think vote by consensus helps solve the problem, especially when your worker population is large enough that you can duplicate a lot of the work. But that does ultimately limit scaling efficiency as you add more worker nodes.
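To make the trade-off concrete, here is a minimal sketch of vote by consensus, assuming each task is sent to several workers and a result is accepted only if a strict majority agree. The function names and the strict-majority rule are illustrative assumptions, not how Petals or any particular system does it.

```python
from collections import Counter


def majority_vote(results):
    """Return the result reported by a strict majority of workers, else None.

    `results` is a list of hashable values (hypothetical: e.g. a hash of each
    worker's output for the same task).
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # Accept only if more than half of the workers agree.
    return value if count > len(results) / 2 else None


def distinct_tasks_per_round(num_workers, replication):
    """How many different tasks the pool can process at once when every task
    is duplicated on `replication` workers."""
    return num_workers // replication


if __name__ == "__main__":
    print(majority_vote(["abc", "abc", "xyz"]))  # two of three workers agree
    print(distinct_tasks_per_round(12, 3))       # replication cuts useful parallelism
```

With a replication factor of r, only about 1/r of the workers contribute new work in each round, which is the scaling cost mentioned above.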

Can I ask, have you done any scaling experiments for large models on samples per second or training steps per second with an increasing number of workers, compared to the gold standard environment of a proper HPC cluster running MPI for communication? And also against existing Federated and Split Learning systems?

I realize a crowd structured compute environment is not aiming to hit the raw performance of these environments but I think these scalability comparisons would give a strong baseline to compare off of, and also to see future improvements against.

3

hx-zero OP t1_j03yfov wrote

Yeah, we compared Petals to a server with 3x A100 running tensor-parallel code based on Megatron-DeepSpeed, see the green row in this table. The table also shows how Petals performance degrades if we have concurrent clients and how it compares to offloading.

Adding more servers usually doesn't make the inference speed significantly faster. New servers mostly help with the swarm capacity, so it can provide the speed of ~1 step/sec to a larger number of clients.

I don't think we've done any comparisons with Federated/Split learning systems since, as far as I understand, they mostly don't work well on models of that size (100B+ parameters). But let us know if there are such systems, and maybe we will compare Petals to some of them.

4

ReginaldIII t1_j04del0 wrote

Could you help me understand the split labels?

What specifically do you mean by "Offloading on 1x A100"? Do you mean each chunk of work to do a forward pass is dispatched locally to a single GPU in sequence, but without the overheads of going through full-on Petals?

Is there a difference between "Petals on 3 physical servers" and "Petals on 14 real servers" other than the number?

What do you mean by "Petals on 12 virtual servers, simulated on 3x A100" and also by "Same, but with 8 clients running simultaneously"?

Many thanks :)

1

hx-zero OP t1_j04g7yj wrote

Sure!

Regarding offloading:

  • Offloading is another method for running large LMs when you don't have the GPU memory to fit the entire model. Imagine you have an A100 GPU with 80 GB memory and want to generate text with BLOOM, a 70-block transformer model with ~2.5 GB of weights per block. For each token, offloading will load the first 1/3 of the model (~23 blocks) from RAM/SSD to your GPU memory, run a forward pass through them, then free the memory and load the next 1/3, and so on.

  • The table shows that inference with offloading is very slow compared to Petals. That's because it involves copying hundreds of GB of block weights to your GPU memory to generate every new token in a sequence.

  • Even though Petals may send data to a server on a different continent over the Internet, it turns out that Petals is much faster since it just doesn't send much. It only sends activations, which are thousands of times smaller than the weights of one BLOOM block (and the weights are already loaded onto a server's GPU).
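A rough back-of-the-envelope check of the two points above, as a sketch rather than a measurement. The block count and per-block size come from this comment; the hidden size of 14336 and 16-bit activations are assumptions added here for illustration and are not stated in the thread.

```python
NUM_BLOCKS = 70        # from the comment: BLOOM has 70 transformer blocks
GB_PER_BLOCK = 2.5     # from the comment: ~2.5 GB of weights per block
HIDDEN_SIZE = 14336    # assumption: BLOOM hidden size, not stated in the thread
BYTES_PER_VALUE = 2    # assumption: activations are sent as 16-bit floats


def offloading_gb_per_token(num_blocks=NUM_BLOCKS, gb_per_block=GB_PER_BLOCK):
    """Weights copied to the GPU per generated token when the whole model
    has to be streamed through GPU memory."""
    return num_blocks * gb_per_block


def activation_bytes_per_token(hidden_size=HIDDEN_SIZE,
                               bytes_per_value=BYTES_PER_VALUE):
    """Size of one token's hidden state, i.e. what is passed between blocks."""
    return hidden_size * bytes_per_value


def block_weights_to_activation_ratio():
    """How many times larger one block's weights are than one token's activations."""
    return GB_PER_BLOCK * 1e9 / activation_bytes_per_token()


if __name__ == "__main__":
    print(offloading_gb_per_token())             # 175.0 GB copied per token
    print(activation_bytes_per_token())          # 28672 bytes per token
    print(block_weights_to_activation_ratio() > 1000)
```

Under these assumptions, offloading moves on the order of 175 GB per token over the local bus, while a client-to-server hop carries only tens of kilobytes per token. Real traffic depends on batch size, sequence length, and any compression, so treat the ratio as an order-of-magnitude estimate.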

Regarding "Petals on 3 physical servers" vs. "14 real servers":

  • The first setup is artificial: we use 3 high-end servers located in one room and simulate different latency/bandwidth restrictions for research purposes.
  • The second setup is realistic: we use 14 different servers with consumer-grade GPUs, spread across Europe and North America. So the GPUs are heterogeneous, latency may vary, we may have packet loss, etc.

Regarding "8 clients running simultaneously":

  • Other rows measure the performance of a client if it uses a Petals swarm alone. This row shows how the performance degrades if we have 8 concurrent clients.

You can find these and other details of the experiments in our paper (the table I've sent is from its updated version, which we haven't published yet).

3