hx-zero OP t1_j03yfov wrote on December 13, 2022 at 10:12 PM

Reply to comment by ReginaldIII in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero

Yeah, we compared Petals to a server with 3x A100 running tensor-parallel code based on Megatron-DeepSpeed, see the green row in this table. The table also shows how Petals performance degrades if we have concurrent clients and how it compares to offloading.

Adding more servers usually doesn't make the inference speed significantly faster. New servers mostly help with the swarm capacity, so it can provide the speed of ~1 step/sec to a larger number of clients.

I don't think we've done any comparisons with Federated/Split learning systems since, as far as I understand, they mostly don't work well on models of that size (100B+ parameters). But let us know if there are such systems, maybe we will compare Petals to some of them.

ReginaldIII t1_j04del0 wrote on December 13, 2022 at 11:55 PM

Could you help me understand the split labels?

What specifically do you mean by "Offloading on 1x A100" ? Do you mean each chunk of work to do a forward pass is dispatched locally to a single GPU in sequence, but without the overheads of going through full on Petals?

Is there a difference between "Petals on 3 physical servers" and "Petals on 14 real servers" other than the number?

What do you mean by "Petals on 12 virtual servers, simulated on 3x A100" and also by "Same, but with 8 clients running simultaneously"?

Many thanks :)

hx-zero OP t1_j04g7yj wrote on December 14, 2022 at 12:15 AM

Sure!

Regarding offloading:

Offloading is another method for running large LMs when you don't have the GPU memory to fit the entire model. Imagine you have an A100 GPU with 80 GB memory and want to generate text with BLOOM, a 70-block transformer model with ~2.5 GB of weights per block. For each token, offloading will load the first 1/3 of the model (~23 blocks) from RAM/SSD to your GPU memory, run a forward pass through them, then free the memory and load the next 1/3, and so on.
The table shows that inference with offloading is very slow compared to Petals. That's because it involves copying hundreds of GB of block weights to your GPU memory to generate every new token in a sequence.
Even though Petals may send data to a server on a different continent over the Internet, it turns out that Petals is much faster since it just doesn't send much. It only sends activations, which are thousands of times smaller than the weights of one BLOOM block (and the weights are already loaded into a server's GPU memory).
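
As a rough back-of-the-envelope check of the numbers above, here is a minimal sketch. The block count, per-block weight size, and GPU memory come from this comment; the hidden size (14336) and 16-bit activations are assumptions added for illustration, and the function names are hypothetical. It is not a benchmark and ignores caches, batching, and network overhead.

```python
import math

# Figures quoted in the comment above.
GPU_MEMORY_GB = 80      # one A100
NUM_BLOCKS = 70         # BLOOM transformer blocks
GB_PER_BLOCK = 2.5      # approximate weights per block

# Assumptions for illustration (not stated in the comment).
HIDDEN_SIZE = 14336     # assumed hidden dimension of BLOOM-176B
BYTES_PER_VALUE = 2     # assumed 16-bit activations


def total_weights_gb():
    """Weights that offloading must copy to the GPU for every generated token."""
    return NUM_BLOCKS * GB_PER_BLOCK


def chunks_per_token():
    """Minimum number of load/free rounds if weights alone filled the GPU."""
    max_blocks_on_gpu = int(GPU_MEMORY_GB // GB_PER_BLOCK)
    return math.ceil(NUM_BLOCKS / max_blocks_on_gpu)


def activation_bytes_per_token():
    """Size of one token's hidden state, which is what gets sent between servers."""
    return HIDDEN_SIZE * BYTES_PER_VALUE


def block_to_activation_ratio():
    """How many times larger one block's weights are than one token's activations."""
    return GB_PER_BLOCK * 1e9 / activation_bytes_per_token()


if __name__ == "__main__":
    print(f"weights copied per token with offloading: {total_weights_gb()} GB")
    print(f"load/free rounds per token: {chunks_per_token()}")
    print(f"activations per token: {activation_bytes_per_token() / 1024:.0f} KiB")
    print(f"one block / one activation: ~{block_to_activation_ratio():,.0f}x")
```

Under these assumptions, offloading moves about 175 GB per token in three rounds, while a single token's activations are a few tens of kilobytes.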

Regarding "Petals on 3 physical servers" vs. "14 real servers":

The first setup is artificial: we use 3 high-end servers located in one room and simulate different latency/bandwidth restrictions for research purposes.
The second setup is realistic: we use 14 different servers with consumer-grade GPUs, spread across Europe and North America. So the GPUs are heterogeneous, latency may vary, we may have packet loss, etc.

Regarding "8 clients running simultaneously":

The other rows measure the performance of a client when it is the only one using a Petals swarm. This row shows how the performance degrades if we have 8 concurrent clients.

You can find these and other details of the experiments in our paper (the table I've sent is from its updated version that we haven't published yet).