Viewing a single comment thread. View all comments

hx-zero OP t1_j04g7yj wrote

Sure!

Regarding offloading:

  • Offloading is another method for running large LMs when you don't have the GPU memory to fit the entire model. Imagine you have an A100 GPU with 80 GB memory and want to generate text with BLOOM, a 70-block transformer model with ~2.5 GB of weights per block. For each token, offloading will load the first 1/3 of the model (~27 blocks) from RAM/SSD to your GPU memory, run a forward pass through them, then free the memory and load the next 2/3, and so on.

  • The table shows that inference with offloading is very slow compared to Petals. That's because it involves copying hundreds GB of block weights to your GPU memory to generate every new token in a sequence.

  • Even though Petals may send data to a server on a different continent over the Internet, it turns out that Petals is much faster since it just doesn't send much. It only sends activations, which are thousands times smaller than weights of one BLOOM block (and the weights are already loaded to a server's GPU).

Regarding "Petals on 3 physical servers" vs. "14 real servers":

  • The first setup is artificial: we use 3 high-end servers staying in one room and simulate different latency/bandwidth restrictions for research purposes.
  • The second setup is realistic: we use 14 different servers with customer-grade GPUs, spread across Europe and North America. So the GPUs are heterogeneous, latency may vary, we may have packet loss, etc.

Regarding "8 clients running simultaneously":

  • Other rows measure the performance of a client if it uses a Petals swarm alone. This row shows how the performance degrades if we have 8 concurrent clients.

You can find these and other details of the experiments in our paper (the table I've sent is from its updated version that we didn't publish yet).

3