
SleekEagle t1_j07bxyi wrote

I thought distributed training over the internet was prohibitively slow due to communication overhead - wouldn't you run into the same issue when fine-tuning? If anyone could ELI5 why/how this works that would be awesome!


hx-zero OP t1_j07d431 wrote

Training from scratch is slow because you need to synchronize all model weights/gradients on each step (though it's possible for somewhat smaller models with some optimizations).
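To get a feel for why full synchronization is the bottleneck, here is a rough back-of-envelope sketch (the model size and bandwidth are illustrative assumptions, not figures from this thread):

```python
# Illustrative estimate: time to exchange full fp16 gradients over a home internet link.
params = 10e9          # assume a 10B-parameter model
bytes_per_param = 2    # fp16 gradients
bandwidth = 100e6 / 8  # assume a 100 Mbit/s connection, in bytes/s

grad_bytes = params * bytes_per_param      # ~20 GB of gradients per step
seconds_per_sync = grad_bytes / bandwidth  # ~1600 s, i.e. roughly 27 minutes per step
print(f"~{seconds_per_sync / 60:.0f} min just to exchange gradients each training step")
```

With numbers like these, every optimizer step costs tens of minutes of pure communication, which is why naive from-scratch training over the internet doesn't scale.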

In the case of fine-tuning (especially prompt tuning), you train only a small percentage of the weights, so the communication overhead is no longer that large. Even so, this is enough to adapt the LM to most downstream tasks. A minimal sketch of the idea is below.
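The sketch below is a generic PyTorch illustration of prompt tuning (class names and sizes are made up for the example, and it is not the actual Petals API): the pretrained backbone is frozen, and only a tiny matrix of "soft prompt" embeddings is trained, so only those gradients ever need to cross the network.

```python
import torch
import torch.nn as nn

class PromptTunedLM(nn.Module):
    """Hypothetical wrapper: frozen backbone + trainable soft prompts."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_prompt_tokens: int = 16):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze all pretrained weights
            p.requires_grad = False
        # Trainable soft-prompt embeddings prepended to every input sequence
        self.soft_prompts = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.shape[0]
        prompts = self.soft_prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompts, input_embeds], dim=1))

# Toy frozen backbone standing in for the real (distributed) LM
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
model = PromptTunedLM(backbone, hidden_size=256)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")
```

Only the soft-prompt parameters (a few thousand values here, versus millions in the backbone) need gradients, so the per-step communication shrinks by orders of magnitude compared to syncing the full model.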
