Submitted by hx-zero t3_zl03b0 in MachineLearning
SleekEagle t1_j07bxyi wrote
I thought distributed training over the internet was prohibitively slow due to communication overhead - wouldn't you run into the same issue when fine-tuning? If anyone could ELI5 why/how this works that would be awesome!
hx-zero OP t1_j07d431 wrote
Training from scratch is slow because you need to synchronize all model weights/gradients on every step (though it's still possible for somewhat smaller models with certain optimizations).
In the case of fine-tuning (especially prompt tuning), you train only a small percentage of the weights, so the communication overhead is much smaller. Even so, this is enough to adapt the LM to most downstream tasks.
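For intuition, here's a minimal sketch of prompt tuning in plain PyTorch (hypothetical code, not Petals' actual API; the class and variable names are made up): the big base model is frozen, and only a small soft-prompt tensor is trainable, so only that tensor's gradients ever need to be exchanged between peers.

```python
# Minimal prompt-tuning sketch: freeze the base LM, train only a soft prompt.
import torch
import torch.nn as nn

class PromptTunedLM(nn.Module):
    def __init__(self, base_lm: nn.Module, embed_dim: int, prompt_len: int = 16):
        super().__init__()
        self.base_lm = base_lm
        # Freeze every weight of the (large) base language model.
        for p in self.base_lm.parameters():
            p.requires_grad = False
        # The only trainable parameters: a small learned prefix of embeddings.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the soft prompt to every sequence in the batch.
        batch = input_embeds.shape[0]
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.base_lm(torch.cat([prompt, input_embeds], dim=1))

# Toy stand-in for a "base LM" so the example runs on its own.
base = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4,
)
model = PromptTunedLM(base, embed_dim=256)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")
```

Even in this toy example the trainable share is under 0.1% of all parameters, and for a multi-billion-parameter LM it shrinks by several more orders of magnitude, which is why per-step gradient exchange stops being the bottleneck.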
SleekEagle t1_j083jkl wrote
Got it, thanks for the explanation!