
SleekEagle t1_j07bxyi wrote

I thought distributed training over the internet was prohibitively slow due to communication overhead - wouldn't you run into the same issue when fine-tuning? If anyone could ELI5 why/how this works that would be awesome!


hx-zero OP t1_j07d431 wrote

Training from scratch is slow because you need to synchronize all model weights/gradients on each step (though it's possible for somewhat smaller models with some optimizations).
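To get a feel for why full synchronization is the bottleneck, here is a rough back-of-envelope sketch (the model size and bandwidth are illustrative assumptions, not figures from this thread):

```python
# Illustrative estimate: time to exchange full fp16 gradients over a home internet link.
params = 10e9          # assume a 10B-parameter model
bytes_per_param = 2    # fp16 gradients
bandwidth = 100e6 / 8  # assume a 100 Mbit/s connection, in bytes/s

grad_bytes = params * bytes_per_param      # ~20 GB of gradients per step
seconds_per_sync = grad_bytes / bandwidth  # ~1600 s, i.e. roughly 27 minutes per step
print(f"~{seconds_per_sync / 60:.0f} min just to exchange gradients each training step")
```

With numbers like these, every optimizer step costs tens of minutes of pure communication, which is why naive from-scratch training over the internet doesn't scale.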

In the case of fine-tuning (especially prompt tuning), you train only a small percentage of the weights, so the communication overhead is no longer that large. Even so, this is enough to adapt the LM to most downstream tasks. A minimal sketch of the idea is below.
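The sketch below is a generic PyTorch illustration of prompt tuning (class names and sizes are made up for the example, and it is not the actual Petals API): the pretrained backbone is frozen, and only a tiny matrix of "soft prompt" embeddings is trained, so only those gradients ever need to cross the network.

```python
import torch
import torch.nn as nn

class PromptTunedLM(nn.Module):
    """Hypothetical wrapper: frozen backbone + trainable soft prompts."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_prompt_tokens: int = 16):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze all pretrained weights
            p.requires_grad = False
        # Trainable soft-prompt embeddings prepended to every input sequence
        self.soft_prompts = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.shape[0]
        prompts = self.soft_prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompts, input_embeds], dim=1))

# Toy frozen backbone standing in for the real (distributed) LM
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
model = PromptTunedLM(backbone, hidden_size=256)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")
```

Only the soft-prompt parameters (a few thousand values here, versus millions in the backbone) need gradients, so the per-step communication shrinks by orders of magnitude compared to syncing the full model.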
