
JanGehlYacht t1_izuzh6o wrote

Adding to this: TPUs are heavily optimized for transformer architectures, since that's what Google runs on them the most. The TPU also comes with a different stack: XLA is the compiler, so you get a lot of compiler optimizations for ML training on the fly. Things like op fusion, which are very beneficial but take manual effort on the CUDA side, come almost for free with XLA. TPUs also scale to very large models easily because of their multi-host high-speed interconnects: most NLP models trained on GPUs top out around ~340B params, while Google published a paper on a 540B-param model almost a year ago.
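
To make the "op fusion for free" point concrete, here's a minimal JAX sketch (the function, shapes, and names are my own illustration, not from anyone's production code): wrapping a function in `jax.jit` hands it to XLA, which fuses the elementwise bias/activation/residual chain into fewer kernels without you writing any fused op by hand.

```python
# Minimal sketch (assumed setup): XLA fuses the elementwise ops below when the
# function is jit-compiled, so no hand-written fused kernel is needed.
import jax
import jax.numpy as jnp

def scaled_gelu_residual(x, w, b):
    # Matmul followed by a chain of elementwise ops that XLA can fuse.
    y = jnp.dot(x, w) + b
    return x + jax.nn.gelu(y) * 0.5

fused = jax.jit(scaled_gelu_residual)  # compiled (and fused) by XLA on first call

x = jnp.ones((8, 128))
w = jnp.ones((128, 128))
b = jnp.zeros((128,))
print(fused(x, w, b).shape)  # (8, 128)
```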

You should evaluate this yourself for your workload, though: if you're training a large NLP model you'll be paying a lot either way. So actually train your model for an hour on each of the options you're considering and compare the throughput (QPS) you get for the dollars you pay. Then pick a stack and get going on the full training run.
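
The comparison itself is just throughput per dollar. A rough sketch of the arithmetic, with placeholder numbers you'd replace with your own one-hour trial results and your cloud provider's pricing (none of these figures are real benchmarks):

```python
# Back-of-the-envelope cost comparison. All numbers are placeholders to be
# filled in from your own trial runs and pricing, not measured results.
trial_runs = {
    # option: (examples processed in the 1-hour trial, hourly price in $)
    "8x GPU": (1_200_000, 24.00),
    "TPU pod slice": (1_500_000, 20.00),
}

for option, (examples_per_hour, dollars_per_hour) in trial_runs.items():
    examples_per_dollar = examples_per_hour / dollars_per_hour
    print(f"{option}: {examples_per_dollar:,.0f} examples per $")
```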

Likely, given that a TPU's only job is ML, you'll find that it's heavily optimized for most ML workloads.

7