Submitted by Shardsmp t3_zil35t in MachineLearning
Fellow machine learning enthusiast here!
I want to train a large NLP model and I'm wondering whether it's worth it to use Google Cloud's TPUs for it. I already have an Nvidia RTX 3060 Laptop GPU with 8.76 TFLOPS, but I was unable to find the exact performance (in TFLOPS, so I can compare them) of Google's TPU v3 and v4.
I know TPUs are much faster and more optimized for machine learning than GPUs (I think the factor is around 12x), but I'm still wondering whether it's worth it to just build a graphics card rig for the long term, since the pricing and estimation seem unclear to me: I can't see how much I'm paying per TFLOP.
Has anyone done the numbers on price/performance and hourly cost? Also, is there any factor I missed? Thanks a lot in advance!
spaccetime t1_izsnm7n wrote
8x NVIDIA A100 = $25/hour
TPU v3-4 = $8/hour
TPU v4-4 = $12/hour
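To turn those prices into the $/TFLOP figure OP was asking for, you can normalize by peak throughput. A rough sketch; the peak bf16 TFLOPS values and the chips-per-slice counts are my own assumptions from spec sheets, so double-check them against current NVIDIA and Google Cloud documentation:

```python
# Rough USD per TFLOP-hour from the prices above. The peak-TFLOPS
# figures are assumed bf16 spec-sheet numbers, and the chip counts
# per TPU slice are assumptions -- verify both before relying on this.
options = {
    # name: (hourly price in USD, assumed peak bf16 TFLOPS)
    "8x A100":  (25.0, 8 * 312.0),   # ~312 TFLOPS bf16 per A100
    "TPU v3-4": (8.0,  4 * 123.0),   # assuming ~123 TFLOPS per v3 chip
    "TPU v4-4": (12.0, 4 * 275.0),   # assuming ~275 TFLOPS per v4 chip
}

for name, (price, tflops) in options.items():
    print(f"{name}: {price / tflops:.4f} USD per TFLOP-hour")
```

Peak FLOPS are only an upper bound, though; sustained throughput on your actual model matters more, which is why measurements like the one below are more telling.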
When training BERT on 27B tokens, I measured faster training times on the TPU.
Nvidia's GPUs are great for deep learning, but DL is not what they were designed for. They have CUDA cores and even RT cores; you pay extra for hardware that is good at rendering, but you use little or none of that when training deep learning models.
Google's TPU is engineered solely for deep learning. The MXU (matrix multiply unit) is unrivaled.
For short-term usage take the TPU; for long-term usage, a DGX station or another cluster.
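A quick way to sanity-check that split is a break-even calculation: hardware price divided by the hourly rental rate gives the number of rented hours at which buying pays off. A minimal sketch; both figures are placeholder assumptions, not quotes:

```python
# Break-even point between renting cloud accelerators and buying hardware.
# Both numbers below are placeholder assumptions for illustration.
dgx_price_usd = 100_000.0   # assumed purchase price for a DGX-class station
rental_usd_per_hour = 25.0  # e.g. the 8x A100 rate quoted above

break_even_hours = dgx_price_usd / rental_usd_per_hour
print(f"Buying pays off after ~{break_even_hours:,.0f} rented hours "
      f"(~{break_even_hours / 24 / 365:.1f} years of 24/7 use)")
```

This ignores electricity, cooling, and depreciation, so the real break-even comes later, but it shows why sustained 24/7 training favors owning the hardware.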
TPU is not for experimental usage. Use it only when you are sure that your model, data and parameterization make sense.
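One practical way to follow that advice: write your training step against a backend-agnostic framework and smoke-test it locally before paying for a TPU slice. A minimal JAX sketch (the toy model and data are made up for illustration); the same code runs unchanged on CPU, GPU, or a Cloud TPU VM:

```python
import jax
import jax.numpy as jnp

# Report whichever backend is available: on a Cloud TPU VM this lists
# TPU devices, locally it falls back to GPU or CPU.
print("backend:", jax.default_backend(), "| devices:", jax.devices())

# Toy linear-regression step, just to smoke-test the pipeline end to end.
def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

@jax.jit
def train_step(w, x, y, lr=0.1):
    return w - lr * jax.grad(loss_fn)(w, x, y)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (256, 8))
y = x @ jnp.arange(8, dtype=jnp.float32)  # known "true" weights 0..7
w = jnp.zeros(8)
for _ in range(200):
    w = train_step(w, x, y)
print("recovered weights:", w)
```

If a run like this converges locally and your input pipeline keeps up, you have much better odds that the paid TPU hours go into real training rather than debugging.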