Submitted by alexnasla t3_yikumt in MachineLearning
Hey there,
So I'm trying to figure out how to significantly speed up my training (aiming for roughly 10x) and I'm trying to work out what's going on here. I'm using PyTorch with 4 sequential layers: Dense + Conv1d + LSTM + Dense. With a batch size of 80,000 I ran the same job on a K80 vs an A100 and only saw about a 14% increase in throughput: in the same time frame the K80 completed about 1,400 epochs and the A100 about 1,600. To me that suggests what I'm doing is NOT bound by the GPU at all, since the hardware difference alone should account for something like a 30x speedup, right? I don't think RAM is the issue either; the A100 has 80GB of HBM2 VRAM, more than I ever use. So if it's not GPU compute and not memory, it's either the CPU or storage?
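For reference, the stack is roughly like this (a minimal sketch; the layer sizes and feature dimensions here are placeholders, not my real config):

```python
import torch
import torch.nn as nn

class SeqModel(nn.Module):
    # Dense -> Conv1d -> LSTM -> Dense, with made-up sizes just to show the structure
    def __init__(self, in_features=32, hidden=64, out_features=1):
        super().__init__()
        self.dense_in = nn.Linear(in_features, hidden)
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.dense_out = nn.Linear(hidden, out_features)

    def forward(self, x):                # x: (batch, seq_len, in_features)
        x = torch.relu(self.dense_in(x))
        x = x.transpose(1, 2)            # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))
        x = x.transpose(1, 2)            # back to (batch, seq_len, hidden) for the LSTM
        x, _ = self.lstm(x)
        return self.dense_out(x[:, -1])  # prediction from the last timestep
```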
It seems I need to parallelize the training to get the speed I'm looking for?
Anyone have any insight?
Kon-kkk t1_iuj7nyz wrote
Try Nsight Systems to profile one iteration (both forward and backward) and see whether there are long idle gaps between GPU kernels. Idle gaps mean GPU utilization is low and many operations are being done on the CPU side. If you were using TensorFlow you could enable XLA to accelerate training; I believe PyTorch has a similar DL compiler for training. You can also enable AMP (automatic mixed precision / fp16) to accelerate.
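Not your exact setup, but a minimal sketch of the AMP part in PyTorch (here `model`, `optimizer`, `loss_fn` and `loader` are assumed to already exist):

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # scales the loss so fp16 gradients don't underflow

for inputs, targets in loader:          # 'loader', 'model', 'optimizer', 'loss_fn' assumed defined
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():     # run the forward pass in mixed precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()       # backward on the scaled loss
    scaler.step(optimizer)
    scaler.update()
```

Before reaching for Nsight you can also wrap a few iterations in `torch.profiler.profile` and sort the table by CUDA time; if most of the wall-clock time shows up as CPU ops or data loading rather than GPU kernels, that confirms the GPU is sitting idle.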