Submitted by alexnasla t3_yikumt in MachineLearning

Hey there,

So I'm trying to figure out how to significantly speed up my training (trying to 10x it) and I'm trying to work out what's going on here. I'm using PyTorch as the framework, with 4 sequential layers: Dense + Conv1d + LSTM + Dense. I have a batch size of 80,000, and when I ran it on a K80 vs an A100 I only saw a 14% increase in performance: in the given time frame the K80 completed about 1400 epochs and the A100 about 1600. To me this likely means what I'm doing is NOT bound by the GPU at all, since the hardware difference should have accounted for something like a 30x increase in performance, yeah? I don't think RAM is the issue; the A100 has 80GB of HBM2 VRAM, more than I ever use. So if it's not GPU power, and not RAM, it's either CPU or storage?
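
(For reference, a minimal sketch of the kind of stack described; all layer sizes and shapes here are hypothetical, since the post doesn't give them.)

```python
import torch
import torch.nn as nn

# Rough sketch of the described Dense + Conv1d + LSTM + Dense stack.
# All dimensions are made up; the post doesn't specify them.
class DenseConvLSTMDense(nn.Module):
    def __init__(self, n_features=64, hidden=128, n_out=1):
        super().__init__()
        self.fc_in = nn.Linear(n_features, hidden)
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc_out = nn.Linear(hidden, n_out)

    def forward(self, x):                # x: (batch, seq_len, n_features)
        h = self.fc_in(x)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, seq)
        h, _ = self.lstm(h)              # the LSTM walks the sequence one step at a time
        return self.fc_out(h[:, -1])     # predict from the last time step

model = DenseConvLSTMDense()
out = model(torch.randn(32, 100, 64))    # (batch, seq_len, features)
```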

It seems I need to parallelize the training in order to get the speed I'm looking for?

Anyone have any insight?

5

Comments

Kon-kkk t1_iuj7nyz wrote

  1. What framework?
  2. What kind of network/model?
  3. Try to reduce CPU-GPU data transfers during training.

Try Nsight Systems to profile one iteration (both forward and backward) and see whether there are long idle gaps between GPU kernels. Idle gaps mean GPU utilization is low and many operations are being done on the CPU side. If you are using TensorFlow you can enable XLA to accelerate training; I believe PyTorch has a comparable DL compiler for training. You can also enable AMP (automatic mixed precision / fp16) to accelerate, as in the sketch below.
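
A minimal PyTorch sketch of the AMP suggestion; the model, loss, optimizer, and data here are stand-ins for whatever the training loop already uses:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data; swap in the real ones.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1)).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(8192, 64), torch.randn(8192, 1)), batch_size=1024)

scaler = torch.cuda.amp.GradScaler()        # keeps fp16 gradients from underflowing

for x, y in loader:
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():         # run eligible ops in fp16
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```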

2

fnbr t1_iuj8h11 wrote

Have you profiled your code? That would be the first thing I would do.
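
For example, one training step under PyTorch's built-in profiler; a minimal sketch with a stand-in model and random data:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Tiny stand-in model/data just to show the profiling pattern.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1)).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x, y = torch.randn(1024, 64, device="cuda"), torch.randn(1024, 1, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Lots of CPU time with little CUDA time points at a data/CPU bottleneck.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```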

What sort of utilization of the GPU are you getting?

It's likely you're bottlenecked by feeding data in; for supervised learning, that's often the case.

I'm happy to offer suggestions for feeding data in if you're using TensorFlow/JAX.
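
Since the OP is on PyTorch, the equivalent data-feeding knobs live on `DataLoader`; a sketch, where `dataset` and the worker count are placeholders:

```python
from torch.utils.data import DataLoader

# `dataset` is whatever Dataset the training already uses; 8 workers is just a starting point.
loader = DataLoader(
    dataset,
    batch_size=80_000,
    shuffle=True,
    num_workers=8,            # prepare batches in parallel on the CPU
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)
```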

10

alexnasla OP t1_iuj8se6 wrote

Oh my bad!

  1. PyTorch
  2. It's 4 sequential layers: Dense + Conv1d + LSTM + Dense.
  3. Hmm, any resources you know of that I can check out to learn more about doing that?

2

alexnasla OP t1_iuj9596 wrote

So right now the bottleneck is such that I need to speed up training by about 10x, so that training time matches sampling time and I can sample and train at the same time without the bottleneck.

−3

alexnasla OP t1_iujbukx wrote

I'm pretty sure it's running on the GPU. I don't remember what the GPU utilization was, though; I'll take a look when I get a chance.

The test that I mentioned ran for 8 hours.

1

patient_zer00 t1_iujl1if wrote

Disk IO is often a bottleneck.

Also, even though using a GPU will speed up training with LSTMs too, computing the gradient requires the whole sequence to be processed one step after another, which can't be parallelized across time steps. That's probably why your speed increase isn't that big going from a K80 to an A100.

Edit: typos

4

K-o-s-l-s t1_iujldkh wrote

What are you using to log and monitor your jobs? Knowing CPU, RAM, and GPU utilisation will make this a lot easier to understand.

I agree with the poster above; no appreciable speed-up switching between a K80 and an A100 makes me suspect that the GPU is not being utilised at all.
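
A quick way to get that without a full monitoring stack is to log it from the training loop; a sketch (`torch.cuda.utilization()` needs the nvidia-ml-py/pynvml package installed):

```python
import torch

def log_gpu_stats(step: int) -> None:
    """Print coarse GPU stats; utilization() requires nvidia-ml-py (pynvml)."""
    util = torch.cuda.utilization()                   # % of time the GPU was busy recently
    mem_gib = torch.cuda.memory_allocated() / 2**30   # memory currently held by tensors
    print(f"step {step}: GPU util {util}%, allocated {mem_gib:.2f} GiB")

# Call it every N steps from the training loop, e.g.:
# if step % 100 == 0: log_gpu_stats(step)
```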

1

JustOneAvailableName t1_iujqrr1 wrote

> It's 4 sequential layers: Dense + Conv1d + LSTM + Dense.

I think this is not enough to saturate the A100. Try to 10x the batch size by just repeating the data. It's useless for training, but it should increase GPU utilization without increasing disk utilization. Handy for confirming the bottleneck.
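
A sketch of that check; `loader`, `model`, and `criterion` are placeholders for whatever the existing training code uses:

```python
import torch

x, y = next(iter(loader))                            # placeholder loader from the existing setup
x_big = x.repeat(10, *[1] * (x.dim() - 1)).cuda()    # same samples tiled 10x along the batch dim
y_big = y.repeat(10, *[1] * (y.dim() - 1)).cuda()

out = model(x_big)                                   # watch GPU utilization while this runs
loss = criterion(out, y_big)
loss.backward()                                      # throwaway step, just to load the GPU
```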

1

Historical_Ad2338 t1_iujw7fv wrote

LSTMs are quite slow in practice (because they can't exploit parallel computation across time steps), which is one of the main reasons Transformers have taken off (besides improved performance). In an NLP setting with sequence lengths of ~1024 and models in the 100-million-parameter range, a Transformer can go through an epoch about 10x faster (though it does need more memory) in my experience. I'd recommend a Transformer, and if recurrence is really important, you can always use SRU++, which gives parallelizable recurrence.
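
For illustration, a minimal encoder-only sketch of such a swap; the dimensions are hypothetical, and a real model would also add positional encodings:

```python
import torch
import torch.nn as nn

# Hypothetical drop-in for the LSTM block: a small encoder-only Transformer.
encoder_layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = torch.randn(32, 256, 128)   # (batch, seq_len, d_model)
out = encoder(x)                # all 256 time steps are processed in parallel,
                                # unlike an LSTM, which walks them one at a time
```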

5