Submitted by Troll_of_the_bridge t3_y3zm5p in deeplearning
Hello all,
I've been looking at using a small neural network (~500 trainable weights) implemented in PyTorch to solve a regression problem where my features and targets are originally stored in double precision. I've experimented with casting both my features/targets and my NN weights to single or double precision, but have noticed only a negligible difference in the time it takes to train the model over a fixed number of epochs. I've found this to be true when training on CPU (AMD Ryzen 9 5900X 12-core, 3.70 GHz) or GPU (RTX 2070 Super, CUDA 11.7).
I've also run this experiment training an MLP on the Fashion MNIST dataset exactly as described in the PyTorch quickstart tutorial (https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html). In that case, the per-epoch training time only drops from 5.1 to 4.8 seconds when going from double- to single-precision weights on my GPU.
Why don't I see a more significant difference in training time between these precisions? Any insight would be appreciated!
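For reference, my comparison looks roughly like this (a minimal sketch, not my exact script; the model, synthetic data, and step count are just illustrative):

```python
import time
import torch
from torch import nn

def time_training(dtype, steps=1000, device="cuda"):
    torch.manual_seed(0)
    # Synthetic regression data standing in for my real features/targets.
    X = torch.randn(1024, 16, dtype=dtype, device=device)
    y = torch.randn(1024, 1, dtype=dtype, device=device)

    # Tiny MLP; .to(dtype=...) casts all weights to the chosen precision.
    model = nn.Sequential(
        nn.Linear(16, 24), nn.ReLU(), nn.Linear(24, 1)
    ).to(device=device, dtype=dtype)

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    if device == "cuda":
        torch.cuda.synchronize()  # don't let async kernel launches skew the timing
    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

print("float32:", time_training(torch.float32))
print("float64:", time_training(torch.float64))
```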
sutlusalca t1_isbclm9 wrote
Story: You went to the supermarket and bought a bottle of milk. The next day, you went again and bought two bottles of milk. You only spent a few seconds more to carry the extra bottle. Same idea here: with such a tiny model, the fixed overhead of each training step is the trip to the store, and the extra arithmetic from double precision is just the second bottle.