AlmightySnoo t1_is5twx8 wrote

You're memory-bound on neural network problems because frameworks usually perform multiple loads/stores to and from the GPU's global memory at each activation/layer. Operator fusion, as done for example by PyTorch's JIT compiler, helps a bit, but it cannot fuse operators with a matrix multiplication since the latter is usually delegated to cuBLAS. NN frameworks need to rethink this "okay, efficient matrix multiplication algos aren't trivial, so let's delegate this to blackbox code like cuBLAS" mentality, as I think it's a shameful waste of chip power that caps the potential of GPUs.
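
For a concrete picture of where that fusion boundary sits, here's a minimal sketch (the `bias_gelu` helper and the shapes are just illustrative, and it assumes a CUDA build of PyTorch): the JIT can fuse the pointwise bias + GELU chain into one kernel, but the matmul in front of it still goes through cuBLAS, so its output takes a full round trip through global memory first.

```python
import torch

# A pointwise bias + GELU chain. In eager mode each op launches its own
# kernel, reading and writing the full tensor in global memory every time.
def bias_gelu(x, bias):
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y * 0.70710678))  # exact GELU, 1/sqrt(2)

# TorchScript can fuse the pointwise chain above into a single kernel...
fused_bias_gelu = torch.jit.script(bias_gelu)

x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, device="cuda")

# ...but the matmul still runs as a separate cuBLAS call, so its result is
# written out to global memory and read back in by the fused epilogue.
out = fused_bias_gelu(torch.matmul(x, w), b)
```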

17

programmerChilli t1_is7vgbp wrote

I mean... it's hard to write efficient matmuls :)

But... recent developments (e.g. CUTLASS and Triton) do allow NN frameworks to write efficient matmuls, so I think you'll start seeing them used to fuse other operators with matmuls :)

You can already see some of that being done in projects like AITemplate.
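
To make that concrete, here's a rough Triton sketch in the spirit of its matmul tutorial (the kernel name, block sizes, and the ReLU epilogue are my own illustrative choices, not AITemplate's actual code; to keep it short, only the K dimension is masked, so M and N are assumed to be multiples of the block sizes):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_relu_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of relu(A @ B).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=offs_k[None, :] + k < K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] + k < K, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    # Fused epilogue: apply the activation while the output tile is still in
    # registers, instead of writing the raw matmul result to global memory
    # and re-reading it in a separate pointwise kernel.
    acc = tl.maximum(acc, 0.0)

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)

def matmul_relu(a, b):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_relu_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```

The whole trick is the one epilogue line: since the activation is applied while the tile is still in registers, the unfused version's extra kernel launch and global-memory round trip simply disappear.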

I will note one other thing though - fusing operators with matmuls is not as big of a bottleneck in training; this optimization primarily helps in inference.

4