neanderthal_math t1_ir7l0k3 wrote
Reply to comment by Ulfgardleo in [R] Discovering Faster Matrix Multiplication Algorithms With Reinforcement Learning by EducationalCicada
In practice, do libraries like cuBLAS and MKL do matrix multiplication the standard way, or do they use fancy decompositions?
I remember that when I was young, the ATLAS library would probe your hardware, run a bunch of matmuls, and figure out what the "optimal" configuration was for your system.
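Roughly the kind of thing I mean, as a sketch (the candidate block sizes and the timing loop here are made up for illustration, not ATLAS's actual search):

```c
/* Illustrative ATLAS-style autotuning sketch: time a blocked matmul with a
 * few candidate tile sizes and keep the fastest one for this machine.
 * The candidates and timing loop are invented for this example. */
#include <stdio.h>
#include <time.h>

#define N 512
static double A[N][N], B[N][N], C[N][N];

static void matmul_blocked(int bs) {
    for (int ii = 0; ii < N; ii += bs)
        for (int kk = 0; kk < N; kk += bs)
            for (int jj = 0; jj < N; jj += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++)
                        for (int j = jj; j < jj + bs; j++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void) {
    int candidates[] = {16, 32, 64, 128};
    int best = candidates[0];
    double best_time = 1e30;
    for (int c = 0; c < 4; c++) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) C[i][j] = 0.0;
        clock_t t0 = clock();
        matmul_blocked(candidates[c]);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %3d: %.3f s\n", candidates[c], dt);
        if (dt < best_time) { best_time = dt; best = candidates[c]; }
    }
    printf("picked block size %d for this machine\n", best);
    return 0;
}
```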
Ulfgardleo t1_ir7lytl wrote
All standard unless the matrices are very large. ATLAS just picks different kernels that "only" change the order of operations to maximize CPU utilization.
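A toy sketch of what "only" changing the order of operations means (not an actual ATLAS kernel): both functions below perform exactly the same N^3 multiply-adds, but the second loop order streams through B contiguously, which is usually far friendlier to the cache.

```c
/* Same schoolbook multiply-adds in both versions; only the loop order differs. */
#include <stddef.h>

void matmul_ijk(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];   /* strided access to B */
            C[i*n + j] = sum;
        }
}

void matmul_ikj(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n * n; i++) C[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i*n + k];
            for (size_t j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];     /* contiguous access to B */
        }
}
```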
Red-Portal t1_ir7xeyo wrote
The funny thing is that the lesson of ATLAS and OpenBLAS was that matrix multiplication hand-optimized by humans at the assembly level is still the best way to squeeze out performance.
harharveryfunny t1_ira5afy wrote
cuDNN supports Winograd on CUDA cores (not sure about Tensor cores) for convolution, but only for certain filter sizes such as 3x3.
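For context, the 1D building block behind the 3x3 case looks roughly like this: the textbook Winograd F(2,3) transform computes two outputs of a 3-tap filter with 4 multiplications instead of 6, which is why the trick only pays off for small, fixed filter sizes. This is just the standard transform as a sketch, not cuDNN's actual implementation:

```c
/* Minimal 1D Winograd F(2,3) sketch: two outputs of a 3-tap correlation
 * with 4 multiplications instead of 6. The 3x3 convolution case nests
 * this idea in 2D. Textbook transform, not cuDNN's kernel. */
#include <stdio.h>

/* y[0] = d0*g0 + d1*g1 + d2*g2,  y[1] = d1*g0 + d2*g1 + d3*g2 */
void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
    float m1 = (d[0] - d[2]) * g[0];
    float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    float m4 = (d[1] - d[3]) * g[2];
    y[0] = m1 + m2 + m3;
    y[1] = m2 - m3 - m4;
}

int main(void) {
    float d[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float g[3] = {0.5f, -1.0f, 2.0f};
    float y[2];
    winograd_f2_3(d, g, y);
    printf("winograd: %.2f %.2f\n", y[0], y[1]);
    printf("direct:   %.2f %.2f\n",
           d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
           d[1]*g[0] + d[2]*g[1] + d[3]*g[2]);
    return 0;
}
```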