Ulfgardleo t1_ir9xy3t wrote
Reply to comment by Thorusss in [R] Discovering Faster Matrix Multiplication Algorithms With Reinforcement Learning by EducationalCicada
You seem to be confused.
-
Experiment 1 uses small 5x5 matrices, not block matrices. There they only count the number of multiplications. These algorithms are not faster than SIMD implementations of the standard 5x5 matrix multiply; otherwise they would have shown that off proudly.
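To make concrete what "counting mults" means (my own minimal NumPy sketch, not from the paper): these algorithms are rank decompositions (U, V, W) of the matrix multiplication tensor, and the number of scalar multiplications is the number of columns r. Below is the trivial rank-125 decomposition for 5x5; a "better" algorithm in Experiment 1's sense is the same construction with a smaller r, which by itself says nothing about wall-clock speed:

```python
import numpy as np

n, r = 5, 5**3  # naive algorithm: one scalar mult per (i, k, j) triple

# Factor matrices of a bilinear algorithm; the column count r is exactly
# the number of scalar multiplications (the rank of the decomposition).
U = np.zeros((n * n, r))
V = np.zeros((n * n, r))
W = np.zeros((n * n, r))
for t, (i, k, j) in enumerate(np.ndindex(n, n, n)):
    U[i * n + k, t] = 1  # picks a_ik
    V[k * n + j, t] = 1  # picks b_kj
    W[i * n + j, t] = 1  # routes the product into c_ij

def bilinear_matmul(A, B):
    m = (U.T @ A.ravel()) * (V.T @ B.ravel())  # the r scalar multiplications
    return (W @ m).reshape(n, n)

A, B = np.random.randn(n, n), np.random.randn(n, n)
assert np.allclose(bilinear_matmul(A, B), A @ B)
```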
-
Experiment 2 was about 4x4 block matrices. But here the "10-20% faster than commonly used algorithms" is actually an overstatement of the results. On GPUs, their implementation is only about 5% faster than their own default JAX implementation of Strassen. The gap on TPUs could just mean that their JAX compiler sucks for TPUs. (//Edit: by now I low-key assume that the 10-20% refers to standard cBLAS, because I do not get 20% compared to Strassen for any result in Figure 5. And how could they, when they never even get more than a 20% improvement over cBLAS.)
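For anyone who wants to sanity-check this kind of claim themselves, here is roughly what such a comparison looks like (my own JAX sketch, not the paper's benchmark: one level of classical Strassen on top of jnp.dot, timed against jnp.dot itself; the 4096 size is my arbitrary choice):

```python
import time
import jax
import jax.numpy as jnp

def strassen_one_level(A, B):
    # One level of classical 2x2 Strassen; the 7 block products fall back to jnp.dot.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return jnp.block([[M1 + M4 - M5 + M7, M3 + M5],
                      [M2 + M4, M1 - M2 + M3 + M6]])

ka, kb = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(ka, (4096, 4096), dtype=jnp.float32)
B = jax.random.normal(kb, (4096, 4096), dtype=jnp.float32)

for name, fn in [("jnp.dot", jax.jit(jnp.dot)),
                 ("strassen", jax.jit(strassen_one_level))]:
    fn(A, B).block_until_ready()  # warm-up run, triggers compilation
    t0 = time.perf_counter()
    fn(A, B).block_until_ready()
    print(f"{name}: {time.perf_counter() - t0:.4f} s")
```

Whether the Strassen step wins here depends entirely on hardware, dtype and size, which is exactly why the choice of baseline implementation matters so much.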
-
They do not cite any of the papers concerned with the efficient implementation of Strassen, especially the memory-efficient scheme from 1994: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.6887 It is unclear whether a GPU implementation of that scheme would be faster, since they do not even discuss the GPU implementation of their own Strassen variant. They do not claim that their algorithm has a better complexity, so we are completely reliant on their implementation of Strassen being sensible.
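For reference, this is the kind of implementation detail those papers are about (my sketch of the well-known Winograd form of Strassen's step; I am not claiming this is the 1994 scheme): the same 7 block multiplications, but only 15 block additions instead of Strassen's 18, and correspondingly fewer temporaries to keep alive:

```python
import jax.numpy as jnp

def winograd_strassen_step(A, B):
    # Winograd variant of Strassen: 7 block multiplications, 15 block additions.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    # 8 additions to form the shared operands
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21
    # 7 block products
    M1 = S2 @ T2; M2 = A11 @ B11; M3 = A12 @ B21
    M4 = S3 @ T3; M5 = S1 @ T1; M6 = S4 @ B22; M7 = A22 @ T4
    # 7 additions to combine them, reusing U1..U3
    U1 = M2 + M1; U2 = U1 + M4; U3 = U1 + M5
    return jnp.block([[M2 + M3, U3 + M6],
                      [U2 - M7, U2 + M5]])
```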