Submitted by ggerganov t3_y0nvqu in MachineLearning
ggerganov OP t1_irw8eho wrote
Reply to comment by ThisIsMyStonerAcount in [P] Pure C/C++ port of OpenAI's Whisper by ggerganov
Essentially, it's the mat mul routine that I have re-implemented. It consumes more than 90% of the computation.
I tried using the built-in BLAS implementation that comes with Apple's Accelerate framework. My F16 mat mul performed better than cblas_sgemm, and the Accelerate framework doesn't provide F16 overloads.
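For reference, this is roughly what the F32 baseline being compared against looks like: a call to cblas_sgemm from the Accelerate framework (computing C = alpha*A*B + beta*C in row-major layout). cblas_sgemm is the real CBLAS routine; the wrapper function and parameter choices below are just an illustrative sketch, linked with -framework Accelerate.

    #include <Accelerate/Accelerate.h>

    // Illustrative wrapper: multiply an M x K matrix A by a K x N matrix B
    // into an M x N matrix C, all row-major F32.
    void matmul_f32(int M, int N, int K, const float *A, const float *B, float *C) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f, A, K,   // A: M x K, leading dimension K
                          B, N,   // B: K x N, leading dimension N
                    0.0f, C, N);  // C: M x N, leading dimension N
    }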
I didn't want to include external BLAS implementations, because I wanted an inference implementation that does not depend on anything and that you can easily build and try.
Also, a major factor was that this entire project is mostly a learning experience to understand how transformers work at a lower level and to improve my C programming and optimization skills.
One thing I noticed is that the F32 mat mul from Torch outperforms my F16 mat mul on M1 for big matrices (> 1024x1024). It seems to use MKL under the hood. For bigger sizes, it can be up to 3 times faster. It would be interesting to explore how this can be achieved manually.
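To make the F16 approach concrete: a minimal sketch, not the project's actual kernel, of what an F16 dot product (the inner loop of a mat mul) can look like with ARM NEON intrinsics on M1. The function name dot_f16 and the assumption that n is a multiple of 8 are illustrative; the loads are F16, while accumulation is widened to F32 for accuracy.

    #include <arm_neon.h>

    // Illustrative F16 dot product: load 8 half-precision values at a time,
    // widen to F32, and fuse multiply-add into two F32 accumulators.
    // Assumes n is a multiple of 8 and an AArch64 target with FP16 support.
    static float dot_f16(const __fp16 *x, const __fp16 *y, int n) {
        float32x4_t sum0 = vdupq_n_f32(0.0f);
        float32x4_t sum1 = vdupq_n_f32(0.0f);
        for (int i = 0; i < n; i += 8) {
            float16x8_t vx = vld1q_f16(x + i);
            float16x8_t vy = vld1q_f16(y + i);
            sum0 = vfmaq_f32(sum0, vcvt_f32_f16(vget_low_f16(vx)),  vcvt_f32_f16(vget_low_f16(vy)));
            sum1 = vfmaq_f32(sum1, vcvt_f32_f16(vget_high_f16(vx)), vcvt_f32_f16(vget_high_f16(vy)));
        }
        return vaddvq_f32(vaddq_f32(sum0, sum1));
    }

Storing the weights in F16 halves the memory traffic per element, which is why a hand-rolled kernel like this can beat an F32 cblas_sgemm call even without matching a tuned BLAS on raw FLOPs.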
ThisIsMyStonerAcount t1_irx8urr wrote
So, in case you're not aware, matrix-matrix multiplication is THE workhorse of every BLAS implementation. I'm not too familiar with the Accelerate framework, but the really good implementations (e.g. MKL from Intel, or OpenBLAS) are extremely highly optimized (as in: there are people who have been working on this professionally for years as their main job). You're very unlikely to get close to their performance, and shouldn't feel bad if they beat you by a lot.
I'd suggest giving OpenBLAS a whirl if you want to optimize for the absolute top achievable speeds. It's the best free BLAS implementation out there. For learning, googling for "cache optimized gemm" will give you good starting points on techniques for achieving SOTA performance in matrix-matrix multiplication.
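To illustrate the "cache optimized gemm" idea: a toy loop-blocking (tiling) sketch in C, assuming row-major square matrices and a zero-initialized C. The tile size and function name are made up for the example; real implementations like OpenBLAS and MKL add panel packing, register blocking, and hand-tuned vectorized micro-kernels on top of this basic structure.

    #include <stddef.h>

    #define BLOCK 64  // tile size chosen so the working set stays in cache

    // Toy blocked GEMM: C += A * B for n x n row-major matrices.
    // The i/k/j tile loops keep a BLOCK x BLOCK working set hot in cache
    // instead of streaming entire rows and columns on every iteration.
    void sgemm_blocked(int n, const float *A, const float *B, float *C) {
        for (int i0 = 0; i0 < n; i0 += BLOCK)
        for (int k0 = 0; k0 < n; k0 += BLOCK)
        for (int j0 = 0; j0 < n; j0 += BLOCK)
            for (int i = i0; i < i0 + BLOCK && i < n; i++)
            for (int k = k0; k < k0 + BLOCK && k < n; k++) {
                const float a = A[(size_t)i*n + k];
                for (int j = j0; j < j0 + BLOCK && j < n; j++)
                    C[(size_t)i*n + j] += a * B[(size_t)k*n + j];
            }
    }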