
ggerganov OP t1_irv0mle wrote

Hi, yes - I'm using SIMD intrinsics. AVX2 on x86 and NEON on ARM.

I am taking advantage of F16 floating-point arithmetic if available. Otherwise, I use it just as a storage type to reduce memory bandwidth.
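To illustrate the "F16 as a storage type" idea, here is a minimal sketch (not the actual code from the repo) of an AVX2/F16C dot product: half-precision values are kept in memory, widened to F32 on load, and accumulated in single precision. The function name and the assumption that n is a multiple of 8 are just for the example.

```c
#include <immintrin.h>
#include <stdint.h>

// Illustrative sketch only: F16 used purely as a storage format.
// Assumes n is a multiple of 8 and the CPU supports F16C + FMA.
float dot_f16_avx(const uint16_t *x, const uint16_t *y, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        // load 8 half-precision values and widen them to F32
        __m256 xv = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x + i)));
        __m256 yv = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(y + i)));
        // accumulate in single precision
        acc = _mm256_fmadd_ps(xv, yv, acc);
    }
    // horizontal sum of the 8 lanes
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```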

5

ThisIsMyStonerAcount t1_irvmont wrote

so you rewrote all matrix products, without using BLAS?

EDIT: if so: why not use OpenBLAS instead (which afaik supports fp16 and bf16, too)?

6

ggerganov OP t1_irw8eho wrote

Essentially, it's the mat mul routine that I have re-implemented. It consumes more than 90% of the computation.

I tried using the built-in BLAS implementation that comes with the Apple Accelerate framework. My F16 mat mul performed better than cblas_sgemm, and the Accelerate framework didn't provide F16 overloads.
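For context, the FP32 BLAS baseline in question is the standard CBLAS GEMM call. A rough sketch of how it would be invoked through Accelerate (matrix shapes and row-major layout are assumptions for the example):

```c
#include <Accelerate/Accelerate.h>  // provides cblas_sgemm on macOS

// Illustrative only: C (m x n) = A (m x k) * B (k x n), all FP32, row-major.
void sgemm_accelerate(int m, int n, int k,
                      const float *A, const float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A, k,   // A with leading dimension k
                      B, n,   // B with leading dimension n
                0.0f, C, n);  // C with leading dimension n
}
```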

I didn't want to include external BLAS implementations, because I wanted an inference implementation that does not depend on anything and that you can easily build and try.

Also, a major factor was that this entire project is mostly a learning experience to understand how transformers work at a lower level and to improve my C programming and optimization skills.

One thing I noticed is that the FP32 mat mul from Torch outperforms my F16 mat mul on M1 for big matrices (> 1024x1024). It seems that it uses MKL under the hood. For bigger sizes, it can be up to 3 times faster. It would be interesting to explore how this can be achieved manually.

4

ThisIsMyStonerAcount t1_irx8urr wrote

So, in case you're not aware, matrix-matrix multiplication is THE workhorse of every BLAS implementation. I'm not too familiar with the Accelerate framework, but the really good implementations (e.g. MKL from Intel or OpenBLAS) are extremely highly optimized (as in: there are people who have been working on this professionally for years as their main job). You're very unlikely to get close to their performance, and you shouldn't feel bad if they beat you by a lot.

I'd suggest giving OpenBLAS a whirl if you want to optimize for the absolute top achievable speeds. It's the best free BLAS implementation out there. For learning, googling for "cache optimized gemm" will give you good starting points on techniques for achieving SOTA performance in matrix-matrix multiplication.
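To give an idea of where those searches lead, the simplest form of a cache-blocked GEMM looks something like the sketch below (illustrative only; real BLAS kernels add packing, register tiling and SIMD on top of this). The block size and the assumption that n is a multiple of it are just for the example.

```c
#include <stddef.h>

#define BS 64  // tile size chosen so that blocks of A, B, C stay in cache

// Minimal cache-blocked SGEMM sketch: C += A * B, all n x n, row-major.
// Assumes n is a multiple of BS and C is zero-initialized by the caller.
void sgemm_blocked(int n, const float *A, const float *B, float *C) {
    for (int i0 = 0; i0 < n; i0 += BS)
        for (int k0 = 0; k0 < n; k0 += BS)
            for (int j0 = 0; j0 < n; j0 += BS)
                // multiply the (i0,k0) tile of A with the (k0,j0) tile of B
                for (int i = i0; i < i0 + BS; i++)
                    for (int k = k0; k < k0 + BS; k++) {
                        float a = A[(size_t)i * n + k];
                        for (int j = j0; j < j0 + BS; j++)
                            C[(size_t)i * n + j] += a * B[(size_t)k * n + j];
                    }
}
```

The point of the blocking is simply that each tile of B gets reused many times while it is still resident in cache, instead of being streamed from main memory on every pass.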

2