Submitted by ggerganov t3_y0nvqu in MachineLearning

Recently, I have been having fun re-implementing the inference of various transformer models (GPT-2, GPT-J) in pure C/C++ in order to run them efficiently on a CPU.

The latest one that I ported is OpenAI Whisper for automatic speech recognition:

https://github.com/ggerganov/whisper.cpp

For smaller models I am able to achieve very nice performance.
For example, here is a demonstration of real-time transcription of audio from the microphone:

whisper.cpp running on a MacBook Pro M1 (CPU only)

Hope you find this project interesting and let me know if you have any questions about the implementation.

157

Comments


LetterRip t1_irt4luw wrote

You might check DeepSpeed MII, Facebook AITemplate, and Google XNNPACK and see how their CPU conversions compare.

https://github.com/facebookincubator/AITemplate

https://github.com/microsoft/DeepSpeed-MII

https://github.com/google/XNNPACK

18

ZY0M4 t1_is5u98q wrote

I thought AITemplate could only run on GPUs? But if it supports the CPU in some way, it would be interesting to test it.

1

mrpogiface t1_irwjphh wrote

How much effort would it be to get this running in WASM / the browser?

4

ggerganov OP t1_irwlluz wrote

I was thinking about this too.

Compiling the code is easy. The problem is that you need to load 75 MB of model data (that's the "tiny" model), and I guess nobody would want to download 75 MB every time they load a page.

Even if a 75 MB asset is acceptable, the next problem is WASM not supporting SIMD, so the performance would be much worse compared to native. How much worse? Not sure.

But nevertheless - it might be fun to try and run it in the browser.

5

mrpogiface t1_irz4hc9 wrote

As a complete WASM novice, I'd appreciate you doing it as a learning exercise for me :) But yeah, everything you outlined makes sense.

1

zzzthelastuser t1_is0oiz5 wrote

I THINK it is possible to keep the file cached, so that if a user returns to the site the model doesn't need to be re-downloaded.

Alternatively, a user could download the model file manually, and your website could ask them to drag and drop the model file to launch the service?

1

justgord t1_irtkq5c wrote

Nice work... are you using 16-bit floats to help speed things up?

But no AVX/SSE?

1

ggerganov OP t1_irv0mle wrote

Hi, yes - I'm using SIMD intrinsics: AVX2 on x86 and NEON on ARM.

I take advantage of F16 floating-point arithmetic where available. Otherwise, I use F16 just as a storage type to reduce memory bandwidth.
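
Roughly the idea on x86 (a minimal sketch, not the actual ggml kernels - assumes AVX2 + F16C + FMA): the values stay in memory as 16-bit halves and are widened to F32 right before the fused multiply-add.

    #include <immintrin.h>
    #include <stdint.h>

    // Dot product over vectors stored as IEEE half precision (raw uint16_t bits).
    // Build with: -mavx2 -mfma -mf16c. Assumes n is a multiple of 8 for brevity.
    static float dot_f16(const uint16_t * x, const uint16_t * y, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            // load 8 halves and convert to 8 floats (F16 is storage only here)
            __m256 xf = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x + i)));
            __m256 yf = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(y + i)));
            acc = _mm256_fmadd_ps(xf, yf, acc); // acc += x*y
        }
        // horizontal sum of the 8 accumulator lanes
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }

On ARM the same pattern maps to NEON (vfmaq_f32, or native FP16 arithmetic where the hardware supports it).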

5

ThisIsMyStonerAcount t1_irvmont wrote

so you rewrote all matrix products, without using BLAS?

EDIT: if so: why not use OpenBLAS instead (which afaik supports fp16 and bf16, too)?

6

ggerganov OP t1_irw8eho wrote

Essentially, it's the mat mul routine that I have re-implemented. It consumes more than 90% of the computation.

I tried using the built-in BLAS implementation that comes with Apple's Accelerate framework. My F16 mat mul performed better than cblas_sgemm, and the Accelerate framework didn't provide F16 overloads.

I didn't want to include external BLAS implementations, because I wanted an inference implementation that does not depend on anything and that you can easily build and try.

Also, a major factor was that this entire project is mostly a learning experience to understand how transformers work at a lower level and to improve my C programming and optimization skills.

One thing I noticed is that the FP32 mat mul from Torch outperforms my F16 mat mul on M1 for big matrices (> 1024x1024). It seems that it uses MKL under the hood. For bigger sizes, it can be up to 3 times faster. It would be interesting to explore how this can be achieved manually.
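
For a rough idea of how such a comparison can be set up (a minimal sketch, not the exact harness behind the numbers above - the matrix size and timing method are arbitrary):

    // Build on macOS with: clang -O3 bench.c -framework Accelerate
    #include <Accelerate/Accelerate.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const int N = 1024; // try 512, 1024, 2048, ...
        float * A = malloc(sizeof(float)*N*N);
        float * B = malloc(sizeof(float)*N*N);
        float * C = malloc(sizeof(float)*N*N);
        for (int i = 0; i < N*N; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        // C = 1.0*A*B + 0.0*C, row-major, no transposes
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0f, A, N, B, N, 0.0f, C, N);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec)*1e3 + (t1.tv_nsec - t0.tv_nsec)/1e6;
        printf("sgemm %dx%d: %.2f ms (%.1f GFLOPS)\n", N, N, ms, 2.0*N*N*N/ms/1e6);

        free(A); free(B); free(C);
        return 0;
    }

The same loop with a custom mat mul in place of cblas_sgemm gives a like-for-like timing.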

4

ThisIsMyStonerAcount t1_irx8urr wrote

So, in case you're not aware, matrix-matrix multiplication is THE workhorse of every BLAS implementation. I'm not too familiar with the Accelerate framework, but the really good implementations (e.g. MKL from Intel, or OpenBLAS) are extremely highly optimized (as in: there are people who are working on this professionally for years as their main job). You're very unlikely to get close to their performance, and shouldn't feel bad if they beat you by a lot.

I'd suggest giving OpenBLAS a whirl if you want to optimize for the absolute top achievable speeds. It's the best free BLAS implementation out there. For learning, googling for "cache optimized gemm" will give you good starting points on techniques for achieving SOTA performance in matrix-matrix multiplication.
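
For anyone following along, the basic "cache blocked" idea looks something like this (a toy sketch - real BLAS libraries add packing, SIMD microkernels and threading on top):

    // C += A*B for row-major N x N matrices, processed in BLOCK-sized tiles
    // so the working set stays in cache. BLOCK is a tuning parameter.
    #define BLOCK 64

    static void gemm_blocked(const float * A, const float * B, float * C, int N) {
        for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK && i < N; i++)
            for (int k = kk; k < kk + BLOCK && k < N; k++) {
                const float a = A[i*N + k];
                for (int j = jj; j < jj + BLOCK && j < N; j++)
                    C[i*N + j] += a * B[k*N + j];
            }
    }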

2

Lirezh t1_iu22kus wrote

That's extremely interesting...
Did you do any performance benchmarks comparing the Python code with your C++ implementation?
Did you consider porting BLIP and CLIP to C++?

1

upperfloormaster t1_iruyu5a wrote

So you've benchmarked your impl against existing ones, and the results were precisely "very nice performance" all across the board.

I see.

−3

ggerganov OP t1_irv0gki wrote

Here are some benchmarks that other people did (both vs CPU and vs GPU):

- vs OpenVINO + ONNX on CPU - more than 2x faster

https://github.com/openai/whisper/discussions/208#discussioncomment-3827022

- vs PyTorch (CPU: i7 11800H, GPU: RTX 3080 Laptop):

https://github.com/ggerganov/whisper.cpp/issues/2#issuecomment-1257808576

- whisper.cpp on Xeon processor

https://github.com/ggerganov/whisper.cpp/issues/16

Also, my implementation is focused on performance on M1 chips, and it looks like most of the Python frameworks do not support the M1 properly yet, so I cannot make a proper benchmark.

Additionally, my implementation can also run the "large" model on an Android phone (Samsung A52) - it would be interesting to see how this compares with existing implementations:

https://github.com/ggerganov/whisper.cpp/issues/18#issue-1395784900

8