Submitted by ggerganov t3_y0nvqu in MachineLearning
Recently, I have been having fun re-implementing the inference of various transformer models (GPT-2, GPT-J) in pure C/C++ in order to run them efficiently on a CPU.
The latest one that I ported is OpenAI Whisper for automatic speech recognition:
https://github.com/ggerganov/whisper.cpp
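For anyone wondering what the interface looks like, here is a minimal sketch against the C API exposed by whisper.h. This is a rough illustration only: the function names are taken from the header and may differ between versions, and the model path and silent audio buffer are placeholders.

```c
// Minimal sketch of the whisper.cpp C API: load a model, run the full
// encode/decode pipeline on a buffer of 16 kHz mono f32 PCM, print the text.
// Function names follow whisper.h but may change between versions;
// the model path and the silent audio buffer are placeholders.
#include <stdio.h>
#include "whisper.h"

int main(void) {
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en.bin");
    if (ctx == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    // one second of silence standing in for real microphone/file audio
    static float pcmf32[WHISPER_SAMPLE_RATE] = {0};

    if (whisper_full(ctx, wparams, pcmf32, WHISPER_SAMPLE_RATE) != 0) {
        fprintf(stderr, "failed to process audio\n");
        whisper_free(ctx);
        return 1;
    }

    // the result is split into timestamped segments
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i) {
        printf("%s\n", whisper_full_get_segment_text(ctx, i));
    }

    whisper_free(ctx);
    return 0;
}
```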
With the smaller models, the performance is good enough for real-time transcription. For example, here is a demonstration of live transcription of audio from the microphone:
[video] whisper.cpp running on a MacBook Pro M1 (CPU only)
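If you want to try this yourself, the repository includes a stream example for real-time microphone transcription; per the README it is invoked along the lines of ./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000 (check the current README for the exact flags).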
Hope you find this project interesting, and let me know if you have any questions about the implementation.
LetterRip t1_irt4luw wrote
You might check DeepSpeed MII, Facebook AITemplate, and Google XNNPACK to see how their CPU-optimized inference compares:
https://github.com/facebookincubator/AITemplate
https://github.com/microsoft/DeepSpeed-MII
https://github.com/google/XNNPACK