pommedeterresautee
pommedeterresautee OP t1_j8m2odq wrote
Reply to comment by VP4770 in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
Thanks, after some searching I found that it's a French practice to use mn (instead of min), and it tends to be replaced by min, even in France.
For instance: https://www.larousse.fr/dictionnaires/francais/minute/51680
pommedeterresautee OP t1_j8ciq4d wrote
Reply to comment by master3243 in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
As written at the top of the post, unfortunately, the way OpenAI designed Whisper makes it non-compliant with PyTorch 2.0.
People at OpenAI said they will rework the package when PyTorch 2.0 is released. Then we will be able to optimize it.
pommedeterresautee OP t1_j7yfs4g wrote
Reply to comment by pommedeterresautee in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
forgot to cite the most important paper!!! https://arxiv.org/pdf/1903.07486.pdf
pommedeterresautee t1_j7uwa71 wrote
Reply to comment by Available_Lion_652 in [D] RTX 3090 with i7 7700k, training bottleneck by Available_Lion_652
At the start, the weights will be moved to the GPU. Then, during training, the tokenizer will convert your strings into int64 tensors. They are quite light, and those are moved to the GPU during training. What you need is not the fastest CPU but one that can feed your GPU faster than it consumes the data. In GPT-2's case, a CPU like the 7700 won't be an issue. Images or sound (TTS, ASR) may require more demanding preprocessing during training.
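To make the data-feeding side concrete, here is a minimal sketch (dummy token ids and shapes, not your actual dataset): a couple of DataLoader workers plus pinned host memory and non-blocking copies is usually enough to keep the GPU busy for text workloads.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy int64 token ids standing in for tokenizer output (hypothetical shapes)
dataset = TensorDataset(torch.randint(0, 50_000, (10_000, 512), dtype=torch.int64))
# A few workers + pinned memory is typically enough for text
loader = DataLoader(dataset, batch_size=8, num_workers=2, pin_memory=True)

device = torch.device("cuda")
for (batch,) in loader:
    # non_blocking copy from pinned memory can overlap with GPU compute
    batch = batch.to(device, non_blocking=True)
    # ... forward / backward pass on the GPU ...
```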
pommedeterresautee OP t1_j7uml76 wrote
Reply to comment by zzzthelastuser in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
lol unfortunately no, minutes :(
pommedeterresautee OP t1_j7uk761 wrote
Reply to comment by uzibart in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
I just discovered the project https://github.com/ggerganov/whisper.cpp
As written in another comment, there is no way for a (recent) CPU (even an ARM one) to be as fast as a (recent) GPU on such a big model (they list no GPU support among the limitations).
That being said, the project looks super cool, thanks for the pointer (I ordered an M2 Max, lots of fun to come :-) )
pommedeterresautee OP t1_j7ub975 wrote
Reply to comment by lpatks in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
The Nvidia docs, obviously (for SASS they're light), and also some old, very detailed blog posts (for shuffle instructions, etc.).
pommedeterresautee OP t1_j7u0p8z wrote
Reply to comment by blackkettle in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
Using CUDA graphs (CG) doesn't affect the output quality.
What works with Whisper will still work with CG+Whisper.
pommedeterresautee OP t1_j7tp663 wrote
Reply to comment by programmerChilli in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
I guess you know better than I do :-)
Which part? The dispatcher thing, or that it's spread over several steps?
pommedeterresautee OP t1_j7tk4fx wrote
Reply to comment by SnooHesitations8849 in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
On large DL models like Whisper large, a CPU is never on par with a GPU, because a CPU is latency-oriented hardware while a GPU is throughput-oriented. The only way large models run on CPUs is by reducing the number of operations to perform, e.g. through sparsification or pruning.
Moreover, PyTorch is mostly C++ with a Python layer on top (for now at least; PyTorch 2.0 may be the start of a change in this architecture). The Python layer brings most of the PyTorch latency.
And even a C++ engine launching operations on the GPU cannot be on par with CUDA graphs (most of the time at least), because you still have to send one instruction at a time, and there is some latency overhead associated with running things that way, just much less than with Python. With CUDA graphs there is almost none at all. There is a second thing not discussed here: the graph of instructions itself is optimized.
The main drawback of CUDA graphs is the memory overhead: you need at least double the space taken by the input tensors. On generative models with a K/V cache, it matters, as explained in this post. Plus you need to copy the input tensors, which offsets a very small part of the gains (at least that's what we saw in our tests on Whisper and Bert/Roberta).
That is why TensorRT (a big piece of C++), for instance, supports CUDA graphs.
Still, TBH, as you pointed out, the most important thing is that... it's easier to build and run :-)
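For reference, a minimal sketch of the capture/replay pattern with the plain PyTorch CUDA graph API (a toy linear layer, not Whisper; the static buffer names are my own): new data is copied into a fixed input buffer, then the whole recorded graph is launched in one go, which is exactly the copy overhead mentioned above.

```python
import torch

model = torch.nn.Linear(512, 512).cuda().eval()
static_input = torch.randn(8, 512, device="cuda")

# Warm-up on a side stream before capture (required by PyTorch)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: kernel launches are recorded into the graph instead of executed
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the static buffer, then launch every recorded kernel at once
static_input.copy_(torch.randn(8, 512, device="cuda"))
g.replay()
result = static_output.clone()
```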
pommedeterresautee t1_j289hxw wrote
Reply to comment by ThePerson654321 in [R] LAMBADA: Backward Chaining for Automated Reasoning in Natural Language - Google Research 2022 - Significantly outperforms Chain of Thought and Select Inference in terms of prediction accuracy and proof accuracy. by Singularian2501
Why? The improvement seems quite significant.
pommedeterresautee t1_izu96au wrote
Reply to comment by spaccetime in [D] Does Google TPU v4 compete with GPUs in price/performance? by Shardsmp
Why do you say TPU is not for experimental usage?
pommedeterresautee t1_iyza0g2 wrote
I know very little about CPUs, but I'm wondering why you think more cache would help?
Intuitively, I would think that would be the case if training were memory-bandwidth limited most of the time, but the issue with CPUs (vs GPUs) is that during training the model is compute bound.
pommedeterresautee OP t1_iuh0y8u wrote
Reply to comment by fakesoicansayshit in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
I think so, but I haven't tried it. It requires writing search/replace patterns.
pommedeterresautee t1_iuaodj2 wrote
Reply to comment by big_dog_2k in [D] How to get the fastest PyTorch inference and what is the "best" model serving framework? by big_dog_2k
Yes for Ampere.
For HF models, the kernels will work out of the box for most of them, but you need search/replace patterns for your specific architecture. That's why we don't have our own implementations of X and Y.
Check https://github.com/ELS-RD/kernl/blob/main/src/kernl/optimizer/linear.py for an example.
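This is not Kernl's actual API (check the linked file for that), but the general idea behind a search/replace pattern can be sketched like this, with a hypothetical FasterLinear wrapper standing in for a kernel-backed drop-in:

```python
import torch

class FasterLinear(torch.nn.Module):
    """Hypothetical drop-in that would dispatch to a custom (e.g. Triton) kernel."""
    def __init__(self, original: torch.nn.Linear):
        super().__init__()
        self.weight, self.bias = original.weight, original.bias

    def forward(self, x):
        # Placeholder: a real implementation would call the optimized kernel here
        return torch.nn.functional.linear(x, self.weight, self.bias)

def replace_linears(model: torch.nn.Module) -> None:
    """Walk the module tree and swap every nn.Linear for the optimized version."""
    for name, child in model.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(model, name, FasterLinear(child))
        else:
            replace_linears(child)
```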
pommedeterresautee t1_iuacchj wrote
Reply to comment by big_dog_2k in [D] How to get the fastest PyTorch inference and what is the "best" model serving framework? by big_dog_2k
To mitigate precision issues:
- on ONNX-related engines, we built a tool to check the output of each node and tag those that won't behave well in fp16 or bf16 (a rough sketch of the idea is below). Described here: https://www.reddit.com/r/MachineLearning/comments/uwkpmt/p_what_we_learned_by_making_t5large_2x_faster/
- on Kernl, we "just" understand what happens, as the code is simple (and we wrote it). We chose not to do terrible things to make inference faster: basically no approximations in our kernels, and accumulation is done in fp32 (it's even better than vanilla mixed precision, and still much faster). IMO that's the most robust approach...
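As a rough PyTorch-level analogue of that per-node check (not the actual ONNX tooling; the tolerance and the example layer are arbitrary), you can compare a module's fp32 and fp16 outputs and flag the ones that drift:

```python
import torch

def behaves_in_fp16(module: torch.nn.Module, example: torch.Tensor, atol: float = 1e-2) -> bool:
    """Return False if the fp16 output drifts too far from the fp32 reference."""
    module = module.eval().cuda()
    with torch.no_grad():
        reference = module(example.float().cuda())          # fp32 reference output
        half_out = module.half()(example.half().cuda()).float()  # same module run in fp16
    return torch.allclose(reference, half_out, atol=atol)

# Example: a LayerNorm fed a large-magnitude input (hypothetical shapes)
layer = torch.nn.LayerNorm(768)
print(behaves_in_fp16(layer, torch.randn(4, 768) * 100))
```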
pommedeterresautee t1_iu9zg8x wrote
Reply to [D] How to get the fastest PyTorch inference and what is the "best" model serving framework? by big_dog_2k
Hi, author of transformer deploy and Kernl here. Whatever option you choose, something to keep in mind next to speed is being able to maintain output precision. I can tell you it's our number one pain point, on both TensorRT and ONNX Runtime. We have even built some tooling to help with that; it helped, but it's not yet perfect. Triton Inference Server is really a cool option with good documentation.
pommedeterresautee t1_iu9vc6f wrote
Hi, I am one of the authors of transformer deploy. I have seen that you have copied most of the files for the transformer part. That's really cool, and I really appreciate that you kept the licenses; may I ask you to cite our work in the README?
Moreover, if I may, why did you copy the files instead of just importing the project as a dependency? You would get the maintenance for free :-)
pommedeterresautee OP t1_iu2vg8y wrote
Reply to comment by seek_it in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
Right now the kernels cover the linear layer, attention, and layer norm / RMS norm, so the effect would be limited outside of a transformer or similar architecture. We will keep adding kernels, but convolution is not our priority right now.
pommedeterresautee OP t1_itv3bu7 wrote
Reply to comment by programmerChilli in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
Yeah, it doesn't make sense to me either. Also, I was expecting a bit better speedup (compared to the numbers shared on the PyTorch dev forum). I tried several combinations of params (enabling the disabled optimizations), but they were either broken (e.g. the matmul ops template) or made things slower.
Scripts are here: https://github.com/ELS-RD/kernl/tree/main/experimental/benchmarks
Let me know if you find something suspicious.
pommedeterresautee OP t1_ituya3z wrote
Reply to comment by Lolologist in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
I have not used spaCy in years, but my understanding is that for large models it leverages the Hugging Face library (https://spacy.io/universe/project/spacy-transformers), so I would say it should work out of the box; the only thing needed is to catch the model instance and override it with the optimized version (it will take the very same input).
Maybe a redditor with more spaCy knowledge than I have can validate the approach...
pommedeterresautee OP t1_itu6k3n wrote
Reply to comment by Sylv__ in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
Thank you, if you try it, don't hesitate to share your feedback with us
pommedeterresautee OP t1_itu4mr6 wrote
Reply to comment by ganzzahl in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
>Why is it that we don't see any projects with similar speedups using custom CUDA kernels or custom ONNX operators?
To be honest, we had the very same question :-)
CUDA is powerful... and verbose. To target several generations of hardware you need deep knowledge of their characteristics. I have often followed people from Microsoft working on a PR implementing some new model; it frequently takes them a month or more. With TensorRT I suppose it's even harder, as they generate code, but hey, it's a black box. For best perf, handwritten CUDA can be good, but you need nvcc to generate the right set of PTX instructions to reach peak performance, which is not always the case from what I saw.
Fortunately, the people at Nvidia working on Cutlass are trying to make those things easier by taking care of the lowest levels of the CUDA implementation. The lib is not, right now, what you would call easy to grasp, but you really learn a lot by working with it (much more than starting from scratch, as you see the right way to implement things).
There are several reasons why you don't see more Triton:
- many people work with it, but not in OSS (Anthropic, OpenAI, etc.). You can tell from the issues and repo stars that the language has been growing faster and faster over the last few months
- the educational material... could be smoother: the first tutorial (adding 2 vectors; see the sketch below) is boringly simple, in the matmul one there is a block you need to stare at for long minutes to understand, and for fused attention it took us days to understand each line... and to realize that it was not really the Flash Attention paper (one of us implemented the paper, the other worked from the Triton example, and we argued for days about everything until we realized they were not parallelized at the same level...).
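For context, that first tutorial essentially boils down to something like this (a minimal sketch close to the official Triton vector-add example; the block size is arbitrary):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                            # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
    return out
```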
Things will change: PyTorch has chosen the Triton language as its default for compiling GPU models in a future PyTorch version (I guess version 1.14, not sure). More about it here -> https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747
There are certainly other reasons (like big corps not being able to rely on another big corp's tech without some guarantees, etc.), but I think those above are very important explanations.
To be honest, we have been very surprised by the speedups ourselves; beating TensorRT on long sequences was definitely far above our objectives. Even crazier when you consider we still have margin for more speedups... (e.g. we haven't yet tuned block sizes on some kernels, etc.)
Let's see where it brings us...
pommedeterresautee OP t1_ittyyn3 wrote
Reply to comment by sam__izdat in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
Kepler gen is a bit old, but we may increase hardware support in the future.
First, Triton is going through a big rewrite, and it's expected that some of the bugs that kept us from supporting older devices will be fixed; of course, nothing is 100% sure.
Moreover, we plan to (re)explore Cutlass, which supports at least Tesla hardware (but they said that their new work will only target >= Ampere devices).
pommedeterresautee OP t1_ja26tgi wrote
Reply to comment by stevevaius in [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl by pommedeterresautee
Our work targets GPUs with compute capability >= 8.0 (A10, A100, RTX 3090, etc.). On Colab you will likely get a T4 or similar (7.5). Your best bet is to copy-paste the CUDA graph-related code from the Kernl library and use it with a PyTorch 2.0 nightly.
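If you're not sure what Colab gave you, a quick way to check (plain PyTorch, only assumes CUDA is available):

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    print("Below 8.0 (e.g. T4 is 7.5): the Kernl kernels are not supported on this GPU.")
```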