Submitted by faschu t3_1035xzs in MachineLearning
Is there a way to do quantization in native pytorch for GPUs (Cuda)?
I know that TensorRT offers this functionality, but I would prefer working with native pytorch code. I understand from the pytorch docs, https://pytorch.org/docs/stable/quantization.html, that quantization for the GPU is linked to TensorRT. Given that Nvidia GPUs have supported quantization for some time now, I find it difficult to believe that no solid implementation other than TensorRT exists. Grateful for any pointers or suggestions.
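For readers unfamiliar with what "quantization" means here: the scheme PyTorch's quantized tensors use is affine (scale/zero-point) int8 quantization. A minimal sketch of that mapping in pure Python (illustrative only, no torch dependency; function names are my own):

```python
# Illustrative sketch of affine (scale + zero-point) int8 quantization,
# the scheme PyTorch's quantized tensors are based on. Pure Python.

def qparams(xmin, xmax, qmin=-128, qmax=127):
    """Compute scale and zero-point for asymmetric int8 quantization."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)  # range must contain 0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x, scale, zp, qmin=-128, qmax=127):
    # Map a float to the nearest representable int8 value, clamped to range.
    return max(qmin, min(qmax, round(x / scale + zp)))

def dequantize(q, scale, zp):
    # Recover an approximate float from the int8 code.
    return (q - zp) * scale

vals = [-1.0, 0.0, 0.5, 2.0]
scale, zp = qparams(min(vals), max(vals))
roundtrip = [dequantize(quantize(v, scale, zp), scale, zp) for v in vals]
# Per-element round-trip error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(vals, roundtrip))
```

The point of the scale/zero-point pair is that matrix multiplies can then run in cheap int8 arithmetic, with the float range recovered afterwards.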
_Arsenie_Boca_ t1_j2xrj3r wrote
I'm not an expert here, but as far as I understand from the docs, quantization is not yet a mature feature.
I'm curious, what's the reason you don't want TensorRT?