Remi_Coulom t1_izfukyo wrote on December 8, 2022 at 8:18 PM

Reply to comment by suflaj in What framework can I use to quantize a deep learning model to specific bit-widths? by MahmoudAbdAlghany

NVIDIA's tensor cores support 4-bit, 2-bit and 1-bit operation. I am very surprised no popular library takes advantage of this possibility. Here is a 3-year-old blog post about using 4-bit inference: https://developer.nvidia.com/blog/int4-for-ai-inference/

suflaj t1_izfw75x wrote on December 8, 2022 at 8:28 PM

They do, but the values are held in bigger registers, so unless you hand-optimize the code to pack several low-bit operations into each register, you get no benefit from it. At a minimum, that means writing your own CUDA kernels.
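
To make the packing point concrete, here is a minimal sketch in plain Python (not CUDA, and not how any particular library does it) of storing two signed 4-bit values in a single byte and recovering them. The helper names `pack_int4` and `unpack_int4` are made up for illustration. Without this kind of packing, each 4-bit value still occupies a full 8-bit slot, so nothing is saved.

```python
# Sketch: store two signed 4-bit integers in one byte, then recover them.
# pack_int4 / unpack_int4 are hypothetical helper names, not a library API.

def pack_int4(values):
    """Pack signed 4-bit ints (range -8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in 4 signed bits")
        # Mask to 4 bits (two's complement) and place hi in the upper nibble.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    """Inverse of pack_int4; count is the number of original values."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]

weights = [-8, 7, 3, -1, 5]
packed = pack_int4(weights)               # 5 values fit in 3 bytes
restored = unpack_int4(packed, len(weights))
```

A real kernel would do the equivalent with wider registers and vectorized instructions; the sketch only shows the bookkeeping that has to happen somewhere.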

Furthermore, 8-bit is already often too small to be stable. Why go lower? If you want garbage outputs, you could always fit whatever task you have on a smaller model. It's easier to cut the model size in half and use 8-bit, or cut it to a quarter and use 16-bit, than to make 4-bit or lower work.
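
As a rough illustration of why lower bit-widths get harder, the sketch below applies symmetric uniform quantization to a handful of made-up weights and measures the worst-case round-trip error at several bit-widths. This is a toy under stated assumptions (one scale per tensor, hypothetical function names, invented values), not the scheme of any specific framework, and it says nothing about the accuracy of a real model.

```python
# Sketch: symmetric per-tensor quantization at an arbitrary bit-width.
# quantize / dequantize / max_error are hypothetical names for illustration.

def quantize(values, bits):
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]

def max_error(values, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    q, scale = quantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, dequantize(q, scale)))

# Made-up weights, chosen only to have a mix of magnitudes.
weights = [0.02, -0.75, 0.31, 1.0, -0.004, 0.5]
errors = {bits: max_error(weights, bits) for bits in (16, 8, 4, 2)}
```

With these example values the error grows at every step down in bit-width, because the step size between representable values roughly doubles with each bit removed.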

Right now, TensorRT seems to be the best you'll get for the least involvement. Based on benchmarks, it also seems to outperform INT4 precision by a significant margin. The only drawback is its license, which implicitly prevents commercial use.

horselover_f4t t1_izibm6r wrote on December 9, 2022 at 8:54 AM

Can I ask what you mean by "implicitly prevents"?
https://github.com/NVIDIA/TensorRT/blob/main/LICENSE seems to permit commercial use. Are you referring to trademarks?

suflaj t1_izihg01 wrote on December 9, 2022 at 10:18 AM

That is only the code license for the open-source portion. TensorRT as a whole is proprietary software, and you also have to agree to its SDK license: https://docs.nvidia.com/deeplearning/tensorrt/sla/index.html

In there, ownership is phrased so ambiguously that a company's legal team would probably never greenlight using it.

horselover_f4t t1_izik5mz wrote on December 9, 2022 at 10:56 AM

I will have to check that out, thank you!