
suflaj t1_izfke61 wrote

There are none, unless you plan on emulating them, which you'd have to do yourself.

The available quantization widths correspond to what the hardware can actually compute, and hardware is generally built around widths that are whole multiples of a byte.
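To make "emulating them" concrete, here is a minimal sketch of simulated quantization in plain Python. It assumes a symmetric, per-tensor scheme, and the function name `fake_quantize` is made up for this illustration; it is not taken from any library. The values are rounded onto a signed integer grid of the requested width and then mapped back to floats, so you can study the accuracy effect of an arbitrary bit width without any hardware support or speedup.

```python
def fake_quantize(values, bits):
    """Simulate signed symmetric quantization to `bits` bits (hypothetical helper).

    Returns (integer codes, dequantized floats, scale).
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits for a signed grid")
    qmax = 2 ** (bits - 1) - 1   # largest positive code, e.g. 7 for 4 bits
    qmin = -qmax - 1             # most negative code, e.g. -8 for 4 bits
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude onto qmax; fall back to 1.0 for all-zero input.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return codes, dequantized, scale


codes, approx, scale = fake_quantize([0.0, 14.0, -6.2], bits=4)
print(codes, approx, scale)
```

The arithmetic still runs in ordinary floats here, which is the point of the first comment: arbitrary widths can be emulated for experiments, but they only run faster if the hardware has a matching instruction.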

4

Remi_Coulom t1_izfukyo wrote

NVIDIA's tensor cores support 4-bit, 2-bit and 1-bit operations. I am very surprised that no popular library takes advantage of this. Here is a 3-year-old blog post about 4-bit inference: https://developer.nvidia.com/blog/int4-for-ai-inference/

2

suflaj t1_izfw75x wrote

They do, but they operate on bigger registers, so unless you can hand-optimize your code to pack several operations into one register, you will get no benefit from it. At a minimum, that implies writing your own CUDA kernels.
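As a rough illustration of what "packing" means, the sketch below stores two signed 4-bit values in each byte and unpacks them again. It is plain Python, not a CUDA kernel, and the helper names `pack_int4` and `unpack_int4` are hypothetical; a real kernel would do the same bit manipulation on 32-bit registers and feed the packed words to the tensor-core instructions.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad with a zero so every byte holds two values
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        # Mask to 4 bits (two's complement) and place hi in the upper nibble.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from bytes made by pack_int4."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


data = [1, -2, 7, -8, 3]
packed = pack_int4(data)
print(packed.hex(), unpack_int4(packed, len(data)))
```

Five values fit in three bytes here instead of five. The storage saving comes for free, but the compute saving only appears if the kernel operates on the packed form directly.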

Furthermore, 8-bit is already often too small to be stable. Why go lower? If you want garbage outputs, you could always fit the task to a smaller model. It's easier to halve the model size and use 8-bit, or quarter it and use 16-bit, than to make 4-bit or lower work.
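The size trade-off in that last sentence is simple arithmetic, sketched below. The parameter count of one billion is an arbitrary assumption for illustration, and `weight_bytes` is a hypothetical helper that counts weight storage only (no activations, no quantization scales).

```python
def weight_bytes(n_params, bits):
    """Bytes needed to store n_params weights at the given bit width."""
    return (n_params * bits + 7) // 8  # round up to whole bytes


full = 1_000_000_000  # assumed parameter count, for illustration only

options = {
    "full model, 4-bit": weight_bytes(full, 4),
    "half model, 8-bit": weight_bytes(full // 2, 8),
    "quarter model, 16-bit": weight_bytes(full // 4, 16),
}
for name, size in options.items():
    print(f"{name}: {size / 1e6:.0f} MB")
```

All three options need the same weight storage, which is why shrinking the model at a well-supported precision is a direct alternative to pushing a larger model down to 4 bits.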

At this point, TensorRT seems to be the best you'll get with as little effort as possible. Based on benchmarks, it also seems to outperform INT4 precision by a significant margin. The only drawback is its license, which implicitly prevents commercial use.

1

horselover_f4t t1_izibm6r wrote

Can I ask what you mean by "implicitly prevents"?
https://github.com/NVIDIA/TensorRT/blob/main/LICENSE seems to permit commercial use. Are you referring to trademarks?

1

suflaj t1_izihg01 wrote

That is only the code license for the open-source portion. TensorRT as a whole is proprietary software, and you also have to agree to its SDK license: https://docs.nvidia.com/deeplearning/tensorrt/sla/index.html

In there, ownership is phrased so ambiguously that a company's legal team would probably never greenlight using it.

2