Comments

suflaj t1_izfke61 wrote

There are none, unless you plan on emulating them, which you'd have to do yourself.

The available quantization widths correspond to what the hardware can actually do, and hardware is generally built around widths that are whole multiples of a byte.
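For anyone wondering what "emulating them yourself" could look like, here is a minimal sketch in plain Python. It assumes a symmetric, per-tensor scheme with a single scale factor, which is only one of several possible choices, and the name `fake_quantize` is made up for illustration. It rounds values to an arbitrary bit width but still stores them as ordinary Python numbers, so it shows the accuracy effect of a narrow width, not any speed or memory gain.

```python
def fake_quantize(values, bits=4):
    """Emulate signed symmetric quantization at an arbitrary bit width.

    Hypothetical helper: the integer codes are kept as normal Python ints,
    so this only simulates the rounding error of a `bits`-wide format.
    """
    qmax = 2 ** (bits - 1) - 1      # largest code, e.g. 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # smallest code, e.g. -8 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0

    # Round to the nearest code and clamp into the representable range.
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    # Map the codes back to floats to see what the model would "see".
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized


if __name__ == "__main__":
    codes, scale, approx = fake_quantize([-1.0, -0.3, 0.0, 0.6, 1.0], bits=4)
    print(codes)
    print(approx)
```

Comparing `approx` with the inputs at different values of `bits` gives a rough feel for how much precision each width throws away.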

4

suflaj t1_izfw75x wrote

They do, but they use bigger registers, so ultimately, unless you can hand-optimize it to pack operations together, you will get no benefit from it. That would, at the very least, mean writing your own CUDA kernels.
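To make the packing idea concrete, here is a rough sketch of the storage side only: two signed 4-bit values squeezed into each byte. The function names and the low-nibble-first layout are assumptions chosen for illustration, not any library's format. A real kernel would also have to do the arithmetic on the packed values, which is the hard part this sketch does not cover.

```python
def pack_int4(codes):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first.

    Hypothetical layout for illustration; real libraries may order
    nibbles differently.
    """
    codes = list(codes)
    if any(c < -8 or c > 7 for c in codes):
        raise ValueError("values must fit in a signed 4-bit range (-8..7)")
    if len(codes) % 2:
        codes.append(0)  # pad so every byte holds exactly two values
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        # Masking with 0xF keeps the two's-complement low 4 bits.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from bytes made by pack_int4."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


if __name__ == "__main__":
    data = [-8, 7, 3]
    blob = pack_int4(data)
    print(len(blob), unpack_int4(blob, len(data)))
```

Without this kind of packing, each 4-bit value would still occupy a full byte (or more), which is why there is no benefit by default.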

Furthermore, 8-bit is already often too small to be stable. Why go lower? If you want garbage outputs, you could always fit whatever task on a smaller model. It's easier to cut the model size in half and use 8-bit, or cut it to a quarter and use 16-bit, than to make 4-bit or lower work.
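The size trade-off is simple arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below uses a made-up 1-billion-parameter model and a hypothetical helper name, and it ignores activations, quantization scales, and any other overhead.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate weight storage in bytes, ignoring all overhead."""
    return num_params * bits_per_weight // 8


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to illustrate the comparison.
    full_int4 = weight_bytes(1_000_000_000, 4)       # full model at 4-bit
    half_int8 = weight_bytes(500_000_000, 8)         # half the model at 8-bit
    quarter_fp16 = weight_bytes(250_000_000, 16)     # a quarter at 16-bit
    print(full_int4, half_int8, quarter_fp16)
```

All three configurations land on the same weight footprint, so the question is only which of them is easiest to get working.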

Right now, TensorRT seems to be the best you'll get with as little involvement as possible. Based on benchmarks, it also seems to outperform INT4 precision by a significant margin. The only drawback is its license, which implicitly prevents commercial use.

1