LetterRip t1_jal4vgs wrote

Yep, or a mix between the two.

GLM-130B was quantized to int4; OPT and BLOOM to int8:

https://arxiv.org/pdf/2210.02414.pdf

Often you'll want to keep the first and last layers in int8 and can do everything else in int4. You can also choose bit widths based on each layer's sensitivity, etc. I also (vaguely) recall a mix of 8 bits for weights and 4 bits for biases (or vice versa?).
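
To make the first idea concrete, here's a minimal sketch (mine, not from the linked papers) of position-based mixed int8/int4 quantization, assuming simple symmetric per-tensor scaling; `quantize_symmetric` and the first/last-layer rule are just illustrative:

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization of w to signed integers of width `bits`."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8, 7 for int4
    scale = w.abs().max().clamp(min=1e-8) / qmax  # avoid divide-by-zero on all-zero tensors
    q = torch.round(w / scale).clamp(-qmax, qmax)
    # int4 values ride in int8 containers here; dequantization is q * scale
    return q.to(torch.int8), scale

def quantize_model(model: torch.nn.Module):
    """Quantize every Linear layer: first and last stay int8, the rest drop to int4."""
    layers = [(name, m) for name, m in model.named_modules()
              if isinstance(m, torch.nn.Linear)]
    out = {}
    for i, (name, layer) in enumerate(layers):
        bits = 8 if i in (0, len(layers) - 1) else 4
        out[name] = quantize_symmetric(layer.weight.data, bits)
    return out
```

A sensitivity-based variant would pick `bits` per layer from a measured metric (e.g. quantization error on a calibration batch) instead of position.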

Here is a survey on quantization methods; for mixed int8/int4, see Section IV, "ADVANCED CONCEPTS: QUANTIZATION BELOW 8 BITS":

https://arxiv.org/pdf/2103.13630.pdf

Here is a talk on auto48 (automatic mixed int4/int8 quantization)

https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41611/

londons_explorer t1_jam6oyr wrote

Aren't biases only a tiny, tiny fraction of the total memory usage? Is it even worth quantizing them more aggressively than the weights?
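
For scale (my back-of-envelope, with an illustrative hidden size): a square Linear layer of width d has d² weights but only d biases, so biases are about 1/d of its parameters:

```python
# A square Linear layer of width d: d*d weights vs. d biases.
d = 12288                                   # GPT-3-scale hidden size (illustrative)
weights, biases = d * d, d
print(f"biases are {biases / (weights + biases):.4%} of parameters")
# -> biases are 0.0081% of parameters
```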
