LetterRip t1_jal4vgs wrote
Reply to comment by cv4u in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Yep, or a mix between the two.
GLM-130B was quantized to int4, OPT and BLOOM to int8:
https://arxiv.org/pdf/2210.02414.pdf
Often you'll want to keep the first and last layer at int8 and can do everything else in int4. You can also choose the bit width based on each layer's sensitivity, etc. I also (vaguely) recall a mix of 8-bit for weights and 4-bit for biases (or vice versa?).
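For intuition, here's a minimal sketch of that kind of mixed int8/int4 scheme, assuming plain symmetric per-tensor quantization; the layer names and the first/last-layer-at-int8 policy are just illustrative, not taken from any of the papers below:

```python
# Minimal sketch of mixed int8/int4 weight quantization (illustrative only).
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization of a float tensor to `bits`-bit signed ints."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)  # int4 values stored in int8
    return q, scale

def dequantize(q: np.ndarray, scale: float):
    return q.astype(np.float32) * scale

def quantize_model(layers: dict):
    """Keep the first and last layers at int8, everything else at int4."""
    names = list(layers)
    out = {}
    for i, name in enumerate(names):
        bits = 8 if i in (0, len(names) - 1) else 4
        q, scale = quantize_symmetric(layers[name], bits)
        out[name] = (q, scale, bits)
    return out

# Toy example with three random "layers".
rng = np.random.default_rng(0)
layers = {f"layer{i}": rng.standard_normal((16, 16)).astype(np.float32) for i in range(3)}
for name, (q, scale, bits) in quantize_model(layers).items():
    err = np.abs(dequantize(q, scale) - layers[name]).mean()
    print(f"{name}: int{bits}, mean abs error {err:.4f}")
```

A real sensitivity-based scheme would pick the bit width per layer from a measured metric (e.g. quantization error or effect on loss) rather than just layer position.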
Here is a survey on quantization methods; for mixed int8/int4 see Section IV, "Advanced Concepts: Quantization Below 8 Bits":
https://arxiv.org/pdf/2103.13630.pdf
Here is a talk on auto48 (automatic mixed int4/int8 quantization):
https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41611/
londons_explorer t1_jam6oyr wrote
Aren't biases only a tiny, tiny fraction of the total memory usage? Is it even worth quantizing them more aggressively than the weights?
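For a sense of scale, a quick back-of-the-envelope check (using a hypothetical GPT-3-sized hidden dimension, so the numbers are only illustrative) suggests biases really are negligible:

```python
# A square linear projection has d_model * d_model weights but only d_model biases.
d_model = 12288                    # hypothetical GPT-3-scale hidden size
weights = d_model * d_model        # ~151M weight parameters in one projection
biases = d_model                   # ~12K bias parameters
print(biases / weights)            # ~8e-5, i.e. well under 0.01% of the parameters
```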
londons_explorer t1_jam6r8g wrote
Don't you mean the other way around?
tomd_96 t1_jamp6kt wrote
Where was this introduced?