diviramon t1_iydw5aq wrote
Reply to comment by CommunismDoesntWork in [R] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Massachusetts Institute of Technology and NVIDIA Guangxuan Xiao et al - Enables INT8 for LLM bigger than 100B parameters including OPT-175B, BLOOM-176B and GLM-130B. by Singularian2501
Yeah - a quick search showed some attempts on RN50 and MobileNet, but nothing on transformers (not surprising, since INT8 quantization for BERT is very hard). However, it seems like all the INT8 focus is shifting towards FP8, which should be more suitable for training as well.
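For context, a minimal sketch of what the inference-only INT8 post-training quantization being discussed looks like in PyTorch; the layer sizes here are illustrative and the exact API location can vary by PyTorch version:

```python
# Hedged sketch: post-training dynamic INT8 quantization of Linear layers.
# This is the inference-side quantization the comment refers to, not training.
import torch
import torch.nn as nn

# Toy stand-in for a transformer feed-forward block (sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768),
).eval()  # quantize for inference only

# Store Linear weights in INT8 and quantize activations on the fly.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 768])
```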
diviramon t1_iydklcc wrote
Reply to comment by zaptrem in [R] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Massachusetts Institute of Technology and NVIDIA Guangxuan Xiao et al - Enables INT8 for LLM bigger than 100B parameters including OPT-175B, BLOOM-176B and GLM-130B. by Singularian2501
Nope - see my answer below.
diviramon t1_iydkhtc wrote
Reply to comment by CommunismDoesntWork in [R] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Massachusetts Institute of Technology and NVIDIA Guangxuan Xiao et al - Enables INT8 for LLM bigger than 100B parameters including OPT-175B, BLOOM-176B and GLM-130B. by Singularian2501
Quantization only really works for inference. During training, the gradients are very sensitive to numerical precision, so FP32 is needed to compute them and for training to converge. I have not seen much training done in INT8.
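To illustrate the point about gradients, here is a hedged sketch of how low precision is typically used during training today: standard PyTorch mixed precision runs the forward/backward math in FP16 but keeps FP32 master weights and scales the loss so small gradients don't underflow. The model and loss here are placeholders, not from the paper:

```python
# Hedged sketch: mixed-precision training keeps FP32 where gradients need it.
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()            # FP32 master weights
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # guards against FP16 gradient underflow

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    with torch.cuda.amp.autocast():           # low-precision forward/backward math
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()             # loss scaled up before backprop
    scaler.step(optimizer)                    # unscales gradients, updates FP32 weights
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```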
diviramon t1_iyejg7z wrote
Reply to comment by CommunismDoesntWork in [R] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Massachusetts Institute of Technology and NVIDIA Guangxuan Xiao et al - Enables INT8 for LLM bigger than 100B parameters including OPT-175B, BLOOM-176B and GLM-130B. by Singularian2501
It is the new NVIDIA FP8 data type introduced with Hopper: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
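As a rough illustration of why FP8 is attractive versus INT8, the linked post describes two FP8 formats, E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). The sketch below estimates their maximum representable magnitudes; special-value handling is simplified and the helper is purely illustrative:

```python
# Hedged sketch: approximate dynamic range of the Hopper FP8 formats vs INT8.
def fp8_max(exp_bits: int, man_bits: int, reclaim_top_exponent: bool) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias      # top exponent normally reserved for inf/NaN
    mantissa = 2 - 2 ** -man_bits             # largest 1.xxx... mantissa
    if reclaim_top_exponent:                  # E4M3 reuses the top exponent, keeping only 1.111 for NaN
        max_exp += 1
        mantissa = 2 - 2 * 2 ** -man_bits
    return mantissa * 2 ** max_exp

print("E4M3 max ~", fp8_max(4, 3, True))   # ~448
print("E5M2 max ~", fp8_max(5, 2, False))  # ~57344
print("INT8 max  ", 127)
```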