Submitted by fxmarty t3_z1titt in MachineLearning
younesbelkada t1_ixdyvls wrote
Reply to comment by JahrudZ in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty
Because BetterTransformer merges the whole set of TransformerEncoderLayer operations into a single fused operation, which is called with the appropriate weights / biases at runtime.
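For intuition, here is a minimal sketch using PyTorch's stock `nn.TransformerEncoderLayer` (the sizes are arbitrary): in eval mode with autograd disabled, PyTorch can dispatch the whole layer to one fused "fast path" kernel instead of running attention, the layer norms and the MLP as separate ops. Whether the fast path actually triggers depends on conditions such as eval mode and no gradient tracking.

```python
import torch
import torch.nn as nn

# Stock encoder layer; in inference mode PyTorch may dispatch the whole
# forward pass to a single fused kernel rather than separate ops.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
layer.eval()

x = torch.randn(4, 128, 512)  # (batch, seq_len, d_model)
with torch.inference_mode():
    out = layer(x)  # eligible for the fused TransformerEncoderLayer op
print(out.shape)  # torch.Size([4, 128, 512])
```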
For int8, each linear layer is replaced by the linear layer from bitsandbytes, which is a bit special: at runtime it decomposes the matrix multiplication into two stages (an int8 part, plus a higher-precision part for outliers), executed with dedicated CUDA kernels. Since this logic is not embedded in the fused operation from PyTorch, these two options are mutually exclusive. Please read more about int8 models here: https://huggingface.co/blog/hf-bitsandbytes-integration
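To make the conflict concrete, here is a rough sketch of the kind of module swap int8 loading performs. `bnb.nn.Linear8bitLt` is the real bitsandbytes module, but the recursive helper `replace_linear_with_int8` is hypothetical, and weight copying / quantization is omitted for brevity:

```python
import torch.nn as nn
import bitsandbytes as bnb

def replace_linear_with_int8(module, threshold=6.0):
    # Hypothetical helper: recursively swap every nn.Linear for
    # bitsandbytes' Linear8bitLt. At runtime Linear8bitLt splits the
    # matmul into an int8 stage and a higher-precision stage for
    # outlier columns, using its own CUDA kernels, so it cannot live
    # inside PyTorch's fused encoder-layer op.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8_linear = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,
                threshold=threshold,
            )
            setattr(module, name, int8_linear)
        else:
            replace_linear_with_int8(child, threshold)
    return module
```

After such a swap, the weights live inside `Linear8bitLt` modules with their own custom forward, so there is no plain `nn.Linear` left for the fused PyTorch op to consume, which is why the two options cannot be combined.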