2022
DOI: 10.48550/arxiv.2203.13483
Preprint

MKQ-BERT: Quantized BERT with 4-bits Weights and Activations

Abstract: Recently, pre-trained Transformer-based language models, such as BERT, have shown great superiority over traditional methods in many Natural Language Processing (NLP) tasks. However, the computational cost of deploying these models is prohibitive on resource-restricted devices. One method to alleviate this computational overhead is to quantize the original model into a representation with fewer bits, and previous work has proved that we can at most quantize both the weights and activations of BERT into 8 bits, without…
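
To make the idea in the abstract concrete, below is a minimal sketch of symmetric uniform quantization, which maps a floating-point tensor onto an n-bit integer grid and back. This is only an illustration of the general technique, not the paper's MKQ-BERT scheme; the function name and per-tensor scaling choice are assumptions.

```python
# Illustrative sketch of symmetric uniform (fake) quantization.
# Not the MKQ-BERT method; per-tensor scaling is an assumption here.
import numpy as np

def quantize_dequantize(x: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Map x onto a signed num_bits integer grid, then back to float."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4-bit signed
    scale = np.abs(x).max() / qmax            # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                          # dequantized approximation

weights = np.random.randn(4, 4).astype(np.float32)
print(quantize_dequantize(weights, num_bits=4))
```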

Cited by 2 publications (3 citation statements)
References 20 publications
“…Weight quantization involves mapping model weights to low-precision integers and floating-point numbers, making them more hardware-friendly for computation. Specifically, (Xiao et al., 2022) proposed 8-bit quantization for BERT; (Tang et al., 2022) explored how to use 4 bits to quantize BERT; (Bai et al., 2020; Tian et al., 2023) studied how to quantize BERT into 1 or 2 bits. Pruning, on the other hand, focuses on setting redundant parameters to zero to create a sparse network, enabling accelerated sparse matrix operations on specific hardware platforms.…”
Section: A Related Work
Mentioning, confidence: 99%
“…Wu et al. (2022) prove that even a binary network can result in only a small degradation if applying QAT with knowledge distillation (KD) (Hinton et al., 2014) and longer training, but the activations are quantized to INT8 (using INT8 computation, not INT4). Tang et al. (2022) are the first to claim to apply W4A4 to BERT for inference with QAT and KD. However, their quantization method enables W4A4 for only the last two layers of a four-layer TinyBERT model (otherwise causing drastic accuracy drops).…”
Section: Introduction
Mentioning, confidence: 99%
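
For readers unfamiliar with the QAT-plus-KD recipe referenced above, here is a minimal PyTorch sketch: weights and activations pass through a fake 4-bit quantizer with a straight-through estimator, and a quantized student is trained against a full-precision teacher's soft labels. This is my own illustration under simple assumptions (per-tensor scales, toy model sizes, temperature and loss weighting), not the exact setup of the cited papers.

```python
# Sketch of quantization-aware training (QAT) with knowledge distillation (KD).
# Assumptions: per-tensor 4-bit fake quantization, straight-through estimator,
# toy layer sizes, T=2.0 and alpha=0.5 chosen arbitrarily for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, num_bits=4):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients through unchanged.
        return grad_output, None

class QuantLinear(nn.Linear):
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, 4)   # 4-bit weights
        x_q = FakeQuant.apply(x, 4)             # 4-bit activations
        return F.linear(x_q, w_q, self.bias)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-label KL term (scaled by T^2) plus hard-label cross-entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a fake-quantized student distilled from a full-precision teacher.
student = nn.Sequential(QuantLinear(16, 8), nn.ReLU(), QuantLinear(8, 2))
teacher = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
x, y = torch.randn(4, 16), torch.randint(0, 2, (4,))
with torch.no_grad():
    t_logits = teacher(x)
loss = kd_loss(student(x), t_logits, y)
loss.backward()
```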
“…2: End-to-end inference time (ms) for running one layer of the BERT-base model with different batch sizes and sequence lengths on NVIDIA T4 GPUs. Columns 2 to 4 are numbers taken from Tang et al. (2022). FasterTransformer (FT) requires the sequence length to be a multiple of 32, so the inputs in parentheses are used to run FasterTransformer.…”
Mentioning, confidence: 99%