2022
DOI: 10.48550/arxiv.2203.13483
Preprint

MKQ-BERT: Quantized BERT with 4-bits Weights and Activations

Abstract: Recently, pre-trained Transformer-based language models, such as BERT, have shown great superiority over traditional methods in many Natural Language Processing (NLP) tasks. However, the computational cost of deploying these models is prohibitive on resource-restricted devices. One method to alleviate this computational overhead is to quantize the original model into a representation with fewer bits, and previous work has proved that we can at most quantize both the weights and activations of BERT into 8 bits, without…
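
To make the idea in the abstract concrete, below is a minimal sketch of symmetric uniform quantization, which maps a floating-point tensor onto an n-bit integer grid and back. This is only an illustration of the general technique, not the paper's MKQ-BERT scheme; the function name and per-tensor scaling choice are assumptions.

```python
# Illustrative sketch of symmetric uniform (fake) quantization.
# Not the MKQ-BERT method; per-tensor scaling is an assumption here.
import numpy as np

def quantize_dequantize(x: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Map x onto a signed num_bits integer grid, then back to float."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4-bit signed
    scale = np.abs(x).max() / qmax            # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                          # dequantized approximation

weights = np.random.randn(4, 4).astype(np.float32)
print(quantize_dequantize(weights, num_bits=4))
```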

Cited by 2 publications (3 citation statements)
References 20 publications
“…Weight quantization involves mapping model weights to low-precision integers and floating-point numbers, making them more hardware-friendly for computation. Specifically, (Xiao et al., 2022) proposed 8-bit quantization for BERT; (Tang et al., 2022) explored how to use 4 bits to quantize BERT; (Bai et al., 2020; Tian et al., 2023) studied how to quantize BERT into 1 or 2 bits. Pruning, on the other hand, focuses on setting redundant parameters to zero to create a sparse network, enabling accelerated sparse matrix operations on specific hardware platforms.…”
Section: A Related Work
Mentioning, confidence: 99%
“…Wu et al. (2022) prove that even a binary network can result in only a small degradation if applying QAT with knowledge distillation (KD) (Hinton et al., 2014) and longer training, but the activations are quantized to INT8 (using INT8 computation, not INT4). Tang et al. (2022) are the first to claim to apply W4A4 to BERT for inference with QAT and KD. However, their quantization method enables W4A4 for only the last two layers of a four-layer TinyBERT model (otherwise causing drastic accuracy drops).…”
Section: Introduction
Mentioning, confidence: 99%
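
For readers unfamiliar with the QAT-plus-KD recipe referenced above, here is a minimal PyTorch sketch: weights and activations pass through a fake 4-bit quantizer with a straight-through estimator, and a quantized student is trained against a full-precision teacher's soft labels. This is my own illustration under simple assumptions (per-tensor scales, toy model sizes, temperature and loss weighting), not the exact setup of the cited papers.

```python
# Sketch of quantization-aware training (QAT) with knowledge distillation (KD).
# Assumptions: per-tensor 4-bit fake quantization, straight-through estimator,
# toy layer sizes, T=2.0 and alpha=0.5 chosen arbitrarily for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, num_bits=4):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients through unchanged.
        return grad_output, None

class QuantLinear(nn.Linear):
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, 4)   # 4-bit weights
        x_q = FakeQuant.apply(x, 4)             # 4-bit activations
        return F.linear(x_q, w_q, self.bias)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-label KL term (scaled by T^2) plus hard-label cross-entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a fake-quantized student distilled from a full-precision teacher.
student = nn.Sequential(QuantLinear(16, 8), nn.ReLU(), QuantLinear(8, 2))
teacher = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
x, y = torch.randn(4, 16), torch.randint(0, 2, (4,))
with torch.no_grad():
    t_logits = teacher(x)
loss = kd_loss(student(x), t_logits, y)
loss.backward()
```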
“…2: End-to-end inference time (ms) for running one layer of the BERT-base model with different batch sizes and sequence lengths on NVIDIA T4 GPUs. Columns 2 to 4 are numbers taken from Tang et al. (2022). FasterTransformer (FT) requires the sequence length to be a multiple of 32, so the inputs in parentheses are used to run FasterTransformer.…”
Mentioning, confidence: 99%