The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit by weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that BinaryBERT has only a slight performance drop compared with the full-precision model while being 24× smaller, achieving state-of-the-art compression results on the GLUE and SQuAD benchmarks.

Figure 2: Loss landscape visualization of the full-precision, ternary and binary models on MRPC. Panels: (a) Full-precision Model; (b) Ternary Model; (c) Binary Model; (d) All Together. For (a), (b) and (c), we perturb the (latent) full-precision weights of the value layer in the 1st and 2nd Transformer layers and compute the corresponding training loss. (d) shows the gap among the three surfaces by stacking them together.

Figure panels: (a) MHA-QK. (b) MHA-V. (c) MHA-O. (d) FFN-Mid. (e) FFN-Out.
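To make the splitting idea concrete, the following is a minimal sketch rather than the paper's construction: the coefficients `a` and `b` below are placeholders, whereas the actual method (as "equivalently splitting" implies) chooses them in closed form so that the two binarized halves also reproduce the ternary values exactly. The sketch only illustrates the latent-level constraint that the two split weights sum back to the ternary model's latent weight, so the doubled-width binary model can be initialized from the half-sized ternary one.

```python
import torch

def split_ternary_weight(w_latent, a=0.5, b=0.05):
    """Illustrative sketch of ternary weight splitting (placeholder
    coefficients, not the paper's closed-form values): split the latent
    full-precision weight of a ternary layer into two latent weights
    w1, w2 with w1 + w2 == w_latent."""
    # Entries the ternary quantizer would keep non-zero (common
    # threshold heuristic, used here only for illustration).
    delta = 0.7 * w_latent.abs().mean()
    nonzero = w_latent.abs() > delta

    # Non-zero ternary entries: split proportionally (a, 1 - a).
    # Zero ternary entries: split into (w + b, -b) so the halves cancel.
    w1 = torch.where(nonzero, a * w_latent, w_latent + b)
    w2 = torch.where(nonzero, (1.0 - a) * w_latent, -b * torch.ones_like(w_latent))

    assert torch.allclose(w1 + w2, w_latent)  # latent sums are preserved
    return w1, w2
```

The loss-landscape comparison in Figure 2 can be reproduced in spirit with a simple 2-D scan; the sketch below assumes a user-supplied `eval_loss()` callable that returns the training loss, and a multiplicative perturbation of the two chosen latent weight tensors, which may differ from the paper's exact perturbation scheme.

```python
import torch

@torch.no_grad()
def loss_surface(eval_loss, w1, w2, span=1.0, steps=21):
    """Perturb two (latent) weight tensors in place over a grid of
    relative offsets and record the loss at each grid point."""
    base1, base2 = w1.detach().clone(), w2.detach().clone()
    offsets = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)
    for i, x in enumerate(offsets):
        for j, y in enumerate(offsets):
            w1.data.copy_(base1 * (1 + x))
            w2.data.copy_(base2 * (1 + y))
            surface[i, j] = eval_loss()
    # Restore the original weights after the scan.
    w1.data.copy_(base1)
    w2.data.copy_(base2)
    return offsets, surface
```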