Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.334
BinaryBERT: Pushing the Limit of BERT Quantization

Abstract: The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit by weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting…
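For orientation, the sketch below illustrates the kind of weight quantizers the abstract refers to: standard binary and ternary weight quantization with a single scaling factor per tensor. This is a generic PyTorch illustration under the usual binary-weight-network and ternary-weight-network conventions, not BinaryBERT's exact quantizer (which may, for instance, use finer-grained scaling); the function names are illustrative.

```python
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    # Binary-weight quantizer: map w to {-alpha, +alpha}, where
    # alpha = mean(|w|) minimizes the L2 error of the approximation.
    alpha = w.abs().mean()
    return alpha * torch.sign(w)

def ternarize(w: torch.Tensor, t: float = 0.7) -> torch.Tensor:
    # Ternary-weight quantizer: map w to {-alpha, 0, +alpha}. Weights whose
    # magnitude falls below a threshold (a fraction of mean(|w|)) are zeroed;
    # the surviving weights share one scaling factor alpha.
    delta = t * w.abs().mean()
    mask = (w.abs() > delta).to(w.dtype)
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return alpha * torch.sign(w) * mask
```

In quantization-aware training these quantizers are typically wrapped in a straight-through estimator so that gradients update latent full-precision weights; ternary weight splitting, as described in the abstract, then constructs two binary weight matrices whose sum reproduces the ternary weights, giving the binary model a well-initialized starting point.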

Citations: cited by 71 publications, with 94 citation statements (2 supporting, 92 mentioning, 0 contrasting).
References: 29 publications.
“…In this case, it potentially results in more carbon emission due to more training time on GPU servers, especially when training with larger models and datasets. Future works may consider combining gradient quantization [1] or utilize lower-bits quantization [3,49] to achieve better training efficiency. To understand the effect of different λ in Mesa, we conduct experiments with DeiT-Ti on CIFAR-100 and report the results in Table 9.…”
Section: Discussion (mentioning, confidence: 99%)
“…In Transformers, the majority of the literature belongs to the first category. For example, 8-bit [39,48] or even lower-bits [3] quantization has been proposed to speed up the inference. In contrast, this paper focuses on training Transformers from scratch.…”
Section: Introduction (mentioning, confidence: 99%)
“…For example, Bhandare et al. (2019) and Prato et al. (2020) showed that 8-bit quantization can successfully reduce the size of a Transformer-based model and accelerate inference without compromising translation quality. Recently, quantization has been applied on Transformer-based language models (Zafrir et al., 2019; Bai et al., 2020; …). Zafrir et al. (2019) first applied 8-bit quantization on BERT.…”
Section: Quantization (mentioning, confidence: 99%)
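As a point of reference for the 8-bit schemes cited above, here is a minimal sketch of symmetric per-tensor INT8 quantization and dequantization in PyTorch. It is a generic illustration rather than the exact procedure of the cited works (Q8BERT, for example, additionally applies quantization-aware fine-tuning); the function names and the 768x768 example weight are assumptions for the sketch.

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor INT8 quantization: one scale maps the float
    # range [-max|x|, max|x|] onto the integer range [-127, 127].
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate float tensor; the per-element quantization
    # error is bounded by scale / 2.
    return q.to(torch.float32) * scale

# Example: an INT8 copy of a weight matrix uses ~4x less memory than
# its float32 original, at the cost of a small approximation error.
w = torch.randn(768, 768)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
```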
“…Pruning and quantization are widely applied techniques for compressing deep neural networks prior to deployment, as compressed models require less memory, energy consumption and have lower inference latency (Esteva et al., 2017; Lane and Warden, 2018; Sun et al., 2020). To date, evaluating the merits and trade-offs incurred by compression has overwhelmingly centered on settings where the data is relatively abundant (Li et al., 2020a; Chen et al., 2021; Bai et al., 2020; ab Tessera et al., 2021).…”
Section: Introduction (mentioning, confidence: 99%)