Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.37

TernaryBERT: Distillation-aware Ultra-low Bit BERT

Abstract: Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to resource-constrained devices. In this work, we propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model. Specifically, we use both approximation-based and loss-aware ternarization methods and empirically investigate the ternarization granularity of different parts of BERT…
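The abstract refers to approximation-based ternarization of BERT's weights. As a rough illustration, and not necessarily the paper's exact procedure, a TWN-style approximation picks a threshold from the mean absolute weight and a single scaling factor for the surviving weights; the 0.7 threshold factor below is the common TWN heuristic and is assumed here.

```python
# Minimal sketch of approximation-based weight ternarization (TWN-style).
# The 0.7 threshold heuristic and the single scaling factor are assumptions
# for illustration, not a verbatim reproduction of TernaryBERT.
import torch

def ternarize(w: torch.Tensor):
    """Approximate w by alpha * t with t in {-1, 0, +1}."""
    delta = 0.7 * w.abs().mean()                      # threshold (assumed heuristic)
    mask = (w.abs() > delta).float()
    t = torch.sign(w) * mask                          # ternary codes {-1, 0, +1}
    # Scaling factor: mean magnitude of the weights kept non-zero.
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return alpha, t

w = torch.randn(768, 768)          # e.g. one Transformer weight matrix
alpha, t = ternarize(w)
w_hat = alpha * t                  # ternary approximation used in the forward pass
```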

Cited by 100 publications (109 citation statements)
References 22 publications
“…While BinaryBERT focuses on weight binarization, we also explore activation quantization in our implementation, which is beneficial for reducing the computation burden on specialized hardware (Hubara et al., 2016; Zhou et al., 2016; Zhang et al., 2020). Aside from the 8-bit uniform quantization (Zhang et al., 2020) used in past efforts, we further pioneer the study of 4-bit activation quantization. We find that uniform quantization can hardly deal with outliers in the activation.…”
Section: Methods
confidence: 99%
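The excerpt above notes that uniform quantization struggles with activation outliers. A minimal sketch, assuming symmetric max-abs uniform quantization (the 4-bit width and the synthetic outlier are chosen purely for illustration), shows why: a single large activation inflates the quantization step and wastes resolution on the bulk of the values.

```python
import torch

def uniform_quantize(x: torch.Tensor, bits: int = 4):
    """Symmetric uniform quantization using the max-abs range (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 levels each side for 4-bit
    scale = x.abs().max() / qmax               # a single outlier dominates this scale
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q * scale                           # dequantized values

x = torch.randn(1024)                          # typical near-Gaussian activations
x[0] = 50.0                                    # one synthetic outlier
x_q = uniform_quantize(x, bits=4)
# The step size becomes 50 / 7 ≈ 7.1, far coarser than the activation bulk,
# so the error on the non-outlier values is large.
print((x[1:] - x_q[1:]).abs().mean())
```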
“…Quantization Details. Following (Zhang et al., 2020), for each weight matrix in the Transformer layers, we use layer-wise ternarization (i.e., one scaling parameter for all elements in the weight matrix). For word embedding, we use row-wise ternarization (i.e., one scaling parameter for each row in the embedding).…”
Section: Ternary Weight Splitting
confidence: 99%
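The excerpt distinguishes layer-wise ternarization (one scaling parameter per weight matrix) from row-wise ternarization (one scaling parameter per embedding row). A minimal sketch of the two granularities, reusing the TWN-style 0.7 threshold as an assumption rather than the paper's exact setting:

```python
import torch

def ternarize_layerwise(w: torch.Tensor):
    """One scaling factor for the whole matrix (Transformer-layer weights)."""
    delta = 0.7 * w.abs().mean()
    t = torch.sign(w) * (w.abs() > delta).float()
    alpha = (w.abs() * t.abs()).sum() / t.abs().sum().clamp(min=1.0)
    return alpha * t

def ternarize_rowwise(w: torch.Tensor):
    """One scaling factor per row (word-embedding matrix)."""
    delta = 0.7 * w.abs().mean(dim=1, keepdim=True)
    t = torch.sign(w) * (w.abs() > delta).float()
    alpha = (w.abs() * t.abs()).sum(dim=1, keepdim=True) \
            / t.abs().sum(dim=1, keepdim=True).clamp(min=1.0)
    return alpha * t

emb_q = ternarize_rowwise(torch.randn(30522, 768))   # embedding: per-row scales
w_q   = ternarize_layerwise(torch.randn(768, 768))   # attention weight: one scale
```

Intuitively, per-row scales let each token embedding keep its own dynamic range, whereas a single scale suffices for a dense Transformer weight matrix.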
“…Researchers have made various attempts to accelerate the inference of PLMs, such as quantization (Shen et al., 2020; Zhang et al., 2020a), attention head pruning (Michel et al., 2019), dimension reduction (Sun et al., 2020; Chen et al., 2020), and layer reduction (Sanh et al., 2019; Sun et al., 2019b; Jiao et al., 2019). In current studies, one of the mainstream methods is to dynamically select the number of Transformer layers to make an on-demand lighter model (Fan et al., 2020).…”
Section: Related Work
confidence: 99%