Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.829

Block Pruning For Faster Transformers

Abstract: Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods have proven effective for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning…
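For readers skimming the abstract, the sketch below illustrates the general idea of scoring and masking weights at block granularity during fine-tuning, in the spirit of movement pruning. It is a minimal illustration only; the block size, thresholding rule, and the class name BlockPrunedLinear are assumptions, not the paper's actual implementation.

# Minimal sketch of block-scored pruning for a linear layer (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockPrunedLinear(nn.Module):
    def __init__(self, in_features, out_features, block=32):
        super().__init__()
        assert in_features % block == 0 and out_features % block == 0
        self.block = block
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.xavier_uniform_(self.weight)
        # One learnable score per (block x block) tile; initialized slightly
        # positive so training starts from the dense model.
        self.scores = nn.Parameter(
            0.01 * torch.ones(out_features // block, in_features // block)
        )

    def forward(self, x, threshold=0.0):
        # Hard 0/1 mask per block; a straight-through estimator lets gradients
        # flow into the scores, so blocks whose weights "move" toward zero get cut.
        hard = (self.scores > threshold).float()
        soft = torch.sigmoid(self.scores)
        mask_blocks = hard + soft - soft.detach()
        # Expand the block mask to the full weight shape.
        mask = mask_blocks.repeat_interleave(self.block, 0).repeat_interleave(self.block, 1)
        return F.linear(x, self.weight * mask, self.bias)

layer = BlockPrunedLinear(128, 64, block=32)
out = layer(torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 64])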

Cited by 66 publications (55 citation statements)
References 14 publications

Citation statements (ordered by relevance):
“…Storing such sparse matrices does not lead to immediate gains and sparse matrix multiplication is not always faster, especially on GPUs (Gale et al., 2020). As such, other work considers structured pruning of entire rows or columns of the matrices, which makes it much easier to realize efficiency gains (Fan et al., 2021; Lagunas et al., 2021). We explore an alternative structured pruning approach, rank pruning.…”
Section: Pruning Methods (mentioning)
confidence: 99%
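As a concrete illustration of the point quoted above, the minimal sketch below contrasts structured row pruning, which genuinely shrinks the dense matrix multiply, with element-wise zeroing, which does not. Shapes and the norm-based selection criterion are illustrative assumptions.

# Hedged sketch: why row/column (structured) pruning realizes speedups that
# element-wise sparsity does not.
import torch

W = torch.randn(768, 768)           # a dense weight matrix
row_norms = W.norm(dim=1)           # score each output row
keep = row_norms.topk(384).indices  # keep the top half of the rows

W_small = W[keep]                   # (384, 768): a genuinely smaller dense matrix
x = torch.randn(16, 768)
y = x @ W_small.t()                 # dense GEMM on smaller shapes -> real speedup

# By contrast, zeroing 50% of individual entries leaves a (768, 768) matrix whose
# multiplication cost is unchanged unless specialized sparse kernels are used.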
“…However, sparsifying a matrix can lead to specialized hardware and algorithmic optimizations as demonstrated by sparse multiplication libraries (Gale et al., 2020). Lagunas et al. (2021) optimize element-wise unstructured pruning in a simple manner by removing entirely pruned rows, columns or attention heads. They show that even at high sparsities (more than 90%), this strategy achieves at most around a 1.5× speedup.…”
Section: Runtime Comparison (mentioning)
confidence: 99%
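The compaction step described in this quotation can be pictured with the short sketch below: after element-wise pruning, rows and columns that ended up entirely zero are physically removed. This is an illustrative sketch, not the authors' code, and it ignores the bookkeeping needed to slice adjacent layers consistently.

# Sketch of post-hoc compaction of a pruned weight matrix (illustrative only).
import torch

def compact(weight: torch.Tensor):
    """Drop all-zero rows and columns from a pruned 2-D weight matrix."""
    keep_rows = weight.abs().sum(dim=1) > 0
    keep_cols = weight.abs().sum(dim=0) > 0
    return weight[keep_rows][:, keep_cols], keep_rows, keep_cols

W = torch.randn(8, 8)
W[2] = 0.0       # a fully pruned row
W[:, 5] = 0.0    # a fully pruned column
W_dense, rows, cols = compact(W)
print(W_dense.shape)  # torch.Size([7, 7])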
“…Model compression and knowledge distillation present additional opportunities to improve inference performance further. While there are many ways of model compression, such as quantization [38,39,40] and pruning [41,42], our current efforts focus on layer reduction through knowledge distillation [43] (KD), reducing both model size and model computation while preserving the MoE structure in the student model. KD has been proven to be a successful way to compress a large model into a small one that contains far fewer parameters and computations while still obtaining competitive results.…”
Section: Mixture-of-Students: Distillation for Even Smaller Model Size (mentioning)
confidence: 99%
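For reference, a standard knowledge-distillation objective of the kind mentioned in this quotation can be written as below; the temperature T and mixing weight alpha are illustrative hyperparameters, and this is a generic formulation rather than the cited system's exact loss.

# Minimal knowledge-distillation loss: the student matches the teacher's
# softened outputs plus the hard task labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard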
“…As might be expected, the impact is dictated by the severity of the constraints. If the partitions are too small, or the blocks too large, accuracy becomes degraded to an unacceptable extent [40].…”
Section: Structured Sparsity (mentioning)
confidence: 99%
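To make the constraint concrete, the sketch below builds a mask that keeps a fixed number of weight blocks per partition, so shrinking the partitions or enlarging the blocks tightens the pattern in exactly the way the quotation describes. Block size, partition size, and the keep ratio are illustrative assumptions.

# Illustrative sketch of block sparsity inside fixed-size partitions.
import torch

def block_partition_mask(W, block=4, blocks_per_partition=4, keep_per_partition=1):
    out, inp = W.shape
    # Score each (block x block) tile by its L1 norm.
    tiles = W.reshape(out // block, block, inp // block, block)
    scores = tiles.abs().sum(dim=(1, 3))                  # (out//block, inp//block)
    # Group tiles along the input dimension into partitions and keep the top-k
    # tiles per partition; smaller partitions / larger blocks = tighter constraint.
    s = scores.reshape(scores.shape[0], -1, blocks_per_partition)
    topk = s.topk(keep_per_partition, dim=-1).indices
    mask = torch.zeros_like(s).scatter_(-1, topk, 1.0).reshape_as(scores)
    # Expand the tile mask back to the full weight shape.
    return mask.repeat_interleave(block, 0).repeat_interleave(block, 1)

W = torch.randn(16, 16)
M = block_partition_mask(W)            # keeps 1 of every 4 blocks per partition
print((M == 0).float().mean().item())  # 0.75 sparsity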
“…Block and partitioned sparsity help align the patterns of non-zero elements with hardware requirements, but are fundamentally at odds with creating highly sparse and accurate networks. Optimal performance requires large blocks and reduced partition sizes, but this limits both the obtainable sparsity and the accuracy [40]. This in turn prevents these approaches from achieving the theoretical performance benefits of highly sparse networks.…”
Section: Complementary Sparsity (mentioning)
confidence: 99%