Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.64

Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior

Abstract: Traditional (unstructured) pruning methods for a Transformer model focus on regularizing the individual weights by penalizing them toward zero. In this work, we explore spectral-normalized identity priors (SNIP), a structured pruning approach that penalizes an entire residual module in a Transformer model toward an identity mapping. Our method identifies and discards unimportant non-linear mappings in the residual connections by applying a thresholding operator on the function norm. It is applicable to any stru…
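To make the abstract's idea concrete, the sketch below shows one way a residual sub-layer could be collapsed to an identity mapping when its spectrally normalized non-linear branch contributes little. This is a minimal illustration assembled from the abstract alone, not the authors' implementation; the module names, the norm estimate, and the threshold value are all assumptions.

```python
import torch
import torch.nn as nn

class PrunableResidual(nn.Module):
    """Wraps a residual sub-layer y = x + f(x). If f contributes little
    (its output norm relative to the input falls below a threshold), the
    branch is dropped and the block becomes the identity mapping y = x."""

    def __init__(self, inner: nn.Module, threshold: float = 0.05):
        super().__init__()
        self.inner = inner          # the non-linear mapping f (e.g. an FFN)
        self.threshold = threshold  # illustrative value, not from the paper
        self.pruned = False

    @torch.no_grad()
    def maybe_prune(self, calibration_x: torch.Tensor) -> None:
        # Crude estimate of the "function norm" of f on a calibration batch.
        ratio = self.inner(calibration_x).norm() / (calibration_x.norm() + 1e-8)
        self.pruned = ratio.item() < self.threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x if self.pruned else x + self.inner(x)

# Spectral normalization keeps the branch's Lipschitz constant bounded, so a
# small output norm is a meaningful redundancy signal (illustrative wiring).
ffn = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(768, 3072)),
    nn.GELU(),
    nn.utils.spectral_norm(nn.Linear(3072, 768)),
)
block = PrunableResidual(ffn)
block.maybe_prune(torch.randn(8, 768))   # decide whether to keep the branch
```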

Cited by 20 publications (11 citation statements). References 46 publications (84 reference statements).
“…Note that we report the results of Tiny-BERT without data augmentation mechanism to ensure fairness. 2) structured pruning: the most standard First-order pruning (Molchanov et al. 2017) that CAP-f is based on, Top-drop (Sajjad et al. 2020), SNIP (Lin et al. 2020), and schuBERT (Khetan and Karnin 2020). 3) unstructured pruning: Magnitude pruning (Han et al. 2015), L0-regularization (Louizos, Welling, and Kingma 2018), and the state-of-the-art Movement pruning and Soft-movement pruning (Sanh, Wolf, and Rush 2020) that our CAP-m and CAP-soft are based on.…”
Section: Results
confidence: 99%
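Of the unstructured baselines listed in the quoted passage, magnitude pruning (Han et al. 2015) is the simplest: keep the largest-magnitude weights and zero out the rest. The sketch below is an illustrative mask-based formulation, not the cited implementation; the function name and sparsity value are assumptions.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a boolean mask keeping the (1 - sparsity) fraction of entries
    with the largest absolute value; the rest are pruned."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    # The k-th smallest absolute value acts as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold

w = torch.randn(3072, 768)
w = w * magnitude_mask(w, sparsity=0.8)   # roughly 80% of entries set to zero
```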
“…For structured pruning, some studies use the first-order Taylor expansion to calculate the importance scores of different heads and feed-forward networks based on the variation in the loss if we remove them (Molchanov et al. 2017; Michel, Levy, and Neubig 2019; Prasanna, Rogers, and Rumshisky 2020; Liang et al. 2021). Lin et al. (2020) prune modules whose outputs are very small. Although the above structured pruning methods are matrix-wise, there are also some studies focusing on layer-wise (Fan, Grave, and Joulin 2020; Sajjad et al. 2020), and row/column-wise (Khetan and Karnin 2020; Li et al. 2020).…”
Section: Background Model Compression
confidence: 99%
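The first-order Taylor criterion mentioned above estimates how much the loss would change if a unit's output were zeroed, using only the activation and its gradient. A minimal sketch of that estimate follows; the function name, the choice of reduction axes, and the assumption that the last dimension indexes the scored units are illustrative rather than taken from the cited papers.

```python
import torch

def taylor_importance(activation: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """First-order Taylor estimate of the loss change from removing each unit:
    |sum over non-unit axes of (activation * dL/d activation)|.
    Assumes the last axis indexes the units (heads/neurons) being scored."""
    grad, = torch.autograd.grad(loss, activation, retain_graph=True)
    reduce_dims = tuple(range(activation.dim() - 1))   # e.g. batch, sequence
    return (activation * grad).sum(dim=reduce_dims).abs()
```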
“…Recent works have followed two methodologies for defining the search space for the MHSA module: 1) searching for the number of heads in every distinct MHSA module [8, 25] and/or 2) searching for a common feature dimension size from a pre-defined discrete sample space for all attention heads in any particular MHSA module [6, 7, 42]. These methods have shown some solid results, but they are not completely flexible.…”
Section: Flexible MHSA and MLP Modules
confidence: 99%
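The two search axes described in the quoted passage can be written down as a simple configuration. The dictionary below is purely illustrative; none of the names, layer counts, or candidate values come from the cited works.

```python
# Hypothetical search space for a 12-layer Transformer: per-module head counts
# plus a shared per-head feature dimension drawn from a discrete set.
search_space = {
    "num_heads": {f"mhsa_{i}": [4, 8, 12] for i in range(12)},  # axis 1: heads per module
    "head_dim": [32, 48, 64],                                   # axis 2: common feature size
}
```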
“…As complex neural network-based models come to dominate the research on document ranking, it is unsurprising that there is renewed interest in the question above, not just in the information retrieval community but also in related branches such as natural language processing. Interestingly, many of the proposals put forward to date to contain efficiency are reincarnations of past ideas, such as stage-wise ranking with BERT-based models (Nogueira et al., 2019a; Matsubara et al., 2020), early-exit strategies in Transformers (Soldaini and Moschitti, 2020; Xin et al., 2020; Xin et al., 2021), neural connection pruning (Gordon et al., 2020; McCarley et al., 2021; Lin et al., 2020b; Liu et al., 2021), precomputation of representations (MacAvaney et al., 2020b), and enhancing indexes (Zhuang and Zuccon, 2022; Nogueira et al., 2019b; Mallia et al., 2022; Lassance and Clinchant, 2022). Other novel but general ideas such as knowledge distillation (Jiao et al., 2020; Sanh et al., 2020; Gao et al., 2020) have also proved effective in reducing the size of deep models.…”
Section: Dimension Definition Scope
confidence: 99%