Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.64

Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior

Abstract: Traditional (unstructured) pruning methods for a Transformer model focus on regularizing the individual weights by penalizing them toward zero. In this work, we explore spectral-normalized identity priors (SNIP), a structured pruning approach that penalizes an entire residual module in a Transformer model toward an identity mapping. Our method identifies and discards unimportant non-linear mappings in the residual connections by applying a thresholding operator on the function norm. It is applicable to any stru…
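To make the abstract's idea concrete, the sketch below shows one way a residual sub-layer could be collapsed to an identity mapping when its spectrally normalized non-linear branch contributes little. This is a minimal illustration assembled from the abstract alone, not the authors' implementation; the module names, the norm estimate, and the threshold value are all assumptions.

```python
import torch
import torch.nn as nn

class PrunableResidual(nn.Module):
    """Wraps a residual sub-layer y = x + f(x). If f contributes little
    (its output norm relative to the input falls below a threshold), the
    branch is dropped and the block becomes the identity mapping y = x."""

    def __init__(self, inner: nn.Module, threshold: float = 0.05):
        super().__init__()
        self.inner = inner          # the non-linear mapping f (e.g. an FFN)
        self.threshold = threshold  # illustrative value, not from the paper
        self.pruned = False

    @torch.no_grad()
    def maybe_prune(self, calibration_x: torch.Tensor) -> None:
        # Crude estimate of the "function norm" of f on a calibration batch.
        ratio = self.inner(calibration_x).norm() / (calibration_x.norm() + 1e-8)
        self.pruned = ratio.item() < self.threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x if self.pruned else x + self.inner(x)

# Spectral normalization keeps the branch's Lipschitz constant bounded, so a
# small output norm is a meaningful redundancy signal (illustrative wiring).
ffn = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(768, 3072)),
    nn.GELU(),
    nn.utils.spectral_norm(nn.Linear(3072, 768)),
)
block = PrunableResidual(ffn)
block.maybe_prune(torch.randn(8, 768))   # decide whether to keep the branch
```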

Cited by 20 publications (11 citation statements). References 46 publications (84 reference statements).
“…Note that we report the results of Tiny-BERT without data augmentation mechanism to ensure fairness. 2) structured pruning: the most standard First-order pruning (Molchanov et al. 2017) that CAP-f is based on, Top-drop (Sajjad et al. 2020), SNIP (Lin et al. 2020), and schuBERT (Khetan and Karnin 2020). 3) unstructured pruning: Magnitude pruning (Han et al. 2015), L0-regularization (Louizos, Welling, and Kingma 2018), and the state-of-the-art Movement pruning and Soft-movement pruning (Sanh, Wolf, and Rush 2020) that our CAP-m and CAP-soft are based on.…”
Section: Results
confidence: 99%
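Of the unstructured baselines listed in the quoted passage, magnitude pruning (Han et al. 2015) is the simplest: keep the largest-magnitude weights and zero out the rest. The sketch below is an illustrative mask-based formulation, not the cited implementation; the function name and sparsity value are assumptions.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a boolean mask keeping the (1 - sparsity) fraction of entries
    with the largest absolute value; the rest are pruned."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    # The k-th smallest absolute value acts as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold

w = torch.randn(3072, 768)
w = w * magnitude_mask(w, sparsity=0.8)   # roughly 80% of entries set to zero
```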
“…For structured pruning, some studies use the first-order Taylor expansion to calculate the importance scores of different heads and feed-forward networks based on the variation in the loss if we remove them (Molchanov et al. 2017; Michel, Levy, and Neubig 2019; Prasanna, Rogers, and Rumshisky 2020; Liang et al. 2021). Lin et al. (2020) prune modules whose outputs are very small. Although the above structured pruning methods are matrix-wise, there are also some studies focusing on layer-wise (Fan, Grave, and Joulin 2020; Sajjad et al. 2020), and row/column-wise (Khetan and Karnin 2020; Li et al. 2020).…”
Section: Background Model Compression
confidence: 99%
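The first-order Taylor criterion mentioned above estimates how much the loss would change if a unit's output were zeroed, using only the activation and its gradient. A minimal sketch of that estimate follows; the function name, the choice of reduction axes, and the assumption that the last dimension indexes the scored units are illustrative rather than taken from the cited papers.

```python
import torch

def taylor_importance(activation: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """First-order Taylor estimate of the loss change from removing each unit:
    |sum over non-unit axes of (activation * dL/d activation)|.
    Assumes the last axis indexes the units (heads/neurons) being scored."""
    grad, = torch.autograd.grad(loss, activation, retain_graph=True)
    reduce_dims = tuple(range(activation.dim() - 1))   # e.g. batch, sequence
    return (activation * grad).sum(dim=reduce_dims).abs()
```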
“…Recent works have followed two methodologies for defining the search space for the MHSA module: 1) searching for the number of heads in every distinct MHSA module [8, 25] and/or 2) searching for a common feature dimension size from a pre-defined discrete sample space for all attention heads in any particular MHSA module [6, 7, 42]. These methods have shown some solid results, but they are not completely flexible.…”
Section: Flexible MHSA and MLP Modules
confidence: 99%
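The two search axes described in the quoted passage can be written down as a simple configuration. The dictionary below is purely illustrative; none of the names, layer counts, or candidate values come from the cited works.

```python
# Hypothetical search space for a 12-layer Transformer: per-module head counts
# plus a shared per-head feature dimension drawn from a discrete set.
search_space = {
    "num_heads": {f"mhsa_{i}": [4, 8, 12] for i in range(12)},  # axis 1: heads per module
    "head_dim": [32, 48, 64],                                   # axis 2: common feature size
}
```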
“…As complex neural network-based models come to dominate the research on document ranking, it is unsurprising that there is renewed interest in the question above, not just in the information retrieval community but also in related branches such as natural language processing. Interestingly, many of the proposals put forward to date to contain efficiency are reincarnations of past ideas, such as stage-wise ranking with BERT-based models (Nogueira et al., 2019a; Matsubara et al., 2020), early-exit strategies in Transformers (Soldaini and Moschitti, 2020; Xin et al., 2020; Xin et al., 2021), neural connection pruning (Gordon et al., 2020; McCarley et al., 2021; Lin et al., 2020b; Liu et al., 2021), precomputation of representations (MacAvaney et al., 2020b), and enhancing indexes (Zhuang and Zuccon, 2022; Nogueira et al., 2019b; Mallia et al., 2022; Lassance and Clinchant, 2022). Other novel but general ideas such as knowledge distillation (Jiao et al., 2020; Sanh et al., 2020; Gao et al., 2020) have also proved effective in reducing the size of deep models.…”
Section: Dimension Definition Scope
confidence: 99%