2021
DOI: 10.48550/arxiv.2106.06955
Preprint

Towards Understanding Iterative Magnitude Pruning: Why Lottery Tickets Win

Jaron Maene, Mingxiao Li, Marie-Francine Moens

Abstract: The lottery ticket hypothesis states that sparse subnetworks exist in randomly initialized dense networks that can be trained to the same accuracy as the dense network they reside in. However, subsequent work has failed to replicate this on large-scale models and has instead required rewinding to an early, stable state rather than to initialization. We show that by using a training method that is stable with respect to linear mode connectivity, large networks can also be entirely rewound to initialization. Our subsequent e…
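The stability notion referenced in the abstract is commonly checked by interpolating linearly between two trained solutions and measuring how far the loss rises above the endpoints. Below is a minimal sketch of such a check; the evaluate_loss() helper and the parameter-dictionary format are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a linear mode connectivity check, assuming two sets of
# trained parameters (e.g. obtained from the same starting point with
# different SGD noise) and a hypothetical evaluate_loss(params) helper that
# returns the test loss for a given parameter dictionary.
import numpy as np

def linear_interpolation_barrier(params_a, params_b, evaluate_loss, steps=11):
    """Evaluate the loss along the straight line between two solutions.

    Two networks are considered linearly mode connected when the loss along
    this path does not rise much above the loss at the endpoints.
    """
    losses = []
    for alpha in np.linspace(0.0, 1.0, steps):
        interpolated = {
            name: (1.0 - alpha) * params_a[name] + alpha * params_b[name]
            for name in params_a
        }
        losses.append(evaluate_loss(interpolated))
    # The "error barrier" is how far the interpolation path rises above
    # the worse of the two endpoints.
    return max(losses) - max(losses[0], losses[-1])
```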

Cited by 3 publications (3 citation statements) | References 9 publications
“…1) Iterative Magnitude Pruning: Iterative Magnitude Pruning is a common approach that starts by training a dense network and subsequently removes weights based on a specific criterion, such as magnitude (absolute value) [16]. For optimal results, this process is typically repeated iteratively by alternating between weight pruning and network retraining.…”
Section: B. Pruning Strategies
confidence: 99%
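The prune-and-retrain loop described in the excerpt above can be summarized in a short sketch. The following PyTorch-style code is illustrative only: the train() helper, the per-round prune fraction, and the per-tensor masking scheme are assumptions, not the exact procedure of the cited work.

```python
# Minimal sketch of iterative magnitude pruning (IMP), assuming a PyTorch
# model and a hypothetical train(model, masks) helper that trains only the
# unmasked weights.
import torch

def iterative_magnitude_pruning(model, train, rounds=5, prune_fraction=0.2):
    # Keep a binary mask per weight tensor; 1 = kept, 0 = pruned.
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)  # retrain the surviving weights
        for name, param in model.named_parameters():
            alive = param[masks[name].bool()].abs()
            if alive.numel() == 0:
                continue
            # Prune the smallest-magnitude fraction of the remaining weights.
            threshold = torch.quantile(alive, prune_fraction)
            masks[name] = masks[name] * (param.abs() > threshold).float()
            param.data.mul_(masks[name])  # zero out the newly pruned weights
    return masks
```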
“…IMP is an effective method for reducing the size of neural networks and improving their efficiency without significant loss of accuracy. IMP has also been used to prune LLMs [16], but it has limitations such as retraining overhead, dense connections, and structural redundancy in the transformer architecture.…”
Section: B. Pruning Strategies
confidence: 99%
“…Parameters that are not in this low-dimensional subspace can, therefore, be removed with minimal impact. If a sparse DNN is initialized in this subspace (as late rewinding aims to do), then it may be possible for training to find the same, or related, local minima as the full DNN [13,27].…”
Section: Connection Between the RG and Standard LTH Framework
confidence: 99%
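As a rough illustration of the rewinding idea discussed in the excerpt above, the sketch below saves a copy of the weights after a few epochs, finishes dense training, builds a global magnitude mask, and then rewinds the surviving weights to the saved copy before retraining sparsely. The train_for_epochs() helper, the rewind epoch, and the global-threshold masking are assumptions for illustration, not the cited papers' exact setup.

```python
# Minimal sketch of weight rewinding to an early "stable" state, assuming a
# PyTorch model and a hypothetical train_for_epochs(model, epochs) helper.
import copy
import torch

def make_rewound_ticket(model, train_for_epochs, rewind_epoch=2,
                        total_epochs=50, prune_fraction=0.8):
    train_for_epochs(model, rewind_epoch)                  # short stabilizing phase
    rewind_state = copy.deepcopy(model.state_dict())       # weights to rewind to
    train_for_epochs(model, total_epochs - rewind_epoch)   # finish dense training
    # Build a global magnitude mask from the fully trained weights.
    # (torch.quantile may need chunking for very large models.)
    all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_weights, prune_fraction)
    masks = {n: (p.detach().abs() > threshold).float()
             for n, p in model.named_parameters()}
    # Rewind the surviving weights to their early values and retrain sparsely.
    model.load_state_dict(rewind_state)
    for name, param in model.named_parameters():
        param.data.mul_(masks[name])
    train_for_epochs(model, total_epochs)
    return model, masks
```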