2021
DOI: 10.1162/tacl_a_00436

Differentiable Subset Pruning of Transformer Heads

Abstract: Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer’s multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term…
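The abstract describes removing individual heads from a Transformer's multi-head attention. As a point of reference only, below is a minimal PyTorch-style sketch of head pruning by masking head outputs; it illustrates the generic idea of pruning heads, not the paper's differentiable subset pruning method, and all class and method names are hypothetical.

```python
# Minimal sketch of pruning attention heads by masking their outputs.
# This shows the generic idea of head pruning only; it is NOT the
# differentiable subset pruning method proposed in the paper.
import torch
import torch.nn as nn

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One gate per head: 1.0 keeps the head, 0.0 prunes it.
        self.register_buffer("head_gate", torch.ones(n_heads))

    def prune_heads(self, heads_to_prune):
        for h in heads_to_prune:
            self.head_gate[h] = 0.0

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Reshape to (batch, n_heads, seq_len, d_head).
        def split(t):
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        ctx = attn @ v                               # (batch, n_heads, seq_len, d_head)
        # Zero out the contribution of pruned heads.
        ctx = ctx * self.head_gate.view(1, -1, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out(ctx)
```

For example, `mha = MaskedMultiHeadAttention(d_model=512, n_heads=8)` followed by `mha.prune_heads([1, 5])` leaves six active heads; because the gating preserves tensor shapes, the rest of the network is unchanged.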

Cited by 18 publications (9 citation statements)
References 23 publications
“…Only removing heads does not lead to large latency improvement; Li et al. (2021) demonstrate a 1.4× speedup with only one remaining head per layer.…”
Section: Pruning (mentioning)
confidence: 98%
“…Pruning methods: In this work we replaced the attention matrix with a constant one in order to measure the importance of the input-dependent ability. Works like Michel et al. (2019) and Li et al. (2021) pruned attention heads in order to measure their importance for the task examined. These works find that for some tasks, only a small number of unpruned attention heads is sufficient, and thus relate to the question of how much attention a PLM uses.…”
Section: Related Work (mentioning)
confidence: 99%
“…The top-k attention heads and hidden dimensions with the highest importance scores are kept. The implementations for STEP are borrowed from [35]. Aside from the baseline methods, we also compare our method with the previous pruning method VPT [67].…”
Section: Methods (mentioning)
confidence: 99%
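The statement above mentions keeping the top-k attention heads with the highest importance scores. Below is a minimal sketch of that selection step, assuming the per-head importance scores are already computed (e.g., by a gradient-based sensitivity measure); the function name and tensor shapes are illustrative, not taken from the cited works.

```python
# Hypothetical sketch: keep only the top-k heads per layer, ranked by a
# precomputed importance score. Computing the scores themselves is assumed
# to happen elsewhere (e.g., gradient-based head sensitivity).
import torch

def select_top_k_heads(importance: torch.Tensor, k: int) -> torch.Tensor:
    """importance: (n_layers, n_heads) head-importance scores.
    Returns a boolean mask of the same shape, True for heads that are kept."""
    n_layers, n_heads = importance.shape
    topk = importance.topk(k, dim=-1).indices      # (n_layers, k) kept-head indices
    mask = torch.zeros(n_layers, n_heads, dtype=torch.bool)
    rows = torch.arange(n_layers).unsqueeze(-1)    # broadcasts against topk columns
    mask[rows, topk] = True
    return mask

# Example: a 12-layer, 12-head model, keeping the 2 highest-scoring heads per layer.
scores = torch.rand(12, 12)
keep_mask = select_top_k_heads(scores, k=2)
print(keep_mask.sum(dim=-1))  # each layer keeps exactly 2 heads
```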