2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.01193

When to Prune? A Policy towards Early Structural Pruning

Abstract: Data often arrives in sequence over time in real-world deep learning applications such as autonomous driving. When new training data is available, training the model from scratch undermines the benefit of leveraging the learned knowledge, leading to significant training costs. Warm-starting from a previously trained checkpoint is the most intuitive way to retain knowledge and advance learning. However, existing literature suggests that this warm-starting degrades generalization. In this paper, we advocate for wa…

Cited by 31 publications (8 citation statements)
References 23 publications
“…DNNShifter is primarily limited by the high computation cost of training sparse models. There is potential for structured pruning to be conducted at the initialisation of the model (before training) with minimal accuracy loss [49,50]. This will be explored in the future.…”
Section: Discussion
confidence: 99%
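The pruning-at-initialisation idea referenced above can be illustrated with a small sketch. The snippet below scores the output channels of a freshly initialised convolution by the L1 norm of their weights and keeps only the top fraction; the magnitude criterion and the keep ratio are illustrative assumptions, not the specific procedures of [49] or [50].

```python
import torch
import torch.nn as nn

def prune_conv_at_init(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Structurally prune output channels of a freshly initialised conv layer.

    Channels are ranked by the L1 norm of their (still random) initial weights
    and only the top `keep_ratio` fraction is kept. This is a toy stand-in for
    pruning-at-initialisation criteria, not the method of the cited works.
    """
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    # L1 norm of each output-channel filter: shape (out_channels,)
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep_idx = torch.topk(scores, n_keep).indices.sort().values

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep_idx])
    return pruned

# Example: shrink a 64-channel conv to 32 channels before any training happens.
layer = nn.Conv2d(3, 64, kernel_size=3, padding=1)
slim = prune_conv_at_init(layer, keep_ratio=0.5)
print(slim.weight.shape)  # torch.Size([32, 3, 3, 3])
```

Because the slimmer layer exists before the first gradient step, all subsequent training runs on the reduced architecture, which is where the training-cost saving comes from.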
“…Li et al [41] determined channel configuration using a random search. Shen et al [42] pruned channels globally based on magnitude and gradient criteria. Unlike pruning-only methods, Hou et al [43] proposed a pruning-and-regrowing method to avoid removing important channels.…”
Section: Related Work
confidence: 99%
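As a rough sketch of what a magnitude-and-gradient channel criterion can look like (the exact criteria of [42] are not reproduced here), one common form scores each convolution channel by a first-order |weight × gradient| saliency and then prunes the globally lowest-scoring channels across all layers:

```python
import torch
import torch.nn as nn

def global_channel_scores(model: nn.Module) -> dict:
    """Score every conv output channel by a |w * dL/dw| saliency.

    Assumes a backward pass has already populated `.grad`. This is an
    illustrative magnitude-and-gradient style criterion, not the exact
    formulation of the cited work.
    """
    scores = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d) and module.weight.grad is not None:
            # Sum the saliency over each output-channel filter.
            saliency = (module.weight * module.weight.grad).abs().sum(dim=(1, 2, 3))
            scores[name] = saliency.detach()
    return scores

def channels_to_prune(scores: dict, prune_fraction: float = 0.3) -> dict:
    """Pick the globally lowest-scoring channels across all scored layers."""
    all_scores = torch.cat(list(scores.values()))
    k = max(1, int(len(all_scores) * prune_fraction))
    threshold = torch.kthvalue(all_scores, k).values
    return {name: (s <= threshold).nonzero(as_tuple=True)[0]
            for name, s in scores.items()}
```

Ranking channels against a single global threshold, rather than per layer, is what lets thin layers keep more channels and wide layers give up more.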
“…Given N tokens of dimension d corresponding to the image patches, the self-attention that correlates every pair of the N tokens incurs O(N²d) complexity in a single update round. For deploying Transformers on edge devices, a variety of simplified models have been proposed to reduce parameters and operations, for example parameter pruning [12,28], low-rank factorization [38], and knowledge distillation [24,35]. Yet these acceleration strategies are limited in that they still rely on CNNs, which deviates from the original design of the Transformer, namely facilitating deep learning with a working mechanism other than the CNN.…”
Section: Related Work
confidence: 99%
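The quadratic term in that complexity comes from the N × N attention matrix relating every pair of tokens. A minimal sketch of plain scaled dot-product self-attention (single head, no learned projections, not any particular efficient variant) makes the cost explicit:

```python
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Plain scaled dot-product self-attention over N tokens of dimension d.

    The (N, N) score matrix is what gives the O(N^2 * d) cost that motivates
    pruning and distillation of Transformers for edge deployment.
    """
    n, d = x.shape
    q, k, v = x, x, x                          # single head, no projections, for clarity
    scores = q @ k.transpose(0, 1) / d ** 0.5  # (N, N): N^2 dot products of length d
    weights = torch.softmax(scores, dim=-1)    # (N, N) attention weights
    return weights @ v                         # (N, d): another O(N^2 * d) matmul

tokens = torch.randn(196, 64)  # e.g. 14x14 image patches with d = 64
out = self_attention(tokens)
print(out.shape)               # torch.Size([196, 64])
```

Doubling the number of patches quadruples the size of the score matrix, which is why token count, not just channel width, dominates Transformer inference cost on edge devices.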