Lottery Tickets in Linear Models: An Analysis of Iterative Magnitude Pruning

Elesedy, Bryn; Kanade, Varun; Teh, Yee Whye

doi:10.48550/arxiv.2007.08243

Cited by 7 publications

(8 citation statements)

References 7 publications


“…While dynamical systems theory is a natural language with which to frame DNN optimization, the complex dependence on optimizer, architecture, activation function, and training data has historically kept efforts in this direction to a minimum. This has led to a reliance on heuristic methods, such as iterative magnitude pruning, whose basis of success is still not clear [8]. Other groups have attempted to more principally examine DNN behavior by studying mathematical objects, such as the spectrum of the Hessian matrix [12] and the spectrum of the principal orthogonal decomposition [24].…”

Section: Discussion (mentioning)

confidence: 99%

An Operator Theoretic View on Pruning Deep Neural Networks

Redman, Fonoberova, Mohr et al. 2021

Preprint


The discovery of sparse subnetworks that are able to perform as well as full models has found broad applied and theoretical interest. While many pruning methods have been developed to this end, the naïve approach of removing parameters based on their magnitude has been found to be as robust as more complex, state-of-the-art algorithms. The lack of theory behind magnitude pruning's success, especially pre-convergence, and its relation to other pruning methods, such as gradient based pruning, are outstanding open questions in the field that are in need of being addressed. We make use of recent advances in dynamical systems theory, namely Koopman operator theory, to define a new class of theoretically motivated pruning algorithms. We show that these algorithms can be equivalent to magnitude and gradient based pruning, unifying these seemingly disparate methods, and that they can be used to shed light on magnitude pruning's performance during early training.


“…where $\bar{W}$ is the mean of $W$ with entries $\bar{W}_i = \lambda^0_i w^P_i$ and $A(P)$ is the feature alignment vector for the data $S_P$, with entries $A_i(P) = \frac{1}{|X_P|} \psi_i(X_P)^T Y_P$, i.e., $A_i(P)$ measures how closely feature $i$ is aligned to the outputs $Y_P$ [20]. Then $E[W^T \Sigma_P W] = \mathrm{Tr}(\Gamma \Sigma_P)$, where…”

Section: Linear Model Analysis (mentioning)

confidence: 99%

“…Recent work by Elesedy, Kanade, and Teh [20] has shown that, in the context of linear models, magnitude pruning zeros out the weights based on the magnitude of feature alignment under certain assumptions on the feature covariance matrix. Our analysis is complementary to these findings.…”

Section: Related Work (mentioning)

confidence: 99%

Probabilistic fine-tuning of pruning masks and PAC-Bayes self-bounded learning

Hayou, He, Dziugaite 2021

Preprint


We study an approach to learning pruning masks by optimizing the expected loss of stochastic pruning masks, i.e., masks which zero out each weight independently with some weight-specific probability. We analyze the training dynamics of the induced stochastic predictor in the setting of linear regression, and observe a data-adaptive L1 regularization term, in contrast to the data-adaptive L2 regularization term known to underlie dropout in linear regression. We also observe a preference to prune weights that are less well-aligned with the data labels. We evaluate probabilistic fine-tuning for optimizing stochastic pruning masks for neural networks, starting from masks produced by several baselines (namely, magnitude pruning [1], SNIP [2], and random masks). In each case, we see improvements in test error over baselines, even after we threshold fine-tuned stochastic pruning masks. Finally, since a stochastic pruning mask induces a stochastic neural network, we consider training the weights and/or pruning probabilities simultaneously to minimize a PAC-Bayes bound on generalization error. Using data-dependent priors [3], we obtain a self-bounded learning algorithm with strong performance and numerically tight bounds. In the linear model, we show that a PAC-Bayes generalization error bound is controlled by the magnitude of the change in feature alignment between the "prior" and "posterior" data.
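The link between magnitude pruning and feature alignment mentioned in the citation statements above can be illustrated numerically. The following is a minimal sketch, not the method of any of the listed papers: it assumes an orthogonal design with X^T X / n = I (one simple instance of a condition on the feature covariance) and refits by least squares rather than gradient descent. The function names, toy data, and pruning schedule are all hypothetical.

```python
import numpy as np


def feature_alignment(X, y):
    """Alignment of each feature column with the targets: A_i = x_i^T y / n."""
    return X.T @ y / X.shape[0]


def refit(X, y, mask):
    """Least-squares fit restricted to the unpruned columns; pruned weights stay 0."""
    w = np.zeros(X.shape[1])
    w[mask] = np.linalg.lstsq(X[:, mask], y, rcond=None)[0]
    return w


def iterative_magnitude_pruning(X, y, n_rounds, prune_per_round):
    """Alternate refitting with pruning the smallest-magnitude surviving weights."""
    mask = np.ones(X.shape[1], dtype=bool)  # True = weight still alive
    for _ in range(n_rounds):
        w = refit(X, y, mask)
        alive = np.flatnonzero(mask)
        # Surviving indices sorted by |w|, smallest first; drop the first few.
        smallest = alive[np.argsort(np.abs(w[alive]))][:prune_per_round]
        mask[smallest] = False
    return mask, refit(X, y, mask)


# Hypothetical toy data: orthogonal design scaled so that X^T X / n = I.
rng = np.random.default_rng(0)
n, d = 200, 10
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

mask, w = iterative_magnitude_pruning(X, y, n_rounds=3, prune_per_round=2)
A = feature_alignment(X, y)
# Features that would be kept if we ranked directly by |alignment|.
kept_by_alignment = np.argsort(np.abs(A))[-mask.sum():]
print(sorted(np.flatnonzero(mask)) == sorted(kept_by_alignment))
```

Under the orthogonality assumption the refitted weights equal the alignments of the surviving features, so ranking by weight magnitude and ranking by alignment magnitude select the same features. With correlated features this equivalence is not guaranteed by the sketch.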


“…While exciting, to date there exists no principled understanding of why winning tickets can be transferred between tasks, nor does there exist a way to know, a priori, which tasks a given winning ticket can be transferred to. Additionally, there is a lack of theoretical work on iterative magnitude pruning (IMP) [11], the most common method used to find winning tickets. This is in striking analogy to the state of statistical physics in the early-to-mid-20th century.…”

Section: Introduction (mentioning)

confidence: 99%

Universality of Winning Tickets: A Renormalization Group Perspective

Redman, Chen, Wang et al. 2021

Preprint


Foundational work on the Lottery Ticket Hypothesis has suggested an exciting corollary: winning tickets found in the context of one task can be transferred to similar tasks, possibly even across different architectures. While this has become of broad practical and theoretical interest, to date, there exists no detailed understanding of why winning ticket universality exists, or any way of knowing a priori whether a given ticket can be transferred to a given task. To address these outstanding open questions, we make use of renormalization group theory, one of the most successful tools in theoretical physics. We find that iterative magnitude pruning, the method used for discovering winning tickets, is a renormalization group scheme. This opens the door to a wealth of existing numerical and theoretical tools, some of which we leverage here to examine winning ticket universality in large scale lottery ticket experiments, as well as sheds new light on the success iterative magnitude pruning has found in the field of sparse machine learning.
