2022
DOI: 10.48550/arxiv.2209.08247
Preprint

Gradient Properties of Hard Thresholding Operator

Abstract: Sparse optimization receives increasing attention in many applications such as compressed sensing, variable selection in regression problems, and recently neural network compression in machine learning. For example, the problem of compressing a neural network is a bi-level, stochastic, and nonconvex problem that can be cast into a sparse optimization problem. Hence, developing efficient methods for sparse optimization plays a critical role in applications. The goal of this paper is to develop analytical techni…

Cited by 2 publications (3 citation statements)
References 29 publications (39 reference statements)
“…These results can aid research on compressing DNNs that utilize the full gradient, as noted in Damadi et al (2022). It can benefit sparse optimization in both deterministic and stochastic settings, as the Iterative Hard Thresholding (IHT) algorithm uses the full gradient for a sparse solution Damadi & Shen (2022b) in deterministic settings and the mini-batch Stochastic IHT algorithm is employed in the stochastic context Damadi & Shen (2022a). We provided concise mathematical justifications to make the results clear and useful for people from different fields, even if they did not have a deep understanding of the mathematics involved.…”
Section: Discussion
mentioning confidence: 99%
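The statement above refers to the Iterative Hard Thresholding (IHT) iteration, x^(k+1) = H_s(x^k − γ∇f(x^k)), where H_s keeps the s largest-magnitude entries of its argument and zeroes the rest. Below is a minimal NumPy sketch of that iteration; the least-squares objective, step size, and problem sizes are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def hard_threshold(x, s):
    """H_s(x): keep the s largest-magnitude entries of x, zero the rest."""
    out = np.zeros_like(x)
    keep = np.argpartition(np.abs(x), -s)[-s:]
    out[keep] = x[keep]
    return out

def iht(grad_f, x0, s, step, iters=200):
    """Iterative Hard Thresholding: x <- H_s(x - step * grad_f(x))."""
    x = hard_threshold(x0, s)
    for _ in range(iters):
        x = hard_threshold(x - step * grad_f(x), s)
    return x

# Illustrative sparse recovery with f(x) = 0.5 * ||Ax - b||^2,
# whose gradient is grad_f(x) = A.T @ (A @ x - b).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[rng.choice(100, size=5, replace=False)] = rng.standard_normal(5)
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, with L the gradient's Lipschitz constant
x_hat = iht(lambda x: A.T @ (A @ x - b), np.zeros(100), s=5, step=step)
```

The same iteration underlies the mini-batch Stochastic IHT mentioned in the quote, with the full gradient replaced by a mini-batch estimate.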
“…For the special case of the ReLU function J_z f(z) is a diagonal matrix of zeros and ones which are associated to the negative and positive elements of z. Multiplying such a matrix from the left to any matrix W results in removing the rows of W associated to zero elements in J_z f(z) which greatly decreases the computation. The zero-th norm of the parameter vector can be minimized in sparse optimizations using these intuitions, as noted in Damadi & Shen (2022b).…”
Section: Jacobian Of Activation Functions
mentioning confidence: 99%
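To make the property quoted above concrete, here is a small NumPy check (shapes and variable names are illustrative assumptions): the Jacobian of ReLU at z is a diagonal 0/1 matrix, so left-multiplying a weight matrix W by it keeps only the rows of W tied to positive entries of z, and the same result can be obtained more cheaply by indexing those rows directly.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(6)           # pre-activation vector (illustrative size)
W = rng.standard_normal((6, 4))      # matrix to be multiplied from the left

# Jacobian of ReLU at z: diagonal matrix with 1 for positive entries of z, 0 otherwise.
D = np.diag((z > 0).astype(float))

# Full product: rows of W matching non-positive entries of z become zero.
full = D @ W

# Cheaper equivalent: skip the zeroed rows and copy only the active ones.
active = z > 0
reduced = np.zeros_like(W)
reduced[active] = W[active]

assert np.allclose(full, reduced)
```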
“…By setting a low temperature of 0.5, the generated Gumbel-Softmax samples become almost identical to one-hot vectors, eliminating the need for gradient approximation. The effects on the gradient were carefully analyzed and the results are presented in Damadi and Shen (2022), where a comprehensive study of gradient properties was conducted. To learn richer representations, we define an embedding matrix E ∈ R^(T×d_t), to convert a simplex frame into a vector representation as e_m = t_m^T E.…”
Section: Architecture For Event Modeling
mentioning confidence: 99%
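As a rough sketch of the mechanism described in the quote above (sizes, names, and the sampler below are assumptions for illustration, not the cited paper's implementation): a Gumbel-Softmax draw at temperature 0.5 is close to a one-hot vector, and multiplying that simplex frame t_m by an embedding matrix E ∈ R^(T×d_t) yields the representation e_m = t_m^T E.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, temperature=0.5):
    """Sample from the Gumbel-Softmax distribution: a point on the simplex
    that approaches a one-hot vector as the temperature decreases."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel_noise) / temperature
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

# Illustrative sizes: T categories, d_t embedding dimensions.
T, d_t = 8, 16
E = rng.standard_normal((T, d_t))    # embedding matrix E in R^(T x d_t)
logits = rng.standard_normal(T)

t_m = gumbel_softmax(logits, temperature=0.5)   # nearly one-hot simplex frame
e_m = t_m @ E                                   # e_m = t_m^T E, a d_t-dimensional vector
```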