2022
DOI: 10.48550/arxiv.2209.08247
Preprint

Gradient Properties of Hard Thresholding Operator

Abstract: Sparse optimization receives increasing attention in many applications such as compressed sensing, variable selection in regression problems, and recently neural network compression in machine learning. For example, the problem of compressing a neural network is a bi-level, stochastic, and nonconvex problem that can be cast into a sparse optimization problem. Hence, developing efficient methods for sparse optimization plays a critical role in applications. The goal of this paper is to develop analytical techni…

Cited by 2 publications (3 citation statements)
References 29 publications (39 reference statements)
“…These results can aid research on compressing DNNs that utilize the full gradient, as noted in Damadi et al (2022). It can benefit sparse optimization in both deterministic and stochastic settings, as the Iterative Hard Thresholding (IHT) algorithm uses the full gradient for a sparse solution Damadi & Shen (2022b) in deterministic settings and the mini-batch Stochastic IHT algorithm is employed in the stochastic context Damadi & Shen (2022a). We provided concise mathematical justifications to make the results clear and useful for people from different fields, even if they did not have a deep understanding of the mathematics involved.…”
Section: Discussion
mentioning confidence: 99%
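The statement above refers to the Iterative Hard Thresholding (IHT) iteration, x^(k+1) = H_s(x^k − γ∇f(x^k)), where H_s keeps the s largest-magnitude entries of its argument and zeroes the rest. Below is a minimal NumPy sketch of that iteration; the least-squares objective, step size, and problem sizes are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def hard_threshold(x, s):
    """H_s(x): keep the s largest-magnitude entries of x, zero the rest."""
    out = np.zeros_like(x)
    keep = np.argpartition(np.abs(x), -s)[-s:]
    out[keep] = x[keep]
    return out

def iht(grad_f, x0, s, step, iters=200):
    """Iterative Hard Thresholding: x <- H_s(x - step * grad_f(x))."""
    x = hard_threshold(x0, s)
    for _ in range(iters):
        x = hard_threshold(x - step * grad_f(x), s)
    return x

# Illustrative sparse recovery with f(x) = 0.5 * ||Ax - b||^2,
# whose gradient is grad_f(x) = A.T @ (A @ x - b).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[rng.choice(100, size=5, replace=False)] = rng.standard_normal(5)
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, with L the gradient's Lipschitz constant
x_hat = iht(lambda x: A.T @ (A @ x - b), np.zeros(100), s=5, step=step)
```

The same iteration underlies the mini-batch Stochastic IHT mentioned in the quote, with the full gradient replaced by a mini-batch estimate.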
“…For the special case of the ReLU function J_z f(z) is a diagonal matrix of zeros and ones which are associated to the negative and positive elements of z. Multiplying such a matrix from the left to any matrix W results in removing the rows of W associated to zero elements in J_z f(z) which greatly decreases the computation. The zero-th norm of the parameter vector can be minimized in sparse optimizations using these intuitions, as noted in Damadi & Shen (2022b).…”
Section: Jacobian Of Activation Functions
mentioning confidence: 99%
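To make the property quoted above concrete, here is a small NumPy check (shapes and variable names are illustrative assumptions): the Jacobian of ReLU at z is a diagonal 0/1 matrix, so left-multiplying a weight matrix W by it keeps only the rows of W tied to positive entries of z, and the same result can be obtained more cheaply by indexing those rows directly.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(6)           # pre-activation vector (illustrative size)
W = rng.standard_normal((6, 4))      # matrix to be multiplied from the left

# Jacobian of ReLU at z: diagonal matrix with 1 for positive entries of z, 0 otherwise.
D = np.diag((z > 0).astype(float))

# Full product: rows of W matching non-positive entries of z become zero.
full = D @ W

# Cheaper equivalent: skip the zeroed rows and copy only the active ones.
active = z > 0
reduced = np.zeros_like(W)
reduced[active] = W[active]

assert np.allclose(full, reduced)
```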
“…By setting a low temperature of 0.5, the generated Gumbel-Softmax samples become almost identical to one-hot vectors, eliminating the need for gradient approximation. The effects on the gradient were carefully analyzed and the results are presented in Damadi and Shen (2022), where a comprehensive study of gradient properties was conducted. To learn richer representations, we define an embedding matrix E ∈ R^(T×d_t), to convert a simplex frame into a vector representation as e_m = t_m^T E.…”
Section: Architecture For Event Modeling
mentioning confidence: 99%
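As a rough sketch of the mechanism described in the quote above (sizes, names, and the sampler below are assumptions for illustration, not the cited paper's implementation): a Gumbel-Softmax draw at temperature 0.5 is close to a one-hot vector, and multiplying that simplex frame t_m by an embedding matrix E ∈ R^(T×d_t) yields the representation e_m = t_m^T E.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, temperature=0.5):
    """Sample from the Gumbel-Softmax distribution: a point on the simplex
    that approaches a one-hot vector as the temperature decreases."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel_noise) / temperature
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

# Illustrative sizes: T categories, d_t embedding dimensions.
T, d_t = 8, 16
E = rng.standard_normal((T, d_t))    # embedding matrix E in R^(T x d_t)
logits = rng.standard_normal(T)

t_m = gumbel_softmax(logits, temperature=0.5)   # nearly one-hot simplex frame
e_m = t_m @ E                                   # e_m = t_m^T E, a d_t-dimensional vector
```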