2020
DOI: 10.48550/arxiv.2008.02965
Preprint

Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations

Cited by 4 publications (5 citation statements)
References 12 publications

“…The ℓ2 regularization [30] is the most common method; it adds an ℓ2 penalty term to the loss function. The ℓ1 regularization [31] corresponds to a Laplace prior on the weights; it adds an ℓ1 penalty to the loss so that the coefficients of input variables uncorrelated with the output can be shrunk exactly to 0. However, Weight Scale Shifting (WSS) in standard deep learning models may make the effect of ℓ2 regularization less pronounced [32]. Dropout [33][34][35][36] deactivates individual neurons with a certain probability during the forward pass of the training phase.…”
Section: Regularization Techniques (citation type: mentioning; confidence: 99%)
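
To make the WSS point concrete, here is a minimal NumPy sketch (our own illustration, not code from [32] or the citing paper): rescaling two consecutive ReLU layers by c and 1/c leaves the network function unchanged while changing the ℓ2 penalty, so weight decay alone does not pin down a unique parameterization.

```python
# Our own NumPy illustration of Weight Scale Shifting (WSS), not code from the
# cited papers: scaling one ReLU layer's weights by c > 0 and the next layer's
# by 1/c leaves the network function unchanged but changes the l2 penalty.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # first-layer weights
W2 = rng.normal(size=(1, 16))   # second-layer weights
x = rng.normal(size=(8,))

def forward(W1, W2, x):
    # ReLU is positively homogeneous: relu(c * z) = c * relu(z) for c > 0
    return W2 @ np.maximum(W1 @ x, 0.0)

def l2_penalty(*Ws):
    return sum(float((W ** 2).sum()) for W in Ws)

c = 10.0
W1s, W2s = c * W1, W2 / c       # weight scale shift between the two layers

print(np.allclose(forward(W1, W2, x), forward(W1s, W2s, x)))  # True: same function
print(l2_penalty(W1, W2), l2_penalty(W1s, W2s))               # penalties differ
```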
“…Besides these, there are also standard regularization methods such as parameter sharing [37], max-norm regularization [38], gradient clipping [39], and WEISSI [32]. In this paper, we utilize Stochastic Shared Embedding (SSE) regularization [15], a data-driven regularization method.…”
Section: Regularization Techniques (citation type: mentioning; confidence: 99%)
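
As context for the methods listed in this statement, the PyTorch sketch below is a generic illustration of gradient clipping and a max-norm constraint; it is not the SSE regularization used by the citing paper, nor the WEISSI method.

```python
# Generic PyTorch illustration of gradient clipping and max-norm regularization
# (two of the standard methods listed above); this is not the SSE method.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 8), torch.randn(32, 1)

opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient clipping: rescale the global gradient norm to at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()

# Max-norm constraint: after the update, project each weight row back onto
# the l2 ball of radius 3.0.
with torch.no_grad():
    for m in model.modules():
        if isinstance(m, nn.Linear):
            m.weight.copy_(torch.renorm(m.weight, p=2, dim=0, maxnorm=3.0))
```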
“…Other recent works propose related forms of regularization and argue that these are sometimes better than weight decay. The authors of [24] introduced the "path regularizer", a generalization of the regularizer in [26] to deep neural networks, and showed that it can lead to solutions that generalize better and are more robust [4,15]. Similarly, [21] exploit the homogeneity of ReLU neural networks and propose a "scale shift invariant" algorithm. A proximal-gradient-type algorithm for the 1-path-norm is proposed in [18], which focuses on the ‖w‖₁‖v‖₁ norm of a homogeneous unit (w, v) in shallow networks.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
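
For readers unfamiliar with the 1-path-norm mentioned here, the NumPy sketch below is our own illustration of that quantity for a shallow ReLU network; it shows the invariance to per-unit weight rescalings (by ReLU homogeneity) that the ℓ2 penalty lacks, but it does not reproduce the proximal algorithm of [18].

```python
# Our own NumPy illustration of the 1-path-norm of a shallow ReLU network
# y = v^T relu(W x): the sum over input->hidden->output paths of |v_j * W_ji|.
# It is invariant under per-unit rescalings (c_j * W_j, v_j / c_j), which leave
# the network function unchanged, while the l2 penalty is not.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 8))    # one row of W per hidden unit
v = rng.normal(size=(16,))      # output weights

def path_norm_1(W, v):
    # sum_{j,i} |v_j| * |W_ji| = sum_j |v_j| * ||W_j||_1
    return float(np.abs(v) @ np.abs(W).sum(axis=1))

c = rng.uniform(0.1, 10.0, size=16)          # per-unit positive rescalings
W_s, v_s = c[:, None] * W, v / c             # same input-output function

print(np.isclose(path_norm_1(W, v), path_norm_1(W_s, v_s)))          # True
print((W ** 2).sum() + (v ** 2).sum(),
      (W_s ** 2).sum() + (v_s ** 2).sum())                           # differs
```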
“…The weight scale shifting issue is also discussed in the adversarial setting [22]: the scale of the weights can be shifted between layers without changing the input-output function specified by the network, which can limit the capacity of standard penalties to regularize the model. A weight-scale-shift-invariant regularization is then proposed and shown to improve adversarial robustness.…”
Section: Linearity Exploration In Adversary (citation type: mentioning; confidence: 99%)
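
One simple way to see what a weight-scale-shift-invariant penalty can look like is the product of per-layer norms; the PyTorch sketch below is a hedged illustration of that invariance idea only, and the exact regularizer proposed in [22] (the indexed paper) may take a different form.

```python
# A hedged PyTorch sketch of a weight-scale-shift-invariant penalty: the product
# of per-layer Frobenius norms is unchanged when W_l -> c * W_l and
# W_{l+1} -> W_{l+1} / c, unlike the sum of squared norms used by weight decay.
# This illustrates the invariance idea only; the regularizer proposed in the
# indexed paper may differ.
import torch
import torch.nn as nn

def scale_shift_invariant_penalty(model: nn.Module) -> torch.Tensor:
    penalty = torch.ones(())
    for m in model.modules():
        if isinstance(m, nn.Linear):
            penalty = penalty * m.weight.norm()
    return penalty

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x, y = torch.randn(32, 8), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y) + 1e-3 * scale_shift_invariant_penalty(model)
loss.backward()
```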