Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD
Preprint, 2019. DOI: 10.48550/arxiv.1906.10822

Cited by 3 publications (4 citation statements). References: 0 publications.
“…In our paper, we show that parameter averaging stabilizes convergence to a flat region or an asymmetric valley, and we suggest that combining it with a large step size is useful for difficult datasets that need stronger regularization. Besides, several authors have proposed methods that explicitly inject noise to improve generalization (Chaudhari et al., 2019), in particular for the large-batch setting (Wen et al., 2018; Haruki et al., 2019; Lin et al., 2020). 2018) explained that, under a simplified setting, SGD travels on a hypersphere because the iterates converge to a Gaussian distribution that concentrates on the sphere, and thus an averaging scheme lets us move inside the sphere, which may be flat.…”
Section: Discussion (mentioning)
confidence: 99%
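The parameter-averaging idea quoted above can be illustrated with a minimal NumPy sketch on a toy quadratic loss: plain SGD with a large step size bounces around the minimum, while a running average of the iterates settles in the flatter centre. The loss, step size, and iteration count below are illustrative assumptions, not the cited authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    """Gradient of a toy quadratic loss plus noise mimicking minibatch SGD."""
    return w + 0.5 * rng.standard_normal(w.shape)

w = rng.standard_normal(10)   # current SGD iterate
w_avg = w.copy()              # running average of the visited parameters
lr = 0.1                      # deliberately large step size

for t in range(1, 2001):
    w -= lr * noisy_grad(w)       # plain SGD step on the noisy gradient
    w_avg += (w - w_avg) / t      # incremental mean of the iterates

# w keeps bouncing around the minimum; w_avg settles much closer to it,
# illustrating the stabilizing effect of parameter averaging.
print(np.linalg.norm(w), np.linalg.norm(w_avg))
```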
“…To measure the essential performance gains of our method fairly, we reproduced other recent SOTA optimization methods, namely K-FAC[30] and GNC.[31] The baseline used in these cases has none of these optimizations applied; both instances use a multi-step learning-rate policy. In distributed training scenarios, the ImageNet baseline results include floating-point randomness due to allreduce, a collective communication operation across multiple nodes.…”
Section: Compare With Other Optimization Techniques (mentioning)
confidence: 99%
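For context, the multi-step learning-rate policy mentioned for the baselines simply drops the learning rate by a fixed factor at preset epochs. A minimal Python sketch follows; the milestones and decay factor are illustrative placeholders, not the settings used in the cited experiments.

```python
def multistep_lr(base_lr, epoch, milestones=(30, 60, 80), gamma=0.1):
    """Multiply the base learning rate by gamma once for every milestone passed."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

# 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-79, 0.0001 afterwards
print([multistep_lr(0.1, e) for e in (0, 29, 30, 60, 80)])
```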
“…To overcome the generalization degradation in large-batch training, Wen[32] and Haruki[31] proposed gradient-noise injection methods that eliminate sharp minima by empirically adding gradient noise to the weights, because small-batch SGD introduces noisy behavior that allows escaping from sharp minima.[16] Haruki claimed that their method achieves good results when training ResNet-50 on ImageNet-1K with a batch size of 32K, but not at 128K. Moreover, due to the randomness of the gradient noise, it does not achieve good performance every time.…”
Section: Related Work (mentioning)
confidence: 99%
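As a rough illustration of the gradient-noise-injection idea described in this excerpt, the sketch below adds Gaussian noise to an otherwise nearly deterministic large-batch gradient so the update behaves more like small-batch SGD. It is a toy NumPy example under assumed loss, noise scale, and step size, not the cited GNC or SmoothOut procedures.

```python
import numpy as np

rng = np.random.default_rng(0)

def large_batch_grad(w):
    """Gradient of a toy quadratic loss; a large batch makes it nearly noiseless."""
    return w

w = rng.standard_normal(10)
lr = 0.1
sigma = 0.05   # scale of the injected noise (illustrative placeholder)

for _ in range(1000):
    g = large_batch_grad(w)
    g = g + sigma * rng.standard_normal(w.shape)  # inject Gaussian noise into the gradient
    w -= lr * g                                   # noisy update mimics small-batch SGD behaviour
```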
“…Lookahead (Zhang et al., 2019), of interest here, is in spirit closer to extrapolation methods (Korpelevich, 1976), which rely on gradients taken not at the current iterate but at a point extrapolated along the current trajectory. For highly complex optimization landscapes such as those in deep learning, using gradients at perturbations of the current iterate has a desirable smoothing effect that is known to help training speed and stability in non-convex single-objective optimization (Wen et al., 2018; Haruki et al., 2019) as well as for GANs. Several proposed methods for GANs are motivated by the “recurrent dynamics”.…”
Section: Related Work (mentioning)
confidence: 99%
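The smoothing effect described here, taking gradients at perturbations of the current iterate rather than at the iterate itself, can be sketched in a few lines. The toy NumPy example below (the loss, noise scale sigma, and sample count are assumed placeholders) averages gradients at randomly perturbed points, which approximates descending on a Gaussian-smoothed loss surface; it illustrates the general idea rather than any specific cited algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(w):
    """Gradient of a toy loss (quadratic stand-in for a deep-learning objective)."""
    return w

def smoothed_grad(w, sigma=0.1, n_samples=8):
    """Average gradients taken at Gaussian perturbations of the current iterate.

    Averaging over perturbations approximates the gradient of the loss
    convolved with a Gaussian, i.e. a smoothed version of the loss surface.
    """
    return np.mean(
        [grad(w + sigma * rng.standard_normal(w.shape)) for _ in range(n_samples)],
        axis=0,
    )

w = rng.standard_normal(10)
for _ in range(500):
    w -= 0.1 * smoothed_grad(w)   # descend on the smoothed landscape
```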