2016
DOI: 10.48550/arxiv.1609.04836
Preprint
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Citations: cited by 307 publications (436 citation statements)
References: 0 publications
“…Our result lends support to the belief that the noise introduced by Stochastic Gradient Descent (SGD) is superior to isotropic noise, which has been widely observed [18,36,40,38]. As an illustrative example, we plot the generalization errors of SGD, SGLD with isotropic noise, and SGLD with the optimal noise in Figure 1, whose training curves behave almost the same (not shown here).…”
Section: Introduction (supporting)
confidence: 76%
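The excerpt contrasts the anisotropic noise implicit in SGD's minibatch gradients with the isotropic noise injected by SGLD. A minimal sketch of the two update rules is given below; the model, learning rate, and noise scale are illustrative assumptions, not the cited papers' exact setup.

```python
import torch

def sgd_step(model, loss, lr=0.1):
    """Plain SGD step: the only stochasticity comes from the minibatch gradient."""
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
            p.grad = None

def sgld_isotropic_step(model, loss, lr=0.1, noise_std=0.01):
    """SGLD-style step: same gradient update plus isotropic Gaussian noise."""
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
            p += noise_std * torch.randn_like(p)  # isotropic N(0, noise_std^2 I) perturbation
            p.grad = None

# Toy usage with a hypothetical linear model and random data.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
sgld_isotropic_step(model, torch.nn.functional.mse_loss(model(x), y))
```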
“…Analysis of the shape of the loss landscape: For a more explicit investigation, we analyze DCutMix through its loss landscape. Flatness of the loss landscape near local minima has been considered a key signal for achieving better generalization in various settings by a number of previous studies (Keskar et al. 2016; Pereyra et al. 2017; Zhang et al. 2018; Chaudhari et al. 2019; Cha et al. 2020). The general interpretation of the shape of the loss landscape is that if a model converges to a wide (flat) local minimum, it tends to generalize better to an unseen test dataset.…”
Section: Experimental Results on Image Classification CIFAR-10/100 (mentioning)
confidence: 99%
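A simple way to probe the flatness the excerpt refers to is to measure how much the loss rises under small random weight perturbations. The sketch below is one such proxy under assumed names (`flatness_proxy`, the perturbation scale `sigma`, and the toy model/data); it is not the cited works' exact sharpness measure.

```python
import copy
import torch

def flatness_proxy(model, loss_fn, x, y, sigma=0.01, n_samples=10):
    """Average loss increase under weight perturbations drawn from N(0, sigma^2 I).
    Larger values suggest a sharper minimum; smaller values suggest a flatter one."""
    with torch.no_grad():
        base_loss = loss_fn(model(x), y).item()
        increases = []
        for _ in range(n_samples):
            perturbed = copy.deepcopy(model)
            for p in perturbed.parameters():
                p += sigma * torch.randn_like(p)  # perturb every parameter tensor
            increases.append(loss_fn(perturbed(x), y).item() - base_loss)
    return sum(increases) / n_samples

# Toy usage with a hypothetical linear model and random data.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
print(flatness_proxy(model, torch.nn.functional.mse_loss, x, y))
```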
“…However, we argue that such linear connectivity can only be guaranteed when the loss surface around the local minima is flat [8,17,18], which allows the low-error ellipsoid around each optimum to be wide enough to overlap with the others. Yet, when fine-tuning with extremely small data, a model often suffers from the large-batch training dilemma [19], which results in sharp and narrow minima. Thus the chance that a loss-smooth linear connector exists between different tasks quickly becomes very small as learning continues, as illustrated in Figure 2 (a).…”
Section: Stable Moment Matching (SMM) (mentioning)
confidence: 99%
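The linear connectivity the excerpt discusses is typically checked by evaluating the loss along the straight line between two sets of weights. Below is a minimal sketch of that 1-D interpolation; the function name, the toy models standing in for two optima, and the random data are assumptions for illustration only.

```python
import copy
import torch

def interpolation_losses(model_a, model_b, loss_fn, x, y, steps=11):
    """Loss along theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b for alpha in [0, 1]."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)  # scratch model to hold the blended weights
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        blended = {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}
        probe.load_state_dict(blended)
        with torch.no_grad():
            losses.append(loss_fn(probe(x), y).item())
    return losses

# Toy usage: two independently initialized linear models stand in for two optima.
model_a, model_b = torch.nn.Linear(10, 1), torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
print(interpolation_losses(model_a, model_b, torch.nn.functional.mse_loss, x, y))
```

A flat, low barrier along this path is what the cited argument means by a loss-smooth linear connector; a pronounced bump at intermediate alpha indicates the two optima sit in sharp, disconnected basins.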