2016 | Preprint
DOI: 10.48550/arxiv.1611.01838

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Citation Types: 5 supporting, 133 mentioning, 0 contrasting
Cited by 81 publications (138 citation statements)
References 0 publications

“…Furthermore, our numerical experiments verify that the Jacobian matrix of real datasets (such as CIFAR10) indeed exhibits low-rank structure. This is closely related to the observations on the Hessian of deep networks, which is empirically observed to be low-rank [15,44]. An equally important question for understanding the convergence behavior of optimization algorithms for overparameterized models is understanding their generalization capabilities.…”
Section: Prior Art (mentioning)
confidence: 68%
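One way to see the kind of low-rank structure this excerpt describes is to stack the per-example parameter gradients of a network's outputs into a Jacobian and inspect how fast its singular values decay. The sketch below does this for a toy two-layer tanh network on random data; the architecture, shapes, and helper names are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network f(x) = W2 @ tanh(W1 @ x), with all parameters flattened.
d_in, d_hid, d_out, n = 20, 50, 10, 200
W1 = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)
W2 = rng.standard_normal((d_out, d_hid)) / np.sqrt(d_hid)
X = rng.standard_normal((n, d_in))

def jacobian_rows(x):
    """Gradients of each output w.r.t. all parameters, stacked as (d_out, n_params)."""
    h = np.tanh(W1 @ x)
    dW2 = np.kron(np.eye(d_out), h)                          # df_k/dW2: h sits in the k-th block
    dW1 = (W2 * (1 - h**2))[:, :, None] * x[None, None, :]   # df_k/dW1[j,i] = W2[k,j](1-h_j^2)x_i
    return np.hstack([dW1.reshape(d_out, -1), dW2])

J = np.vstack([jacobian_rows(x) for x in X])                 # shape (n*d_out, n_params)
s = np.linalg.svd(J, compute_uv=False)
print("top 10 singular values:", np.round(s[:10], 2))
print("energy in top 10% of directions:", (s[: len(s) // 10] ** 2).sum() / (s ** 2).sum())
```
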
“…(v) For potentials that do not admit an obvious decomposition like (1.2), we propose using the local entropy approximation [19,20] to extract the large-scale information needed for either the Modified MALA method or the independence sampler.…”
Section: Results on Performance in the Presence of Roughness (mentioning)
confidence: 99%
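The Modified MALA mentioned here builds on the standard Metropolis-adjusted Langevin step; one plausible minimal sketch is a plain MALA update run on a smoothed surrogate potential (the cited method itself may differ in detail). The `smoothed_V` and `smoothed_grad` callables are hypothetical placeholders; a sketch of one way to obtain such a smoothed potential follows the next excerpt.

```python
import numpy as np

def mala_step(x, smoothed_V, smoothed_grad, step, rng):
    """One Metropolis-adjusted Langevin step targeting exp(-smoothed_V(x))."""
    # Langevin proposal: drift down the (smoothed) gradient plus Gaussian noise.
    mean_fwd = x - step * smoothed_grad(x)
    prop = mean_fwd + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)

    # Metropolis-Hastings correction for the asymmetric Gaussian proposal.
    mean_bwd = prop - step * smoothed_grad(prop)
    log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (4.0 * step)
    log_q_bwd = -np.sum((x - mean_bwd) ** 2) / (4.0 * step)
    log_alpha = (smoothed_V(x) - smoothed_V(prop)) + (log_q_bwd - log_q_fwd)

    return prop if np.log(rng.uniform()) < log_alpha else x
```
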
“…One option is to use physical intuition about the problem to identify a potential U(x) that has suitable properties. More systematically, we can use the local entropy approach formulated in [19,20], or, equivalently, the Moreau-Yosida approximation, to estimate a smoothed version of V(x).…”
Section: Finding Smoothed Landscapes (mentioning)
confidence: 99%
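The local entropy of [19,20] smooths a potential by convolving exp(-V) with a Gaussian, and a crude way to estimate the smoothed value at a point is plain Monte Carlo over Gaussian perturbations. In the sketch below, the smoothing scale `gamma` (used here as the perturbation variance), the sample count, and the toy rough potential are all arbitrary illustrative choices; Entropy-SGD itself estimates the corresponding gradient with an inner SGLD loop rather than i.i.d. sampling.

```python
import numpy as np

def smoothed_potential(V, x, gamma=0.05, n_samples=2000, rng=None):
    """Monte Carlo estimate of V_gamma(x) = -log E_{eps ~ N(0, gamma I)}[exp(-V(x + eps))]."""
    rng = np.random.default_rng() if rng is None else rng
    eps = np.sqrt(gamma) * rng.standard_normal((n_samples, x.size))
    vals = np.array([V(x + e) for e in eps])
    # Log-sum-exp trick for numerical stability.
    m = (-vals).max()
    return -(m + np.log(np.mean(np.exp(-vals - m))))

# A rough 1-D potential: a wide quadratic valley with many narrow wells on top.
V = lambda x: 0.5 * float(x @ x) + 0.5 * float(np.sum(np.cos(20.0 * x)))
x0 = np.array([1.0])
print("V(x0) =", V(x0), " smoothed V(x0) ~", smoothed_potential(V, x0))
```
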
“…In recent years, there have been many efforts to mathematically explain the generalization capability of DNNs using a variety of tools. They range from attributing it to the way the SGD method automatically finds flat local minima (which are stable and thus generalize well) [30,31,32,33], to efforts to relate the success of DNNs to the special class of hierarchical functions that they generate [34].…”
Section: Information Bottleneck and Stochastic Gradient Descent (mentioning)
confidence: 99%