2019
DOI: 10.1088/1742-5468/ab39d9

Entropy-SGD: biasing gradient descent into wide valleys*

Abstract: This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding…
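To make the abstract's description concrete, below is a minimal sketch of the kind of update Entropy-SGD performs: an inner SGLD loop estimates the mean of a local Gibbs measure around the current weights, and the outer step moves the weights toward that mean, i.e. ascending the local-entropy objective. The function names, step sizes, noise scale, and averaging constant here are illustrative assumptions, not the paper's exact hyper-parameters.

```python
import numpy as np

def entropy_sgd_step(w, grad_f, outer_lr=0.1, gamma=1e-4,
                     sgld_steps=20, sgld_lr=0.01, eps=1e-4, rng=None):
    """One outer Entropy-SGD step (hedged sketch).

    grad_f(w) should return a (stochastic) gradient of the training loss
    at w; w is a NumPy array of parameters.
    """
    rng = rng or np.random.default_rng()
    w_prime = w.copy()   # SGLD iterate exploring the neighborhood of w
    mu = w.copy()        # running average of the SGLD iterates
    for _ in range(sgld_steps):
        # Langevin step on f(w') plus a quadratic term keeping w' near w
        g = grad_f(w_prime) + gamma * (w_prime - w)
        w_prime = (w_prime - sgld_lr * g
                   + np.sqrt(sgld_lr) * eps * rng.standard_normal(w.shape))
        mu = 0.75 * mu + 0.25 * w_prime
    # Ascend the local entropy: its gradient w.r.t. w is -gamma * (w - mu)
    return w - outer_lr * gamma * (w - mu)
```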

Cited by 361 publications (430 citation statements) · References 50 publications

“…Intuitively, our idea of MSKT is in line with recent studies on the robustness of high posterior entropy [40], [44]. In our MSKT, each auxiliary branch only aims to better learn the specific knowledge from a certain dataset.…”
Section: B. Multi-site-guided Knowledge Transfer
confidence: 94%
“…In MSKT, by contrast, the universal network has to mimic the ground-truth labels and the predictions of multiple auxiliary branches simultaneously. Compared with conventional supervised learning, MSKT provides additional multi-site information that regularizes the universal network and increases its posterior entropy [40], which helps the shared kernels explore more robust representations across multiple datasets. Moreover, the multi-branch architecture in MSKT can also act as a positive feature regularizer for the universal encoder, since the auxiliary branches and the universal network are trained jointly.…”
Section: B. Multi-site-guided Knowledge Transfer
confidence: 99%
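For illustration only, the training signal described in the preceding excerpt (matching the ground-truth labels while distilling from several auxiliary branches) could be written roughly as follows; the function name, the temperature, and the weighting alpha are assumptions, not the cited paper's exact formulation.

```python
import torch.nn.functional as F

def mskt_universal_loss(universal_logits, targets, aux_logits_list,
                        temperature=2.0, alpha=0.5):
    """Hedged sketch: supervised loss plus distillation from auxiliary branches."""
    # Supervised term against the ground-truth labels
    ce = F.cross_entropy(universal_logits, targets)

    # Distillation terms: match the softened predictions of each auxiliary branch
    log_p = F.log_softmax(universal_logits / temperature, dim=1)
    kd = 0.0
    for aux_logits in aux_logits_list:
        q = F.softmax(aux_logits.detach() / temperature, dim=1)
        kd = kd + F.kl_div(log_p, q, reduction="batchmean") * temperature ** 2
    kd = kd / max(len(aux_logits_list), 1)

    return (1.0 - alpha) * ce + alpha * kd
```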
“…This prevents the dynamics from getting trapped in local maxima, which is crucial for non-convex optimization landscapes, and it also directs the dynamics toward minima with wider basins of attraction [23]. It has been argued that the latter effect contributes to improving generalization performance [24][25][26]. Although SGA has a slower asymptotic convergence rate than ordinary gradient descent, this often does not matter in practice for finite data sets, as the performance on the test set usually stops improving once the asymptotic regime is reached [27].…”
Section: Stochastic Optimization
confidence: 99%
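The mechanism this excerpt invokes, injected gradient noise helping the iterates escape narrow basins and settle in wider ones, can be sketched generically as below (written in descent form for concreteness); the noise scale and its coupling to the step size are illustrative assumptions rather than the cited paper's algorithm.

```python
import numpy as np

def noisy_gradient_step(w, grad, lr=0.05, noise_scale=0.1, rng=None):
    """Hedged sketch of a single noisy gradient step.

    The Gaussian perturbation plays the role of the stochasticity discussed
    in the excerpt: it can kick the iterate out of narrow basins, biasing
    the long-run dynamics toward wider ones.
    """
    rng = rng or np.random.default_rng()
    return w - lr * grad(w) + noise_scale * np.sqrt(lr) * rng.standard_normal(w.shape)
```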
“…Intuitively, one expects weights with a high local entropy to generalize well since they are more robust with respect to perturbations of the parameters and the data and therefore less likely to be an artifact of overfitting. In fact, such flat minima have already been found to have better generalization properties for some deep architectures [15]. We call the optimization procedure that finds such minima robust optimization.…”
Section: Replicated Systems and Overfitting
confidence: 97%
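For reference, the local-entropy objective this excerpt alludes to can be written, following the Entropy-SGD paper indexed on this page, as the following display, where $f$ is the training loss over weights $w'$ and $\gamma$ is a scope parameter controlling the width of the neighborhood around $w$:

$$
F(w;\gamma) \;=\; \log \int_{\mathbb{R}^n} \exp\!\Big(-f(w') - \tfrac{\gamma}{2}\,\lVert w - w'\rVert_2^2\Big)\, dw'
$$

Weights at which $F$ is large sit in regions where many nearby configurations also have low loss, which is the sense of "high local entropy" used in the excerpt.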
“…Therefore we take special measures to avoid overfitting, which would decrease generalization performance and would constitute evidence that the current representation in the hidden units does not correspond to the basic features in the data. We note that in neural networks, overfitting has been connected to sharp minima of the loss function [14,15,16]. To avoid such minima, we modify the optimization procedure to prefer weights that lie in the vicinity of other weights with low loss, which is a measure of the flatness of the loss landscape.…”
Section: Replicated Systems and Overfitting
confidence: 99%
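One simple way to operationalize "weights in the vicinity of other weights that have low loss" is to average the loss over random perturbations of the weights; the sketch below illustrates that idea, with the Gaussian perturbation, sigma, and sample count chosen as assumptions rather than taken from the cited work.

```python
import numpy as np

def perturbed_loss(loss_fn, w, sigma=0.01, n_samples=16, rng=None):
    """Hedged flatness proxy: mean loss over Gaussian perturbations of w.

    A low value indicates that w lies in a region where nearby weights
    also achieve low loss, i.e. a flat part of the loss landscape.
    """
    rng = rng or np.random.default_rng()
    vals = [loss_fn(w + sigma * rng.standard_normal(w.shape))
            for _ in range(n_samples)]
    return float(np.mean(vals))
```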