Global optimality conditions for deep neural networks

Yun, Chulhee; Sra, Suvrit; Jadbabaie, Ali

doi:10.48550/arxiv.1707.02444

Cited by 22 publications

(29 citation statements)

References 8 publications

“…For this reason, deep linear networks have been the subject of extensive theoretical analysis. A line of work (Kawaguchi, 2016; Hardt & Ma, 2016; Lu & Kawaguchi, 2017; Yun et al, 2017; Zhou & Liang, 2018; Laurent & von Brecht, 2018) studied the landscape properties of deep linear networks. Although it was established that all local minima are global under certain assumptions, these properties alone are still not sufficient to guarantee global convergence or to provide a concrete rate of convergence for gradient-based optimization algorithms.…”

Section: Related Work (mentioning)

confidence: 99%

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

Hu, Xiao, Pennington

2020

Preprint

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
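The isometry property that the abstract above refers to can be illustrated with a small sketch. It does not reproduce the paper's convergence result; it only compares, at initialization, how much a product of layer matrices stretches different unit inputs. The function names, the width of 16, the depth of 20, and the Gaussian variance of 1/width are all assumptions chosen for illustration.

```python
import math
import random

def gaussian_matrix(n, rng):
    """n x n matrix with iid N(0, 1/n) entries (assumed scaling)."""
    scale = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]

def orthogonal_matrix(n, rng):
    """Orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _pass in range(2):  # second pass improves numerical orthogonality
            for u in rows:
                dot = sum(a * b for a, b in zip(u, v))
                v = [b - dot * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(b * b for b in v))
        rows.append([b / norm for b in v])
    return rows

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def norm_spread(make_matrix, depth, width, trials=50, seed=0):
    """Max/min of ||W_depth ... W_1 x|| over random unit vectors x."""
    rng = random.Random(seed)
    layers = [make_matrix(width, rng) for _ in range(depth)]
    norms = []
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]
        length = math.sqrt(sum(v * v for v in x))
        x = [v / length for v in x]
        for w in layers:
            x = matvec(w, x)
        norms.append(math.sqrt(sum(v * v for v in x)))
    return max(norms) / min(norms)

orth_spread = norm_spread(orthogonal_matrix, depth=20, width=16)
gauss_spread = norm_spread(gaussian_matrix, depth=20, width=16)
print(f"orthogonal: {orth_spread:.6f}, Gaussian: {gauss_spread:.3f}")
```

A product of orthogonal matrices is itself orthogonal, so every unit input keeps norm 1 and the spread stays at 1 up to rounding. The Gaussian product stretches inputs unevenly, so its spread is noticeably larger.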

“…The training loss of multilayer neural networks at differentiable local minima was examined in [38]. Yun et al [44] very recently provided sufficient and necessary conditions to guarantee that certain critical points are also global minima.…”

Section: Introduction (mentioning)

confidence: 99%

The Global Optimization Geometry of Shallow Linear Neural Networks

et al. 2019

We examine the squared error loss landscape of shallow linear neural networks. We show, with significantly milder assumptions than previous works, that the corresponding optimization problems have benign geometric properties: there are no spurious local minima and the Hessian at every saddle point has at least one negative eigenvalue. This means that at every saddle point there is a directional negative curvature which algorithms can utilize to further decrease the objective value. These geometric properties imply that many local search algorithms (such as the gradient descent which is widely utilized for training neural networks) can provably solve the training problem with global convergence.

Footnote 1: From an optimization perspective, non-strict saddle points and local minima have similar first-/second-order information and it is hard for first-/second-order methods (like gradient descent) to distinguish between them.
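A minimal sketch of the strict-saddle property, assuming the simplest possible shallow linear network: one scalar weight per layer and a single target value, so the loss is 0.5 * (a * b - y)^2. This toy model and all names in the code are illustrative assumptions, not the setting analyzed in the paper. The origin is a critical point whose Hessian has a negative eigenvalue, and gradient descent started from a small perturbation moves away from it.

```python
import math

def loss(a, b, y=1.0):
    """Squared error of the scalar two-layer linear model x -> a * b * x at x = 1."""
    return 0.5 * (a * b - y) ** 2

def grad(a, b, y=1.0):
    r = a * b - y  # residual
    return r * b, r * a

def hessian_eigenvalues(a, b, y=1.0):
    """Eigenvalues of the 2 x 2 Hessian [[b^2, 2ab - y], [2ab - y, a^2]]."""
    h11, h12, h22 = b * b, 2 * a * b - y, a * a
    mean = 0.5 * (h11 + h22)
    radius = math.sqrt((0.5 * (h11 - h22)) ** 2 + h12 ** 2)
    return mean - radius, mean + radius

# The origin is a critical point with one negative Hessian eigenvalue.
saddle_grad = grad(0.0, 0.0)
min_eig, max_eig = hessian_eigenvalues(0.0, 0.0)

# Gradient descent from a small perturbation of the saddle (step size 0.1 is an assumption).
a, b = 1e-3, 1e-3
for _ in range(2000):
    ga, gb = grad(a, b)
    a, b = a - 0.1 * ga, b - 0.1 * gb
final_loss = loss(a, b)

print(saddle_grad, (min_eig, max_eig), final_loss)
```

In this toy case the negative-curvature direction at the origin is a = b, which is why a small perturbation along it is enough for plain gradient descent to reach a point with a * b = y.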

“…Besides characterizing local minima, stronger claims on the stationary points can be proved for linear networks. Yun et al [240] and Zou et al [253] present necessary and sufficient conditions for a stationary point to be a global minimum.…”

Section: Global Landscape Analysis Of Deep Network (mentioning)

confidence: 99%

Optimization for deep learning: theory and algorithms

Sun

2019

Preprint

When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and existing theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.
