A continuing mystery in understanding the empirical success of deep neural networks is their ability to achieve zero training error and generalize well, even when the training data is noisy and there are more parameters than data points. We investigate this overparameterized regime in linear regression, where all solutions that minimize training error interpolate the data, including noise. We characterize the fundamental generalization (mean-squared) error of any interpolating solution in the presence of noise, and show that this error decays to zero with the number of features. Thus, overparameterization can be explicitly beneficial in ensuring harmless interpolation of noise. We discuss two root causes for poor generalization that are complementary in nature: signal "bleeding" into a large number of alias features, and overfitting of noise by parsimonious feature selectors. For the sparse linear model with noise, we provide a hybrid interpolating scheme that mitigates both these issues and achieves order-optimal MSE over all possible interpolating solutions.

2. We provide a Fourier-theoretic interpretation of concurrent analyses [6-10] of the minimum ℓ2-norm interpolator.
3. We show (Theorem 2) that parsimonious interpolators (like the ℓ1-minimizing interpolator and its relatives) suffer the complementary problem of overfitting pure noise.
4. We construct two-step hybrid interpolators that successfully recover signal and harmlessly fit noise, achieving the order-optimal rate of test MSE among all interpolators (Proposition 1 and all its corollaries).
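As a rough numerical illustration of the two interpolators contrasted above, the sketch below fits both the minimum ℓ2-norm interpolator and the ℓ1-minimizing interpolator (basis pursuit, written as a linear program) to a noisy k-sparse linear model with more features than samples. The dimensions, sparsity level, and noise level are arbitrary assumed values, and this is not the paper's own experimental setup.

```python
# Sketch: min-l2-norm vs. l1-minimizing interpolators in an overparameterized
# sparse linear model y = X w* + noise, with d >> n. Assumed toy dimensions.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d, k, noise_std = 50, 300, 5, 0.5           # samples, features, sparsity, noise (assumed)

w_star = np.zeros(d)
w_star[:k] = 1.0                                # k-sparse true signal
X = rng.standard_normal((n, d))
y = X @ w_star + noise_std * rng.standard_normal(n)

# Minimum l2-norm interpolator: w = X^T (X X^T)^{-1} y.
w_l2 = X.T @ np.linalg.solve(X @ X.T, y)

# l1-minimizing interpolator (basis pursuit) as an LP over variables [w; u]:
#   minimize sum(u)  subject to  -u <= w <= u,  X w = y.
c = np.concatenate([np.zeros(d), np.ones(d)])
A_ub = np.block([[np.eye(d), -np.eye(d)], [-np.eye(d), -np.eye(d)]])
b_ub = np.zeros(2 * d)
A_eq = np.hstack([X, np.zeros((n, d))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * (2 * d))
w_l1 = res.x[:d]

for name, w in [("min-l2", w_l2), ("min-l1", w_l1)]:
    train_mse = np.mean((X @ w - y) ** 2)       # ~0: both solutions interpolate the data
    excess_mse = np.sum((w - w_star) ** 2)      # equals excess test MSE for isotropic Gaussian x
    print(f"{name}: train MSE {train_mse:.2e}, excess test MSE {excess_mse:.4f}")
```

Both solvers return exact interpolators (training MSE numerically zero); how their excess test MSE scales with n, d, k and the noise level is the subject of the results summarized above.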
Related work

We discuss prior work in three categories: a) overparameterization in deep neural networks, b) interpolation of high-dimensional data using kernels, and c) high-dimensional linear regression. We then recap work on overparameterized linear regression that is concurrent to ours.
Recent interest in overparameterization

Conventional statistical wisdom is that using more parameters in one's model than data points leads to poor generalization. This wisdom is corroborated in theory by worst-case generalization bounds on such overparameterized models following from VC-theory in classification [2] and ill-conditioning in least-squares regression [5]. It is, however, contradicted in practice by the notable recent trend of empirically successful overparameterized deep neural networks. For example, the commonly used CIFAR-10 dataset contains 60,000 images, but the number of parameters in all the neural networks achieving state-of-the-art performance on CIFAR-10 is at least 1.5 million [4]. These neural networks have the ability to memorize pure noise; somehow, they are still able to generalize well when trained with meaningful data. Since the publication of this observation [4, 11], the machine learning community has seen a flurry of activity to attempt to explain this phenomenon, both for classification and regression problems, in neural networks. The problem is challenging for three core reasons:

1. The optimization landscape for l...