2017
DOI: 10.48550/arxiv.1710.10345
Preprint

The Implicit Bias of Gradient Descent on Separable Data

Abstract: We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself.
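To make the claim concrete, here is a minimal numerical sketch (ours, not from the paper; it assumes numpy, scipy, and scikit-learn are available): gradient descent on the unregularized logistic loss over a separable two-class dataset, with the normalized iterate compared against an approximate hard-margin SVM direction and the growing norm printed along the way.

# Minimal sketch (not from the paper): gradient descent on the unregularized
# logistic loss over linearly separable data. The normalized iterate
# w_t / ||w_t|| is compared against the hard-margin SVM direction, and
# ||w_t|| is printed to show its slow growth.
import numpy as np
from scipy.special import expit          # numerically stable sigmoid
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Two well-separated Gaussian clusters in 2D, labels in {-1, +1}.
n = 100
X = np.vstack([rng.normal([+2.0, +2.0], 0.3, size=(n, 2)),
               rng.normal([-2.0, -2.0], 0.3, size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])

# Approximate hard-margin SVM direction (homogeneous predictor, very large C).
svm = LinearSVC(C=1e6, fit_intercept=False, max_iter=100_000).fit(X, y)
w_svm = svm.coef_.ravel()
w_svm /= np.linalg.norm(w_svm)

# Gradient descent on L(w) = mean_i log(1 + exp(-y_i w^T x_i)).
w = np.zeros(2)
step = 0.1
for t in range(1, 200_001):
    margins = y * (X @ w)
    grad = -(X * (y * expit(-margins))[:, None]).mean(axis=0)
    w -= step * grad
    if t in (100, 1_000, 10_000, 100_000, 200_000):
        w_dir = w / np.linalg.norm(w)
        print(f"t={t:7d}  ||w||={np.linalg.norm(w):7.2f}  "
              f"cos(w_t, w_svm)={w_dir @ w_svm:.6f}")

On this data the printed cosine should creep toward 1 while ||w_t|| keeps growing without bound, consistent with the slow, direction-only convergence described in the abstract.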

Cited by 26 publications (49 citation statements)
References 3 publications
“…(III) Implicit bias of SGD: Numerous empirical results have already shown that RNNs trained by stochastic gradient descent (SGD) algorithms have superior generalization performance. There have been a few theoretical results showing that SGD tends to yield low-complexity models, which can generalize (Neyshabur et al., 2014, 2015; Zhang et al., 2016; Soudry et al., 2017). Can we extend this argument to RNNs?…”
Section: Extensions to MGU, LSTM and Conv RNNs (mentioning)
confidence: 99%
“…One study shows that the logistic regression model and 2-layer neural networks using monotone decreasing loss functions tend to converge in the direction of the max-margin solution when using GD and SGD (Soudry et al., 2017). We further extend this conclusion with studies on more practical deep learning systems.…”
Section: Related Work (mentioning)
confidence: 67%
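For the homogeneous linear case this statement refers to, the convergence-in-direction claim can be written as follows (our notation, a sketch of the standard hard-margin formulation rather than a verbatim statement from the paper):

\[
  \lim_{t \to \infty} \frac{w(t)}{\lVert w(t) \rVert}
  = \frac{\hat{w}}{\lVert \hat{w} \rVert},
  \qquad
  \hat{w} = \operatorname*{arg\,min}_{w} \lVert w \rVert^{2}
  \quad \text{s.t.} \quad y_i\, w^{\top} x_i \ge 1 \;\; \forall i,
\]

where w(t) are the gradient-descent iterates on an unregularized, monotone decreasing loss (e.g., logistic) over a linearly separable dataset {(x_i, y_i)}.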
“…As deep neural networks remain mysterious in many ways, many researchers have tried to reveal their inner logic starting from shallow models (Mianjy et al., 2018; Soudry et al., 2017; Gunasekar et al., 2017). It is useful to appeal to the simple case of shallow neural network models to see if there are parallel insights that can help us understand generalization better before we move on to the deep learning systems in the next section.…”
Section: Shallow Neural Network Experiments (mentioning)
confidence: 99%
“…This is in contrast to the SVM method which can be used to find a particularly good (i.e., large margin) linear separator. The behavior of computational schemes for LR when the dataset is separable is not so well understood in theory, though there is recent work on first-order methods [16], [31], [18], [14], [19]. One of the main goals of this paper is to formalize the "informal" computational and statistical intuitions regarding logistic regression and to provide formal results that validate (or run counter to) such intuitive statements.…”
Section: Introduction (mentioning)
confidence: 99%
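The computational subtlety alluded to here is that, on separable data, the logistic loss has an infimum of zero that is never attained (a standard observation, restated in our own notation rather than quoted from the cited works):

\[
  \mathcal{L}(w) = \frac{1}{n} \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i w^{\top} x_i}\bigr),
  \qquad \inf_{w} \mathcal{L}(w) = 0,
\]

since for any strict separator w (with y_i w^T x_i > 0 for all i), L(c w) tends to 0 as c grows. First-order iterates therefore diverge in norm, and only their direction can converge, which is exactly the regime analyzed in the paper above.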