2021
DOI: 10.1007/s10994-021-06056-w

Understanding generalization error of SGD in nonconvex optimization

Abstract: The success of deep learning has led to a rising interest in the generalization property of the stochastic gradient descent (SGD) method, and stability is one popular approach to study it. Existing generalization bounds based on stability do not incorporate the interplay between the optimization of SGD and the underlying data distribution, and hence cannot even capture the effect of randomized labels on the generalization performance. In this paper, we establish generalization error bounds for SGD by character…
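
As background for the stability approach the abstract refers to, the standard uniform-stability route to a generalization bound (in the spirit of Bousquet & Elisseeff and Hardt et al.; the notation below is ours, not the paper's) can be sketched as:

\[
\sup_{S \simeq S'}\,\sup_{z}\ \bigl| f(A(S), z) - f(A(S'), z) \bigr| \ \le\ \epsilon_{\mathrm{stab}}
\quad\Longrightarrow\quad
\mathbb{E}_{S,A}\!\bigl[ R(A(S)) - \hat{R}_S(A(S)) \bigr] \ \le\ \epsilon_{\mathrm{stab}},
\]

where \(S \simeq S'\) denotes datasets differing in a single example, \(R\) is the population risk, and \(\hat{R}_S\) the empirical risk. Per the abstract, the paper's aim is to make such bounds sensitive to the underlying data distribution rather than worst-case over neighboring datasets.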

Cited by 11 publications (14 citation statements)
References 24 publications

Citation statements (ordered by relevance):
“…$t$, and $\nabla f_{w,J_t} \triangleq \frac{1}{m}\sum_{z \in J_t} \nabla f(w, z)$ (50) the iterate stability error of the mini-batch ZoSS $G_{J_t}(w) - G'_{J'_t}(w')$ at time $t$ and, similarly to (5), we show that $G_{J_t}(w) - G'\ldots$”
Section: Lemma 12 (Mini-batch SGD Growth Recursion) Fix Arbitrary Seq… [supporting]
confidence: 58%
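
The quoted statement tracks how the gap between SGD iterates on two neighboring datasets evolves step by step. A minimal illustrative sketch of that bookkeeping, using a toy least-squares loss and plain mini-batch SGD rather than the cited ZoSS construction (the function name, loss, and hyperparameters here are our own assumptions), is:

```python
import numpy as np

def sgd_iterate_gap(S, S_prime, T=100, lr=0.01, batch_size=8, seed=0):
    """Track ||w_t - w_t'|| for two SGD runs on datasets differing in one sample.

    Illustrative only: toy least-squares loss, shared mini-batch indices J_t,
    and a common initialization, mirroring the 'iterate stability error at
    time t' idea in the quoted statement (not the ZoSS construction itself).
    """
    rng = np.random.default_rng(seed)
    X, y = S
    Xp, yp = S_prime
    n, d = X.shape
    w, wp = np.zeros(d), np.zeros(d)
    gaps = []
    for _ in range(T):
        idx = rng.choice(n, size=batch_size, replace=False)  # same J_t for both runs
        # mini-batch gradients of the squared loss, averaged over the batch
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        gp = Xp[idx].T @ (Xp[idx] @ wp - yp[idx]) / batch_size
        w, wp = w - lr * g, wp - lr * gp
        gaps.append(np.linalg.norm(w - wp))
    return gaps

# Example: neighboring datasets that differ only in the last example.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5)); y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=64)
Xp, yp = X.copy(), y.copy()
Xp[-1], yp[-1] = rng.normal(size=5), rng.normal()
print(sgd_iterate_gap((X, y), (Xp, yp))[-1])  # iterate gap after T steps
```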
“…More recent works develop alternative generalization error bounds based on high-probability analysis [38][39][40][41] and data-dependent variants [42], or under different assumptions than those of prior works, such as strongly quasi-convex [43], non-smooth convex [44][45][46][47], and pairwise losses [48,49]. In the nonconvex case, [50] provide bounds that involve the on-average variance of the stochastic gradients. The generalization performance of other algorithmic variants has lately gained further attention, including SGD with early momentum [51], randomized coordinate descent [52], look-ahead approaches [53], noise injection methods [54], and stochastic gradient Langevin dynamics [55][56][57][58][59][60][61][62].…”
Section: Introduction [mentioning]
confidence: 99%
“…More recent works develop alternative generalization error bounds based on high-probability analysis [7][8][9][10] and data-dependent variants [11], or under weaker assumptions such as strongly quasi-convex [12], non-smooth convex [13][14][15][16], and pairwise losses [17,18]. In the nonconvex case, [19] provide bounds that involve the on-average variance of the stochastic gradients.…”
Section: Introduction [mentioning]
confidence: 99%
“…London (2017) combined stability bounds with the PAC-Bayesian approach, which will be discussed later. Zhou et al. (2019) proved data-dependent stability bounds that apply to SGD with multiple passes over the data. However, their bound increases with training time (although logarithmically rather than polynomially as in Hardt et al. (2016) and Mou et al. (2018)), contradicting the empirical observation that generalization error appears to plateau with training time (Hoffer et al., 2017).…”
Section: Stability-based Bounds [mentioning]
confidence: 99%
“…The data-dependent stability bound in Zhou et al. (2019) correlated with the true error when varying the amount of label corruption on three different datasets. Li et al. (…
[mentioning]
confidence: 99%