Lower Bounds for Non-Convex Stochastic Optimization

Arjevani, Yossi; Carmon, Yair; Duchi, John C.; Foster, Dylan J.; Srebro, Nathan; Woodworth, Blake

doi:10.48550/arxiv.1912.02365

Cited by 72 publications

(170 citation statements)

References 22 publications

Citation types: Supporting, Mentioning (161), Contrasting


“…It is the first analysis of stochastic algorithms for NC-PL minimax problems. The dependency on ε is optimal, because the lower complexity bound of Ω(ε^{-4}) for stochastic nonconvex optimization [Arjevani et al, 2019] still holds when considering f(x, y) = F(x) for some nonconvex function F(x). Even under the strictly stronger assumption of imposing strong-concavity in y, to the best of our knowledge, it is the first time that vanilla stochastic GDA-type algorithm is showed to achieve O(ε^{-4}) sample complexity without either increasing batch size as in [Lin et al, 2020a] or Lipschitz continuity of f(·, y) and its Hessian as in [Chen et al, 2021b].…”

Section: Notations (mentioning)

confidence: 99%

Faster Single-loop Algorithms for Minimax Optimization without Strong Concavity

Yang, Orvieto, Lucchi et al. 2021

Preprint


Gradient descent ascent (GDA), the simplest single-loop algorithm for nonconvex minimax optimization, is widely used in practical applications such as generative adversarial networks (GANs) and adversarial training. Despite its desirable simplicity, recent work shows inferior convergence rates of GDA in theory even assuming strong concavity of the objective on one side. This paper establishes new convergence results for two alternative single-loop algorithms, alternating GDA and smoothed GDA, under the mild assumption that the objective satisfies the Polyak-Lojasiewicz (PL) condition about one variable. We prove that, to find an ε-stationary point, (i) alternating GDA and its stochastic variant (without mini batch) respectively require O(κ^2 ε^{-2}) and O(κ^4 ε^{-4}) iterations, while (ii) smoothed GDA and its stochastic variant (without mini batch) respectively require O(κ ε^{-2}) and O(κ^2 ε^{-4}) iterations. The latter greatly improves over the vanilla GDA and gives the hitherto best known complexity results among single-loop algorithms under similar settings. We further showcase the empirical efficiency of these algorithms in training GANs and robust nonlinear regression.
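To make the alternating update concrete, here is a minimal sketch of deterministic alternating GDA on a toy objective. The objective f(x, y) = x*y - 0.5*y^2, the step sizes `tau_x` and `tau_y`, and all function names are assumptions chosen for illustration; they are not taken from the paper. The toy is strongly concave in y, which implies the PL condition in y.

```python
def grad_x(x, y):
    # Partial derivative in x of the toy objective f(x, y) = x*y - 0.5*y**2.
    return y


def grad_y(x, y):
    # Partial derivative in y of the same toy objective.
    return x - y


def alternating_gda(x, y, steps=500, tau_x=0.05, tau_y=0.5):
    """Alternating GDA: the ascent step on y uses the freshly updated x."""
    for _ in range(steps):
        x = x - tau_x * grad_x(x, y)   # descent step on x
        y = y + tau_y * grad_y(x, y)   # ascent step on y, evaluated at the new x
    return x, y


if __name__ == "__main__":
    x_final, y_final = alternating_gda(1.0, -1.0)
    print(x_final, y_final)
```

The only difference from simultaneous GDA is that `grad_y` is evaluated at the updated `x`. On this toy the iterates contract toward the stationary point (0, 0).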


“…In our case, N remains constant since B_k benefits from warm-start. The faster rates of MRBO/VRBO (Yang et al, 2021) are obtained under the additional mean-squared smoothness assumption (Arjevani et al, 2019), which we do not investigate in the present work. Such assumption allows to achieve the improved complexity of O(ε^{-3/2} log(ε^{-1})).…”

Section: Complexity Analysis (mentioning)

confidence: 74%

“…The dependence on κ_L and κ_g for TTSA and AccBio are derived in Proposition 11 of Appendix A.4. The rate of MRBO/VRBO is obtained under the additional mean-squared smoothness assumption (Arjevani et al, 2019).…”

Section: General Setting and Main Assumptions (mentioning)

confidence: 99%

Amortized Implicit Differentiation for Stochastic Bilevel Optimization

Arbel, Mairal 2021

Preprint


We study a class of algorithms for solving bilevel optimization problems in both stochastic and deterministic settings when the inner-level objective is strongly convex. Specifically, we consider algorithms based on inexact implicit differentiation and we exploit a warm-start strategy to amortize the estimation of the exact gradient. We then introduce a unified theoretical framework inspired by the study of singularly perturbed systems (Habets, 1974) to analyze such amortized algorithms. By using this framework, our analysis shows these algorithms to match the computational complexity of oracle methods that have access to an unbiased estimate of the gradient, thus outperforming many existing results for bilevel optimization. We illustrate these findings on synthetic experiments and demonstrate the efficiency of these algorithms on hyper-parameter optimization experiments involving several thousands of variables.


“…While this rate is optimal in the general case, it is known that one can obtain an improved rate of O(1/T^{1/3}) if the objective is an expectation over smooth losses [Fang et al, 2018, Zhou et al, 2018, Cutkosky and Orabona, 2019, Tran-Dinh et al, 2019]. Besides, this rate was recently shown to be tight [Arjevani et al, 2019].…”

Section: Introduction (mentioning)

confidence: 94%

“…In the context of stochastic non-convex optimization with general smooth losses, it was shown in Ghadimi and Lan [2013] that SGD with an appropriately selected learning rate can obtain a rate of O(1/T^{1/4}) for finding an approximate stationary point, which is known to match the respective lower bound [Arjevani et al, 2019]. While the method of Ghadimi and Lan [2013] requires knowledge of the smoothness and variance parameters, recent works have shown that adaptive methods like AdaGrad are able to obtain this bound in a parameter free manner, as well as to adapt to the variance of the problem [Li and Orabona, 2019, Ward et al, 2019, Reddi et al, 2018].…”

Section: Related Work (mentioning)

confidence: 99%
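The citation statements on this page quote both convergence rates in T and sample complexities in ε; the two are related by a short inversion. The block below is a sketch that assumes the rates bound the expected gradient norm (not its square) of the returned point, with C and C' standing for unspecified problem-dependent constants.

```latex
\begin{align*}
\mathbb{E}\,\|\nabla F(\hat{x}_T)\| \le \frac{C}{T^{1/4}}
  &\;\Longrightarrow\;
  \mathbb{E}\,\|\nabla F(\hat{x}_T)\| \le \epsilon
  \text{ whenever } T \ge \left(\frac{C}{\epsilon}\right)^{4} = O(\epsilon^{-4}), \\
\mathbb{E}\,\|\nabla F(\hat{x}_T)\| \le \frac{C'}{T^{1/3}}
  &\;\Longrightarrow\;
  \mathbb{E}\,\|\nabla F(\hat{x}_T)\| \le \epsilon
  \text{ whenever } T \ge \left(\frac{C'}{\epsilon}\right)^{3} = O(\epsilon^{-3}).
\end{align*}
```

Under this reading, the O(1/T^{1/4}) rate corresponds to the ε^{-4} complexity and the O(1/T^{1/3}) rate to ε^{-3}.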

STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization

Levy, Kavis, Cevher 2021

Preprint


In this work we investigate stochastic non-convex optimization problems where the objective is an expectation over smooth loss functions, and the goal is to find an approximate stationary point. The most popular approach to handling such problems is variance reduction techniques, which are also known to obtain tight convergence rates, matching the lower bounds in this case. Nevertheless, these techniques require a careful maintenance of anchor points in conjunction with appropriately selected "mega-batchsizes". This leads to a challenging hyperparameter tuning problem, that weakens their practicality. Recently, [Cutkosky and Orabona, 2019] have shown that one can employ recursive momentum in order to avoid the use of anchor points and large batchsizes, and still obtain the optimal rate for this setting. Yet, their method called STORM crucially relies on the knowledge of the smoothness, as well as a bound on the gradient norms. In this work we propose STORM+, a new method that is completely parameter-free, does not require large batch-sizes, and obtains the optimal O(1/T^{1/3}) rate for finding an approximate stationary point. Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
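The recursive momentum idea mentioned in this abstract can be sketched as follows. This is an illustrative simplification, not the STORM or STORM+ algorithm as published: it uses a constant learning rate `lr` and a constant momentum parameter `a`, whereas those methods set both adaptively. The toy loss, the noise model, and all names are assumptions made for the example.

```python
import numpy as np


def stoch_grad(x, xi):
    # Gradient of the toy loss f(x, xi) = (1 + xi) * sum(x**2 / (1 + x**2)).
    # With E[xi] = 0 this is an unbiased estimate of true_grad(x).
    return (1.0 + xi) * 2.0 * x / (1.0 + x**2) ** 2


def true_grad(x):
    # Gradient of the (non-convex) expected objective F(x) = sum(x**2 / (1 + x**2)).
    return 2.0 * x / (1.0 + x**2) ** 2


def recursive_momentum_sgd(x0, steps=2000, lr=0.05, a=0.1, noise=0.3, seed=0):
    """Recursive-momentum estimator with constant lr and a (a simplifying assumption)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = stoch_grad(x, noise * rng.standard_normal())  # initial gradient estimate
    for _ in range(steps):
        x_prev, x = x, x - lr * d
        xi = noise * rng.standard_normal()
        # Key step: the SAME sample xi is evaluated at both the new and the old
        # iterate, so the correction term cancels much of the sampling noise.
        d = stoch_grad(x, xi) + (1.0 - a) * (d - stoch_grad(x_prev, xi))
    return x


if __name__ == "__main__":
    x_final = recursive_momentum_sgd([1.5, -0.8])
    print(np.linalg.norm(true_grad(x_final)))
```

No anchor points or large batches appear: each iteration draws one sample and queries the stochastic gradient twice with it.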

