Yair Carmon scite author profile

We prove impossibility results for adaptivity in non-smooth stochastic convex optimization. Given a set of problem parameters we wish to adapt to, we define a "price of adaptivity" (PoA) that, roughly speaking, measures the multiplicative increase in suboptimality due to uncertainty in these parameters. When the initial distance to the optimum is unknown but a gradient norm bound is known, we show that the PoA is at least logarithmic for expected suboptimality, and double-logarithmic for median suboptimality. When there is uncertainty in both distance and gradient norm, we show that the PoA must be polynomial in the level of uncertainty. Our lower bounds nearly match existing upper bounds, and establish that there is no parameter-free lunch.

Lower Bounds for Non-Convex Stochastic Optimization

Arjevani, Carmon, Duchi et al. 2019

Preprint

160

We lower bound the complexity of finding ε-stationary points (with gradient norm at most ε) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least ε^{-4} queries to find an ε-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of ε^{-3} queries, establishing the optimality of recently proposed variance reduction techniques.

Lower bounds for finding stationary points I

et al. 2019

We prove lower bounds on the complexity of finding ε-stationary points (points x such that ‖∇f(x)‖ ≤ ε) of smooth, high-dimensional, and potentially non-convex functions f. We consider oracle-based complexity measures, where an algorithm is given access to the value and all derivatives of f at a query point x. We show that for any (potentially randomized) algorithm A, there exists a function f with Lipschitz pth order derivatives such that A requires at least ε^{-(p+1)/p} queries to find an ε-stationary point. Our lower bounds are sharp to within constants, and they show that gradient descent, cubic-regularized Newton's method, and generalized pth order regularization are worst-case optimal within their natural function classes.

No bad local minima: Data independent training error guarantees for multilayer neural networks

Soudry, Carmon 2016

Preprint

115

We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing about the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.

Gradient Descent Finds the Cubic-Regularized Nonconvex Newton Step

Carmon, Duchi 2019

SIAM J. Optim.

We consider the minimization of non-convex quadratic forms regularized by a cubic term, which exhibit multiple saddle points and poor local minima. Nonetheless, we prove that, under mild assumptions, gradient descent approximates the global minimum to within ε accuracy in O(ε^{-1} log(1/ε)) steps for large ε and O(log(1/ε)) steps for small ε (compared to a condition number we define), with at most logarithmic dependence on the problem dimension. When we use gradient descent to approximate the Nesterov-Polyak cubic-regularized Newton step, our result implies a rate of convergence to second-order stationary points of general smooth non-convex functions.
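As a rough illustration of the setting in this abstract, the sketch below runs plain gradient descent on a small cubic-regularized non-convex quadratic, f(x) = 0.5 xᵀAx + bᵀx + (ρ/3)‖x‖³. The matrix A, vector b, value of ρ, step size, iteration count, and zero initialization are all assumptions chosen for the example, not values taken from the paper, and the code is not the authors' implementation.

```python
import math

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def objective(A, b, rho, x):
    """f(x) = 0.5 x^T A x + b^T x + (rho / 3) * ||x||^3."""
    n = len(x)
    quad = 0.5 * sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * x[i] for i in range(n))
    return quad + lin + (rho / 3.0) * norm(x) ** 3

def gradient(A, b, rho, x):
    """grad f(x) = A x + b + rho * ||x|| * x."""
    n = len(x)
    r = norm(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + b[i] + rho * r * x[i]
            for i in range(n)]

def gradient_descent(A, b, rho, step, iters):
    # Start at the origin (an assumption made for this illustration).
    x = [0.0] * len(b)
    for _ in range(iters):
        g = gradient(A, b, rho, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical 2-D instance: A has eigenvalues -1 and 2, so the quadratic
# part is non-convex; b has a nonzero component along the negative
# eigendirection.
A = [[-1.0, 0.0], [0.0, 2.0]]
b = [0.5, -1.0]
rho = 1.0

x = gradient_descent(A, b, rho, step=0.05, iters=5000)
grad_norm = norm(gradient(A, b, rho, x))
```

On this instance the final iterate can be checked against the standard optimality condition for this problem class: a stationary point is a global minimizer when A + ρ‖x‖I is positive semidefinite, which here means ρ‖x‖ ≥ 1.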

Yair Carmon

Accelerated Methods for NonConvex Optimization
