Nesterov's momentum trick is famously known for accelerating gradient descent, and has proved useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist that prevent Nesterov's momentum from providing similar acceleration, even when the underlying problem is convex. We introduce Katyusha, a direct, primal-only stochastic gradient method that fixes this issue. It has a provably accelerated convergence rate for convex (offline) stochastic optimization. The main ingredient is Katyusha momentum, a novel "negative momentum" on top of Nesterov's momentum. It can be incorporated into a variance-reduction-based algorithm and speed it up, both in terms of sequential and parallel performance. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each such case, one could potentially try to give Katyusha a hug. * We would like to specially thank Shai Shalev-Shwartz for useful feedback and suggestions on this paper, thank Blake Woodworth and Nati Srebro for a pointer to their paper [49], thank Guanghui Lan for correcting our citation of [16], thank Weston Jackson, Xu Chen and Zhe Li for verifying the proofs and correcting typos, and thank anonymous reviewers for a number of writing suggestions.
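To make the coupling concrete, here is a minimal sketch of a Katyusha-style loop for a smooth, strongly convex finite-sum least-squares objective. It uses the Euclidean setting, a plain (unweighted) snapshot average, and illustrative parameter choices rather than the tuned schedule from the paper; the function name and signature are our own.

```python
import numpy as np

def katyusha_sketch(A, b, L, sigma, epochs=20):
    """Minimal Katyusha-style loop for f(x) = 1/(2n) ||Ax - b||^2.

    Illustrative only: Euclidean mirror step, plain snapshot averaging, and
    simple parameter choices (sigma > 0 is a strong-convexity estimate,
    L a smoothness estimate); not the exact schedule from the paper.
    """
    n, d = A.shape
    m = 2 * n                          # inner-loop (epoch) length
    tau2 = 0.5
    tau1 = min(np.sqrt(m * sigma / (3 * L)), 0.5)
    alpha = 1.0 / (3 * tau1 * L)

    x_tilde = np.zeros(d)              # snapshot point
    y = np.zeros(d)                    # "gradient descent" sequence
    z = np.zeros(d)                    # "mirror descent" sequence

    def grad_i(x, i):                  # stochastic gradient of one component
        return A[i] * (A[i] @ x - b[i])

    for _ in range(epochs):
        mu = A.T @ (A @ x_tilde - b) / n          # full gradient at snapshot
        y_sum = np.zeros(d)
        for _ in range(m):
            # Linear coupling: Nesterov-style extrapolation plus the Katyusha
            # "negative momentum" term tau2 * x_tilde pulling back to the snapshot.
            x = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
            i = np.random.randint(n)
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu    # variance-reduced gradient
            z = z - alpha * g                             # mirror (here Euclidean) step
            y = x - g / (3 * L)                           # gradient step
            y_sum += y
        x_tilde = y_sum / m            # new snapshot (plain average for simplicity)
    return x_tilde
```

The tau2 * x_tilde term is the Katyusha momentum: it acts as a magnet pulling each iterate back toward the snapshot point, where the variance-reduced gradient estimator is most accurate, which is what keeps Nesterov-style extrapolation safe in the stochastic setting.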
We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time that scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm for finding an approximate local minimum is even faster than that of gradient descent for finding a critical point. Our algorithm applies to a general class of optimization problems, including training a neural network and other non-convex objectives arising in machine learning.
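For reference, approximate local minima of the kind referred to above are usually defined by a small gradient norm together with an almost positive-semidefinite Hessian. The check below spells out that condition; the tolerances eps_g and eps_h are illustrative, and the algorithm itself certifies such a point without ever forming the full Hessian (which would already cost more than linear time in the dimension).

```python
import numpy as np

def is_approx_local_min(grad, hess, eps_g=1e-3, eps_h=1e-2):
    """Check the usual (eps_g, eps_h)-approximate local minimum condition:
    ||grad f(x)|| <= eps_g and lambda_min(hess f(x)) >= -eps_h.
    Tolerances are illustrative; forming the dense Hessian here is only
    for exposition."""
    small_gradient = np.linalg.norm(grad) <= eps_g
    no_strong_negative_curvature = np.linalg.eigvalsh(hess).min() >= -eps_h
    return small_gradient and no_strong_negative_curvature
```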
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why don't trained neural networks overfit when they are overparameterized (namely, having more parameters than statistically needed to overfit the training data)? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be done simply by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the overparameterized network. * V1 appears on this date; V2/V3/V4/V5 polish writing and parameters, and V5 adds experiments. Authors are sorted in alphabetical order. We would like to thank Greg Yang and Sebastien Bubeck for many enlightening conversations.
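The setting described above can be sketched as follows: labels are produced by a small two-layer network with a smooth activation, and a heavily overparameterized two-layer learner is trained on them with plain SGD. All sizes, the learning rate, and the choice to keep the output layer fixed are illustrative, not the construction from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_target, m_learner, n = 20, 5, 1000, 1000   # learner is heavily overparameterized

# Ground truth: a small two-layer network with a smooth activation.
W_star = rng.standard_normal((m_target, d)) / np.sqrt(d)
a_star = rng.standard_normal(m_target) / np.sqrt(m_target)
act = np.tanh                                    # smooth activation

X = rng.standard_normal((n, d))
y = act(X @ W_star.T) @ a_star                   # labels from the small target net

# Overparameterized learner, trained with plain SGD on the squared loss.
W = rng.standard_normal((m_learner, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], m_learner) / np.sqrt(m_learner)  # fixed output layer
lr = 0.1
for step in range(10000):
    i = rng.integers(n)
    h = act(W @ X[i])                            # hidden activations on one sample
    err = h @ a - y[i]                           # prediction error
    # Gradient of 0.5 * err^2 w.r.t. W; tanh'(z) = 1 - tanh(z)^2.
    grad_W = err * np.outer(a * (1 - h**2), X[i])
    W -= lr * grad_W

print("train MSE:", np.mean((act(X @ W.T) @ a - y) ** 2))
```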
We study the design of nearly-linear-time algorithms for approximately solving positive linear programs. Both the parallel and the sequential deterministic versions of these algorithms require O(ε^{-4}) iterations, a dependence that has not been improved since the introduction of these methods in 1993 by Luby and Nisan. Moreover, previous algorithms and their analyses rely on update steps and convergence arguments that are combinatorial in nature and do not seem to arise naturally from an optimization viewpoint. In this paper, we leverage insights from optimization theory to construct a novel algorithm that breaks the longstanding O(ε^{-4}) barrier. Our algorithm has a simple analysis and a clear motivation. Our work introduces a number of novel techniques, such as the combined application of gradient descent and mirror descent, and a truncated, smoothed version of the standard multiplicative weight update, which may be of independent interest.
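As a rough illustration of the truncation idea, the step below clips the per-coordinate feedback before the exponential update, so no weight can change by more than a constant factor per iteration. The function name, step size, and clipping threshold are our own placeholders; the paper's actual update and its smoothing are more delicate.

```python
import numpy as np

def truncated_mwu_step(x, feedback, step=0.1, clip=1.0):
    """One multiplicative-weight-style update on positive variables x.

    The per-coordinate feedback is truncated to [-clip, clip] before the
    exponential update, so each coordinate moves by at most a constant
    factor per step. This only mirrors the 'truncated' flavor of the update
    described in the abstract; the exact rule in the paper differs.
    """
    truncated = np.clip(feedback, -clip, clip)
    return x * np.exp(-step * truncated)
```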
How does a 110-layer ResNet learn a high-complexity classifier from relatively few training examples and with a short training time? We present a theory toward explaining this in terms of hierarchical learning. By hierarchical learning, we mean that the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions, to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically, simply by applying stochastic gradient descent (SGD) to the training objective. On the conceptual side, we present, to the best of our knowledge, the first theoretical result indicating how very deep neural networks can still be sample- and time-efficient on certain hierarchical learning tasks, when no known non-hierarchical algorithm (such as kernel methods, linear regression over feature mappings, tensor decomposition, sparse coding, or their simple combinations) is efficient. We establish a new principle called "backward feature correction", which we believe is the key to understanding hierarchical learning in multi-layer neural networks. On the technical side, we show that for regression, and even for binary classification, for every input dimension d > 0 there is a concept class consisting of degree-ω(1) multivariate polynomials such that, using ω(1)-layer neural networks as learners, SGD can learn any target function from this class in poly(d) time using poly(d) samples to any 1/poly(d) regression or classification error, by learning to represent it as a composition of ω(1) layers of quadratic functions. In contrast, we present lower bounds stating that several non-hierarchical learners, including any kernel method and neural tangent kernels, must suffer super-polynomial d^{ω(1)} sample or time complexity to learn functions in this concept class even to any d^{-0.01} error.
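The concept class at the end of the abstract can be pictured as a target function built by composing ω(1) layers of quadratic maps, so that its total degree grows with depth even though each layer is simple. The sketch below only illustrates that structure; the dimensions, depth, and scaling are placeholders, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth, width = 30, 4, 8         # illustrative sizes; 'depth' plays the role of omega(1)

# One random quadratic map per layer: each output coordinate is a quadratic
# form of the previous layer's output.  Scaling by the input dimension keeps
# values from blowing up without changing the polynomial structure.
dims = [d] + [width] * depth
layers = [rng.standard_normal((dims[l + 1], dims[l], dims[l])) / dims[l]
          for l in range(depth)]

def target(x):
    """A degree-2^depth polynomial of x, written as a composition of quadratics."""
    h = x
    for Q in layers:
        h = np.einsum('kij,i,j->k', Q, h, h)   # h_new[k] = h^T Q[k] h
    return h.sum()

print(target(rng.standard_normal(d)))
```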