Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing 2017
DOI: 10.1145/3055399.3055448
Katyusha: the first direct acceleration of stochastic gradient methods

Abstract: Nesterov's momentum trick is famously known for accelerating gradient descent, and has been proven useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex. We introduce Katyusha, a direct, primal-only stochastic gradient method to fix this issue. It has a provably accelerated convergence rate in convex (off-line) stochastic optimization. The main ingredien…
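As a rough illustration of the style of update the abstract describes (a variance-reduced stochastic gradient step coupled with both Nesterov-style and "negative" momentum toward a snapshot point), here is a minimal Python sketch. The parameter choices, snapshot rule, and function names (`grad_i`, `full_grad`) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def katyusha_sketch(grad_i, full_grad, x0, n, L, sigma, epochs=20, m=None):
    """Sketch of a Katyusha-style accelerated variance-reduced loop (no proximal term).

    grad_i(x, i)  -- gradient of the i-th component function at x
    full_grad(x)  -- full gradient, computed once per epoch (snapshot)
    L, sigma      -- smoothness / strong-convexity constants (assumed known)
    """
    m = m or 2 * n                               # inner-loop length per epoch
    tau2 = 0.5                                   # weight of the "negative momentum" toward the snapshot
    tau1 = min(np.sqrt(m * sigma / (3 * L)), 0.5)
    alpha = 1.0 / (3.0 * tau1 * L)               # step size of the mirror-descent sequence

    x_tilde = y = z = x0.copy()
    for _ in range(epochs):
        mu = full_grad(x_tilde)                  # snapshot full gradient for variance reduction
        for _ in range(m):
            # three-point coupling: Nesterov momentum via z, negative momentum via x_tilde
            x = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
            i = np.random.randint(n)
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu   # variance-reduced stochastic gradient
            z = z - alpha * g                    # mirror-descent-style step
            y = x - g / (3 * L)                  # gradient-descent-style step
        x_tilde = y                              # simplified snapshot update
    return x_tilde
```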

Cited by 268 publications (723 citation statements); references 21 publications.
“…This results in an improved computational complexity of O((n + κ) log(1/ε)) passes over the data set to achieve an ε-optimal solution in expectation. When these methods are combined with Nesterov acceleration, the expected complexity becomes O((n + √(nκ)) log(1/ε)) (see, e.g., [7] and [1]).…”
Section: Stochastic Optimization of Least Squares
confidence: 99%
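The O((n + κ) log(1/ε)) rate in this excerpt is the one achieved by plain variance-reduced methods such as SVRG, before any acceleration is applied. A minimal sketch of such a method is below; the names (`grad_i`, `full_grad`) and the defaults are illustrative assumptions.

```python
import numpy as np

def svrg_sketch(grad_i, full_grad, x0, n, step, epochs=20, m=None):
    """Minimal SVRG-style sketch: each epoch uses one full gradient pass plus m
    cheap variance-reduced stochastic steps."""
    m = m or 2 * n
    x = x_tilde = x0.copy()
    for _ in range(epochs):
        mu = full_grad(x_tilde)                          # one full pass per epoch
        for _ in range(m):
            i = np.random.randint(n)
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu   # unbiased, reduced-variance estimate
            x = x - step * g
        x_tilde = x                                      # refresh the snapshot point
    return x
```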
“…In this section, we introduce NAPI, a noisy accelerated power method for solving (1). We then characterize the convergence rate of the proposed algorithm for the special case of computing the leading generalized eigenvector.…”
Section: Computing the Leading Generalized Eigenvector
confidence: 99%
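For context on what an accelerated power method for the leading generalized eigenvector looks like, here is a hedged sketch: a generalized power iteration with a heavy-ball-style momentum term. This is not the NAPI method from the cited paper; the momentum parameter and normalization scheme are illustrative assumptions.

```python
import numpy as np

def momentum_power_method(A, B, iters=200, beta=0.1, seed=0):
    """Sketch: power iteration with momentum for A v = lambda B v (B nonsingular)."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    w_prev = np.zeros(d)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(iters):
        w_next = np.linalg.solve(B, A @ w) - beta * w_prev   # momentum-augmented power step
        scale = np.linalg.norm(w_next)
        w_prev, w = w / scale, w_next / scale                # rescale both iterates together
    lam = (w @ (A @ w)) / (w @ (B @ w))                      # generalized Rayleigh quotient
    return lam, w
```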
“…[14] Recently, Allen-Zhu provided a new momentum, named Katyusha momentum, which achieves strong performance in many settings [3]. An illustration of the difference among SGD, momentum, and Nesterov's momentum is shown in Fig. 3.…”
Section: The Variants of SGD
confidence: 99%
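The distinction this excerpt refers to is easiest to see in the update rules themselves: plain SGD follows the current gradient, heavy-ball momentum accumulates a velocity, and Nesterov's momentum evaluates the gradient at a look-ahead point. A small sketch (step sizes and beta are illustrative):

```python
def sgd_step(x, grad, lr):
    """Plain SGD: move along the gradient at the current point."""
    return x - lr * grad(x)

def momentum_step(x, v, grad, lr, beta=0.9):
    """Heavy-ball momentum: accumulate a velocity, then move along it."""
    v = beta * v + grad(x)
    return x - lr * v, v

def nesterov_step(x, v, grad, lr, beta=0.9):
    """Nesterov's momentum: evaluate the gradient at the look-ahead point."""
    v = beta * v + grad(x - lr * beta * v)
    return x - lr * v, v
```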
“…where η > 0 is the step size. The dual averaging (DA) algorithm [16] is another widely used method for solving (1), which iterates as…”
Section: Introduction
confidence: 99%
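Since the excerpt's DA iteration is truncated, here is a hedged sketch of the standard (unconstrained, Nesterov-style) dual averaging loop: keep a running sum of subgradients and map it back to a primal point with an increasing regularization weight. The scaling constant `gamma` is an illustrative assumption.

```python
import numpy as np

def dual_averaging_sketch(grad, x0, T=1000, gamma=1.0):
    """Unconstrained dual averaging: x_{t+1} = x0 - (sum of gradients) / beta_t."""
    x = x0.copy()
    z = np.zeros_like(x0)               # accumulated (sub)gradients
    for t in range(1, T + 1):
        z += grad(x)
        beta = gamma * np.sqrt(t)       # increasing regularization weight
        x = x0 - z / beta               # primal point from the averaged dual variable
    return x
```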
“…In this paper, we develop a new dual-averaging primal-dual (DAPD) method for solving (1), which has an accelerated, optimal convergence rate. When f(Ax) has a finite-sum structure, we develop a stochastic version of DAPD, named SDAPD, which is also optimal and has better overall complexity on sparse data compared with existing algorithms of the same type.…”
Section: Introduction
confidence: 99%