Finding approximate local minima faster than gradient descent

Agarwal, Naman; Allen-Zhu, Zeyuan; Bullins, Brian; Hazan, Elad; Ma, Tengyu

doi:10.1145/3055399.3055464

Cited by 149 publications

(325 citation statements)

References 19 publications

Supporting

Mentioning

314

Contrasting

Unclassified


“…The idea is to form the Hessian of the objective function from the Hessians of a relatively small number of randomly chosen summands [Ghadimi et al, 2017]. Another idea is to avoid inverting the Hessian at each iteration; instead, it is proposed to use information about the eigenvector corresponding to the smallest eigenvalue [Agarwal et al, 2017; Carmon et al, 2017]. To compute such a vector approximately, it is quite sufficient to be able to multiply the Hessian by an arbitrary vector:…”

Section: Discussion

unclassified

A hypothesis about the rate of global convergence for optimal methods (Newtons type) in smooth convex optimization

Gasnikov

Kovalev

2018

CRM
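
The point made in the quoted statement, that an approximate eigenvector for the smallest Hessian eigenvalue needs only Hessian-vector products, can be illustrated with a minimal sketch. It uses plain power iteration on a shifted matrix, which is a textbook technique and not necessarily the procedure of the cited papers. The helper names `hvp` and `smallest_eigenpair`, the finite-difference step `h`, the argument `bound` (assumed to be an upper bound on the largest Hessian eigenvalue), and the toy quadratic are all assumptions made for this example.

```python
import numpy as np

def hvp(grad, x, v, h=1e-5):
    """Approximate H(x) @ v by a central difference of the gradient.

    The Hessian itself is never formed or stored.
    """
    return (grad(x + h * v) - grad(x - h * v)) / (2.0 * h)

def smallest_eigenpair(grad, x, bound, iters=200):
    """Power iteration on (bound * I - H), using only Hessian-vector products.

    Assumes `bound` is at least the largest eigenvalue of H, so the
    dominant eigenvector of the shifted matrix is the eigenvector of H
    with the smallest eigenvalue.
    """
    v = np.ones_like(x) / np.sqrt(x.size)   # deterministic start vector
    for _ in range(iters):
        w = bound * v - hvp(grad, x, v)     # apply (bound * I - H) to v
        v = w / np.linalg.norm(w)
    lam = v @ hvp(grad, x, v)               # Rayleigh quotient v^T H v
    return lam, v

# Toy example: f(x) = 0.5 * x^T A x, so the Hessian is A everywhere.
A = np.diag([3.0, 1.0, -2.0])
grad = lambda x: A @ x
lam, v = smallest_eigenpair(grad, np.zeros(3), bound=4.0)
```

On this toy problem `lam` approaches the smallest eigenvalue of `A` and `v` aligns with the third coordinate axis. In practice a Lanczos-type method would usually converge faster than power iteration, and automatic differentiation would usually replace the finite difference.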



“…These approaches yield a worst case operational complexity of $O(n\epsilon_g^{-3/2})$ when $\epsilon_H = \epsilon_g^{1/2}$. Two independently proposed algorithms, respectively based on adapting accelerated gradient to the nonconvex setting [11] and approximately solving the cubic regularization subproblem [1], require $\tilde{O}(\epsilon_g^{-7/4})$ operations (with high probability, showing dependency only on $\epsilon_g$) to find a point $x$ that satisfies (7) when $\epsilon_H = \epsilon_g^{1/2}$. The difference of a factor of $\epsilon_g^{-1/4}$ with the iteration complexity bounds arises from the cost of computing a negative curvature direction of $\nabla^2 f(x_k)$ and/or the cost of solving a linear system.…”

Section: Related Work

mentioning

confidence: 99%

“…1 Introduction. We consider the following constrained optimization problem: (1) $\min f(x)$ subject to $x \ge 0$, where $f : \mathbb{R}^n \to \mathbb{R}$ is a nonconvex function, twice uniformly Lipschitz continuously differentiable in the interior of the nonnegative orthant. We assume that explicit storage of the Hessian $\nabla^2 f(x)$ for $x > 0$ is undesirable, but that Hessian-vector products of the form $\nabla^2 f(x)v$ can be computed at any $x > 0$ for arbitrary vectors $v$. Computational differentiation techniques [29] can be used to evaluate such products at a cost that is a small multiple of the cost of evaluation of the gradient $\nabla f$.…”

mentioning

confidence: 99%


A log-barrier Newton-CG method for bound constrained optimization with complexity guarantees

O’Neill

Wright

2020

IMA Journal of Numerical Analysis


We describe an algorithm based on a logarithmic barrier function, Newton's method, and linear conjugate gradients that seeks an approximate minimizer of a smooth function over the nonnegative orthant. We develop a bound on the complexity of the approach, stated in terms of the required accuracy and the cost of a single gradient evaluation of the objective function and/or a matrix-vector multiplication involving the Hessian of the objective. The approach can be implemented without explicit calculation or storage of the Hessian.
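
As a rough illustration of how a Newton-type step can be computed without explicit calculation or storage of the Hessian, the sketch below runs linear conjugate gradients against a matrix-vector callback. It is only the standard matrix-free CG building block for a symmetric positive definite system, not the log-barrier Newton-CG algorithm of the paper, and it omits any handling of negative curvature. The function name `conjugate_gradient`, its parameters, and the small 2×2 example are assumptions made for this example.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve A d = b for symmetric positive definite A.

    Only the map v -> A @ v is required; A is never stored here.
    """
    n = b.size
    max_iter = 10 * n if max_iter is None else max_iter
    d = np.zeros_like(b)
    r = b.copy()            # residual b - A d for the initial guess d = 0
    p = r.copy()            # first search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Toy Newton step: solve H d = -g, where H is touched only through products.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
step = conjugate_gradient(lambda v: H @ v, -g)
```

Here the lambda stands in for a Hessian-vector product routine; in a real solver it would be supplied by automatic differentiation or a problem-specific formula.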


“…Several works study some special non-convex objective functions and find SGD or its variants can be convergence. Besides, some researchers [2] find that in many machine learning problems, the minimal value of local minimum is a good approximation for the global minimum. Moreover, it is not difficult to obtain a local minimum since the quantity of local minimum is significant.…”

Section: Distributed and Non-convex Extension

mentioning

confidence: 99%

The Frontier of SGD and its Variants in Machine Learning

Du

2017

2017 2nd International Conference on Mechatronics and Information Technology (ICMIT 2017)


Abstract. Numerical optimization is a classical field in operation research and computer science, which has been widely used in the areas such as physics and economics. Although, optimization algorithms have achieved great success for plenty of applications, handling the big data in the best fashion possible is a very inspiring and demanding challenge in the artificial intelligence era. Stochastic gradient descent (SGD) is pretty simple but surprisingly, highly effective in machine learning models, such as support vector machine (SVM) and deep neural network (DNN). Theoretically, the performance of SGD for convex optimization is well understood. But, for the non-convex setting, which is very common for the machine learning problems, to obtain the theoretical guarantee for SGD and its variants is still a standing problem. In the paper, we do a survey about the SGD and its variants such as Momentum, ADAM and SVRG, differentiate their algorithms and applications and present some recent breakthrough and open problems.

