We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and regularity of the models, any such algorithm drives a natural stationarity measure to zero at the rate O(k^{-1/4}). As a consequence, we obtain the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps. The guiding principle underlying the complexity guarantees is that all algorithms under consideration can be interpreted as approximate descent methods on an implicit smoothing of the problem, given by the Moreau envelope. Specializing to classical circumstances, we obtain the long-sought convergence rate of the stochastic projected gradient method, without batching, for minimizing a smooth function on a closed convex set.
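As an illustration of the Moreau envelope mentioned above (our sketch, not code from the paper), take the simplest nonsmooth function f(x) = |x|. Its proximal map is soft-thresholding, its Moreau envelope f_λ(x) = min_y { |y| + (1/(2λ))(y − x)² } is smooth, and the gradient identity ∇f_λ(x) = (x − prox_{λf}(x))/λ gives exactly the kind of stationarity measure driven to zero by the algorithms under consideration:

```python
# Moreau envelope of f(x) = |x| and the identity
# grad f_lam(x) = (x - prox_{lam f}(x)) / lam.

def prox_abs(x, lam):
    # Proximal operator of |.|: soft-thresholding.
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def moreau_env(x, lam):
    # f_lam(x) = min_y |y| + (1/(2*lam)) * (y - x)^2, attained at the prox.
    p = prox_abs(x, lam)
    return abs(p) + (p - x) ** 2 / (2 * lam)

def moreau_grad(x, lam):
    # Gradient of the envelope: the scaled proximal step.
    return (x - prox_abs(x, lam)) / lam

# Finite-difference check of the gradient identity.
x, lam, h = 1.7, 0.5, 1e-6
fd = (moreau_env(x + h, lam) - moreau_env(x - h, lam)) / (2 * h)
assert abs(fd - moreau_grad(x, lam)) < 1e-4
```

Note that the envelope gradient vanishes precisely at x = 0, the minimizer of |x|, even though |x| itself is not differentiable there.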
This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science, including all popular deep learning architectures.
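A minimal sketch of the stochastic subgradient method on a nonsmooth function (our toy example, not from the paper): minimize f(x) = |x| given only noisy subgradient evaluations, with step sizes k^{-3/4} satisfying the usual Robbins-Monro conditions (non-summable, square-summable):

```python
import random

def subgrad_abs(x):
    # A subgradient of |x| (we pick 0 at the kink).
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

random.seed(0)
x = 5.0
for k in range(1, 20001):
    g = subgrad_abs(x) + random.gauss(0.0, 0.1)  # noisy subgradient oracle
    x -= (1.0 / k ** 0.75) * g                   # diminishing step sizes
```

Despite the kink at the origin and the noise, the iterates settle near the stationary point x = 0, consistent with the guarantees described above.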
The proximal gradient algorithm for minimizing the sum of a smooth and a nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the "error", that is, the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to a linear convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.

Introduction. Under favorable conditions, many fundamental optimization algorithms converge linearly: the distance of the iterates to the optimal solution set (the "error") is bounded by a decreasing geometric sequence. The classical optimization literature highlights how quadratic growth properties of the objective function, typically guaranteed through second-order optimality conditions, ensure such linear convergence. Central examples traditionally include the method of steepest descent for smooth minimization [5, Theorem 3.4] and, more abstractly, the proximal point method for nonsmooth convex problems [41, Theorem 2, Proposition 7]. More recent techniques, originally highlighted in the work of Luo and Tseng [26], postulate that the step length at each iteration of the algorithm linearly bounds the error. Such "error bounds" are commonly used in the analysis of first-order methods for strongly convex functions, popular in modern applications such as machine learning and high-dimensional statistics, including in particular the proximal gradient method and its variants; see for example Nesterov [34] and Beck-Teboulle [7].
Convergence analysis based only on the error bound property is appealingly simple even without strong convexity, but the underlying assumption on the optimization problem is opaque, at least at first sight.
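To make the step-length termination criterion concrete, here is a minimal proximal gradient sketch (our illustration, with a made-up one-dimensional instance F(x) = (1/2)(ax − b)² + μ|x|) that stops when the step ‖x_{k+1} − x_k‖ is short, the near-stationarity signal observed above:

```python
def prox_l1(x, t):
    # Proximal operator of t*|.|: soft-thresholding.
    return max(abs(x) - t, 0.0) * (1.0 if x >= 0 else -1.0)

def prox_gradient(a, b, mu, x0, step, tol=1e-10, max_iter=10000):
    # Proximal gradient on F(x) = 0.5*(a*x - b)^2 + mu*|x|,
    # terminated when the step length falls below tol.
    x = x0
    for _ in range(max_iter):
        grad = a * (a * x - b)                 # gradient of the smooth part
        x_new = prox_l1(x - step * grad, step * mu)
        if abs(x_new - x) <= tol:              # short step => near-stationarity
            return x_new
        x = x_new
    return x

# For a = 1, b = 2, mu = 0.5 the minimizer is b - mu = 1.5 (soft-thresholding).
x_star = prox_gradient(1.0, 2.0, 0.5, 0.0, step=1.0)
```

With step = 1/L for the smooth part (here L = a² = 1), the iteration reaches the minimizer and the step length collapses to zero, triggering termination.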
We consider the global efficiency of algorithms for minimizing a sum of a convex function and a composition of a Lipschitz convex function with a smooth map. The basic algorithm we rely on is the prox-linear method, which in each iteration solves a regularized subproblem formed by linearizing the smooth map. When the subproblems are solved exactly, the method has efficiency O(ε^{-2}), akin to gradient descent for smooth minimization. We show that when the subproblems can only be solved by first-order methods, a simple combination of smoothing, the prox-linear method, and a fast-gradient scheme yields an algorithm with complexity O(ε^{-3}). The technique readily extends to minimizing an average of m composite functions, with complexity O(m/ε^2 + √m/ε^3) in expectation. We round off the paper with an inertial prox-linear method that automatically accelerates in the presence of convexity.

The proximal gradient algorithm, investigated by Beck-Teboulle [4] and Nesterov [54, Section 3], is a popular first-order method for additive composite minimization. Much of the current paper will center around the prox-linear method, which is a direct extension of the prox-gradient algorithm to the entire problem class (1.1). In each iteration, the prox-linear method linearizes the smooth map c(·) and solves the proximal subproblem

x⁺ = argmin_y { g(y) + h(c(x) + ∇c(x)(y − x)) + (1/(2t))‖y − x‖² },    (1.3)

for an appropriately chosen parameter t > 0. The underlying assumption here is that the strongly convex proximal subproblems (1.3) can be solved efficiently. This is indeed reasonable in some circumstances. For example, one may have available specialized methods for the proximal subproblems, or interior-point methods may be applicable for moderate dimensions d and m, or it may be the case that computing an accurate estimate of ∇c(x) is already the bottleneck (see e.g. Example 3.5). The prox-linear method was recently investigated in [13, 23, 38, 53], though the ideas behind the algorithm and its trust-region variants are much older [8, 13, 28, 58, 59, 70, 72].
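In one dimension with h = |·| and g = 0, the prox-linear subproblem admits a closed form via soft-thresholding, which lets us sketch the iteration directly (our toy example: the instance c(x) = x² − 2 and the step below are our assumptions, not the paper's):

```python
def soft(z, tau):
    # Soft-thresholding: the prox of tau*|.|.
    return max(abs(z) - tau, 0.0) * (1.0 if z >= 0 else -1.0)

def prox_linear_step(c, dc, x, t):
    # One prox-linear iteration for F(x) = |c(x)| in one dimension:
    #   argmin_y |c(x) + c'(x)(y - x)| + (1/(2t)) (y - x)^2.
    # Substituting u = c(x) + c'(x)(y - x) reduces this to soft-thresholding
    # (c'(x) != 0 is assumed throughout).
    a, g = c(x), dc(x)
    u = soft(a, t * g * g)
    return x + (u - a) / g

# Example: c(x) = x^2 - 2, so F(x) = |x^2 - 2| is minimized at x = sqrt(2).
c = lambda x: x * x - 2.0
dc = lambda x: 2.0 * x
x = 2.0
for _ in range(50):
    x = prox_linear_step(c, dc, x, t=0.1)
```

Once the linearized value can be driven to zero within the regularized step, the thresholded value u vanishes and the update coincides with a Newton step on c, so the iterates home in on √2 rapidly.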
The scheme (1.3) reduces to the popular prox-gradient algorithm for additive composite minimization, while for nonlinear least squares, the algorithm is closely related to the Gauss-Newton algorithm [55, Section 10]. Our work focuses on global efficiency estimates of numerical methods. Therefore, in line with standard assumptions in the literature, we assume that h is L-Lipschitz and the Jacobian map ∇c is β-Lipschitz. As in the analysis of the prox-gradient method in Nesterov [48, 52], it is convenient to measure the progress of the prox-linear method in terms of the scaled steps, called the prox-gradients

G_t(x) = t^{-1}(x − x⁺),

where x⁺ is the result of the prox-linear step (1.3) from x. A short argument shows that with the optimal choice t = (Lβ)^{-1}, the prox-linear algorithm will find a point x satisfying ‖G_{1/(Lβ)}(x)‖ ≤ ε after at most O(Lβ(F(x_0) − inf F)/ε^2) iterations; see e.g. [13, 23]. We mention in passing that iterate convergence under the KL-inequality was recently shown in [5, 56], while local linear/quadratic rates under appropriate regularity conditions were proved in [11, 23, 53]. The contributions of our work are as foll...
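In the special case of smooth minimization over a convex set, the prox-gradient reduces to the scaled projected gradient step, G_t(x) = t^{-1}(x − proj(x − t∇f(x))). A small sketch (our example: f(x) = (x − 2)² over the interval [0, 1], all names ours) showing that G_t vanishes exactly at the constrained stationary point:

```python
def proj_interval(x, lo, hi):
    # Euclidean projection onto [lo, hi].
    return min(max(x, lo), hi)

def prox_gradient_measure(grad_f, x, t, lo=0.0, hi=1.0):
    # G_t(x) = (x - x_plus) / t, where x_plus is the projected gradient step.
    x_plus = proj_interval(x - t * grad_f(x), lo, hi)
    return (x - x_plus) / t

# f(x) = (x - 2)^2 is minimized over [0, 1] at the boundary point x = 1,
# where the unconstrained gradient is nonzero but the measure vanishes.
grad_f = lambda x: 2.0 * (x - 2.0)
```

At x = 1 the measure is zero (the projected step goes nowhere), while at interior non-stationary points it is a nonzero multiple of the step, which is what makes ‖G_t(x)‖ a natural stationarity measure for the complexity bound above.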