2016
DOI: 10.1080/10556788.2016.1190361

Coordinate descent with arbitrary sampling II: expected separable overapproximation

Abstract: The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for specia…
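To make the abstract's description concrete, an ESO inequality in this line of work typically takes the following form (a sketch in the notation common to this literature, not quoted from the paper itself):

```latex
% A function f and a sampling \hat{S} admit an ESO with parameters
% v_1, \dots, v_n > 0 if, for all x, h \in \mathbb{R}^n,
\mathbf{E}\big[ f(x + h_{[\hat{S}]}) \big]
  \;\le\; f(x) + \sum_{i=1}^{n} p_i \Big( \nabla_i f(x)\, h_i + \tfrac{v_i}{2}\, h_i^2 \Big),
% where p_i = \mathbf{P}(i \in \hat{S}) and h_{[\hat{S}]} zeroes out the
% coordinates of h outside the sampled set \hat{S}. The right-hand side is
% separable in the coordinates, which is what makes parallel updates and
% compact complexity bounds possible.
```

For a uniform sampling with $p_i = \tau/n$ this reduces to $f(x) + \tfrac{\tau}{n}\big(\langle \nabla f(x), h\rangle + \tfrac{1}{2}\|h\|_v^2\big)$, the form that appears in complexity bounds for mini-batch coordinate descent.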

Cited by 42 publications (90 citation statements) · References 30 publications
“…Mini-batch methods (which instead of just one data-example use updates from several examples per iteration) are more flexible and lie within these two communication vs. computation extremes. However, mini-batch versions of both SGD and coordinate descent (CD) [13,14,37,46–48,52–54,61,69,74] suffer from their convergence rate degrading towards the rate of batch gradient descent as the size of the mini-batch is increased. This follows because mini-batch updates are made based on the outdated previous parameter vector w, in contrast to methods that allow immediate local updates like CoCoA.…”
Section: Discussion and Related Work
confidence: 99%
“…In the time between the first online appearance of this work on arXiv (October 2013; arXiv:1310.3438), and the time this paper went to press, this work led to a number of extensions [3,7,16–18]. All of these papers share the defining feature of NSync, namely, its ability to work with an arbitrary probability law defining the selection of the active coordinates in each iteration.…”
Section: Literature
confidence: 99%
“…Motivated by the introduction of the nonuniform ESO assumption in this paper, and the development in Sect. 3 of our work, an entire paper was recently written, dedicated to the study of nonuniform ESO inequalities [16]. We now turn to the second and final assumption.…”
Section: Assumptions
confidence: 99%