2012 IEEE 51st IEEE Conference on Decision and Control (CDC)
DOI: 10.1109/cdc.2012.6426691

Communication-efficient algorithms for statistical optimization

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N …
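The average mixture (AVGM) step summarized in the abstract (split the N samples evenly across m machines, run a separate minimization on each subset, then average the estimates) can be sketched in a few lines. The least-squares loss, the synthetic data, and all function names below are illustrative assumptions for this sketch, not the paper's own implementation or experiments.

```python
# Minimal sketch of the average-mixture (AVGM) idea: split the N samples
# across m "machines", run a separate empirical-risk minimization on each
# subset, and average the resulting parameter estimates.
import numpy as np
from scipy.optimize import minimize

def local_erm(X, y, dim):
    """Minimize the empirical squared loss on one machine's subset."""
    def loss(w):
        r = X @ w - y
        return 0.5 * np.mean(r ** 2)
    return minimize(loss, np.zeros(dim)).x

def avgm(X, y, m):
    """Average-mixture estimator: average the m local ERM solutions."""
    n, dim = X.shape
    parts = np.array_split(np.arange(n), m)   # even split of the N samples
    local = [local_erm(X[idx], y[idx], dim) for idx in parts]
    return np.mean(local, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_star = np.array([1.0, -2.0, 0.5])       # illustrative ground truth
    X = rng.normal(size=(6000, 3))
    y = X @ w_star + 0.1 * rng.normal(size=6000)
    print(avgm(X, y, m=10))                   # close to w_star
```

Running the script prints an averaged estimate close to w_star; the paper's analysis concerns how the mean-squared error of such averaged estimates scales with N and m.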

Cited by 273 publications (388 citation statements); references 18 publications. Citing publications range from 2013 to 2022.

Citation statements (ordered by relevance):
“…At the other extreme, there are distributed methods using only a single round of communication, such as [24,36,38,80,81]. These methods require additional assumptions on the partitioning of the data, which are usually not satisfied in practice if the data are distributed "as is", i.e.…”
Section: Discussion and Related Workmentioning
confidence: 99%
“…When ERM is used and F(w) is λ-strongly convex, and f(w, z) is L-Lipschitz, H-smooth and has a J-Lipschitz Hessian, [29] obtain a guarantee on w̄ of the following form (in expectation over the samples):…”
Section: Average-at-the-endmentioning
confidence: 99%
“…Optimizing over λ, the best that can be ensured from (13) for learning problems requiring regularization is therefore only a sample complexity that scales as 1/ε³ rather than 1/ε². If ERM is used on each machine, [29] also suggested a bias-corrected approach that reduced the dependence on n in the second term to 1/n³ rather than 1/n², but the problematic dependence on λ remains. These deficiencies are not only in the analysis.…”
Section: Average-at-the-endmentioning
confidence: 99%
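The excerpt above mentions a bias-corrected variant of the averaging approach attributed to [29]. Below is a minimal sketch of one subsampling-based correction, assuming the combined estimate takes the form (theta_bar - r * theta_sub) / (1 - r), where theta_sub averages estimates recomputed on a random fraction r of each machine's data; this combination rule, the least-squares solver, and all names are assumptions made for illustration, not a verbatim reproduction of [29].

```python
# Minimal sketch of a subsampling-based bias correction for the averaged
# estimator. The combination rule and least-squares setup are assumptions
# for illustration, not a verbatim reproduction of [29].
import numpy as np

def local_erm(X, y):
    """Per-machine ERM; closed-form least squares for brevity."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bias_corrected_avgm(X, y, m, r=0.25, seed=0):
    """Average local estimates, then subtract a subsample-based bias estimate
    via (theta_bar - r * theta_sub) / (1 - r)  [assumed combination rule]."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    parts = np.array_split(np.arange(n), m)
    full_est, sub_est = [], []
    for idx in parts:
        full_est.append(local_erm(X[idx], y[idx]))
        # Re-estimate on a random fraction r of the same subset; assumes
        # r * (n / m) comfortably exceeds the parameter dimension.
        take = rng.choice(idx, size=int(r * len(idx)), replace=False)
        sub_est.append(local_erm(X[take], y[take]))
    theta_bar = np.mean(full_est, axis=0)
    theta_sub = np.mean(sub_est, axis=0)
    return (theta_bar - r * theta_sub) / (1.0 - r)
```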
“…An average mixture (AVGM) procedure for fitting the parameter of a parametric model has been studied by [10]. AVGM partitions the full available dataset into disjoint subsets, estimates the parameter within each subset, and finally combines the estimates by simple averaging.…”
Section: Introductionmentioning
confidence: 99%