We examine fundamental tradeoffs in iterative distributed zeroth- and first-order stochastic optimization in multi-agent networks in terms of communication cost (number of per-node transmissions) and computational cost, measured by the number of per-node noisy function (respectively, gradient) evaluations with zeroth-order (respectively, first-order) methods. Specifically, we develop novel distributed stochastic optimization methods for zeroth- and first-order strongly convex optimization by utilizing a probabilistic inter-agent communication protocol that increasingly sparsifies communications among agents as time progresses. Under standard assumptions on the cost functions and the noise statistics, we establish for the proposed methods the O(1/C_comm^{4/3−ζ}) and O(1/C_comm^{8/9−ζ}) mean square error convergence rates for first- and zeroth-order optimization, respectively, where C_comm is the expected number of network communications and ζ > 0 is arbitrarily small. The methods are shown to achieve order-optimal convergence rates in terms of computational cost C_comp, namely O(1/C_comp) for first-order optimization and O(1/C_comp^{2/3}) for zeroth-order optimization, while also achieving order-optimal convergence rates in terms of iterations. Experiments on real-life datasets illustrate the efficacy of the proposed algorithms.
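To make the communication-sparsification idea concrete, the following is a minimal sketch, not the paper's actual protocol: it assumes a hypothetical schedule in which each node transmits at iteration t with a probability that decays polynomially in t, so that the expected number of transmissions grows strictly slower than the number of iterations.

```python
import random

random.seed(0)

def comm_probability(t, delta=0.4):
    # Hypothetical decaying schedule (an illustrative assumption, not the
    # paper's exact choice): transmit at iteration t with probability
    # p_t = (t + 1)^(-delta), so communications sparsify over time.
    return (t + 1) ** (-delta)

def simulate(num_iters=10000, delta=0.4):
    # Count how many of num_iters iterations actually trigger a
    # per-node transmission under the randomized schedule.
    comms = 0
    for t in range(num_iters):
        if random.random() < comm_probability(t, delta):
            comms += 1
    return comms

comms = simulate()
# The expected count is sum_t (t+1)^(-delta), which scales like
# num_iters^(1 - delta): far fewer transmissions than iterations.
```

Under such a schedule the expected communication cost C_comm after T iterations is on the order of T^{1−δ}, which is what lets the mean square error, decaying in T, be re-expressed as a faster-than-1/C_comm rate in the communication cost.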
INTRODUCTION

Stochastic optimization has taken a central role in problems of learning and inference over large data sets. Many practical setups are inherently distributed: due to sheer data size, it may not be feasible to store the data on a single machine or agent. Further, due to the complexity of the objective functions (often, loss functions in the context of learning and inference problems), explicitly computing gradients or exactly evaluating the objective at desired arguments can be computationally prohibitive. The class of stochastic optimization problems of interest can be formalized as follows:

min_{x ∈ R^d} f(x) := E_ξ[F(x; ξ)],

where the information available to an optimization scheme usually consists of the gradients ∇F(x; ξ) or the function values F(x; ξ) themselves. However, these gradients and function values are only unbiased estimates of the gradients and function values of the desired objective f(x). Moreover, due to huge data sizes and distributed