This article proposes a communication-efficient decentralized deep learning algorithm, coined layer-wise federated group ADMM (L-FGADMM). To minimize an empirical risk, every worker in L-FGADMM periodically communicates with its two neighbors, where the communication periods are adjusted separately for different layers of its deep neural network. A constrained optimization problem for this setting is formulated and solved using the stochastic version of GADMM proposed in our prior work. Numerical evaluations show that by exchanging the largest layer less frequently, L-FGADMM significantly reduces the communication cost without compromising the convergence speed. Surprisingly, despite exchanging less information and operating in a decentralized manner, intermittently skipping the consensus on the largest layer in L-FGADMM creates a regularizing effect, achieving a test accuracy as high as that of federated learning (FL), a baseline method that enforces consensus on all layers with the aid of a central entity.

Layer-wise Federated GADMM (L-FGADMM). To bridge the gap between FL and GADMM, in this article we propose L-FGADMM, which integrates the periodic communication and random data sampling of FL into GADMM under a deep NN architecture. To further improve communication efficiency, as illustrated in Fig. 1c, L-FGADMM applies a different communication period to each layer. By exchanging the largest layer half as frequently as the other layers, our results show that L-FGADMM achieves the same test accuracy while saving 48.8% and 60.8% of the average communication cost, compared to the case using the same communication period for all layers and to FL, respectively.
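As a concrete illustration of the layer-wise schedule, the following Python sketch shows how a worker could decide which layers to exchange with its two neighbors at a given communication round; the layer names, periods, and helper function are illustrative assumptions, not the paper's implementation.

    # Minimal sketch (not the authors' code) of a layer-wise communication
    # schedule: each layer has its own period, and a layer is exchanged with
    # the two neighbors only at iterations divisible by that period.
    # Layer names and periods below are illustrative assumptions.
    comm_period = {"conv1": 1, "conv2": 1, "fc_large": 2, "output": 1}

    def layers_to_exchange(iteration):
        """Layers this worker exchanges with its two neighbors at `iteration`."""
        return [name for name, period in comm_period.items()
                if iteration % period == 0]

    # Over four rounds, the largest layer ("fc_large") is skipped every other
    # round, halving its share of the communication cost.
    for t in range(1, 5):
        print(f"round {t}: exchange {layers_to_exchange(t)}")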
Related Works. Towards improving the communication efficiency of distributed ML, in centralized ML the number of communication rounds can be reduced by collaboratively adjusting the training momentum [11], [12]. On the other hand, the number of communication links can be decreased by collecting model updates until a time deadline [13], only when the values have sufficiently changed from the preceding updates [14], [15], or based on channel conditions [16]-[18]. Furthermore, the communication payload can be compressed by 1-bit gradient quantization [19], multi-bit gradient quantization [15], or weight quantization with random rotation [20]. Alternatively, instead of model parameters, model outputs can be exchanged for large models via knowledge distillation [21], [22]. Similar principles are applicable to communication-efficient decentralized ML. Without any central entity, communication payload sizes can be reduced by a quantized weight gossiping algorithm [23], which does not reduce the number of communication links. Alternatively, the number of communication links and rounds can be decreased using GADMM, proposed in our prior work [8]. Furthermore, by integrating stochastic quantization into GADMM, quantized GADMM (Q-GADMM) was proposed to reduce communication rounds, links, and payload sizes altogether [10]. To achieve the same goals, instead of quantization as in Q-GADMM, L-FGADMM applies a layer-wise federation to GADMM...