With the increase in data volume and model scale, distributed parallel training has become an important and successful technique for addressing optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, in practice they are significantly limited by the communication cost, making it difficult to achieve a linear time speedup. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm that runs computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has lower communication complexity and a better time speedup compared with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3× faster than traditional synchronous SGD.

Introduction

The training of deep neural networks is resource-intensive and time-consuming. With the expansion of data and model scale, it may take days or weeks to train a deep model using mini-batch SGD on a single machine/GPU. To accelerate the training process, distributed optimization provides an effective tool for deep network training by allocating the computation to multiple computing devices (CPUs or GPUs).

When variants of mini-batch SGD are applied to a distributed system, communication between computing devices is incurred to maintain the same convergence rate as mini-batch SGD. In fact, this extra communication cost is the main factor that prevents a distributed optimization algorithm from achieving a linear time speedup, even though its computation load is the same as that of the single-machine version. Moreover, the communication cost, which is often linearly proportional to the number of workers, can become extremely expensive when the number of workers is large. It is therefore critical to reduce the communication bottleneck in order to make better use of the hardware resources.

Given that the total amount of communication bits equals the number of communication rounds multiplied by the number of bits per round, several works propose to accelerate training by reducing the communication frequency [Yu et al., 2018; Zhou and Cong, 2018] or the number of communication bits [Alistarh et al., 2017; Lin et al., 2017; Wen et al., 2017]. However, even when the communication frequency or the number of bits per communication is reduced, traditional synchronous distributed algorithms do not fully exploit the hardware resources, for two reasons: (1) only part of the resources can be used while workers are communicating with each other, and (2) the computation and the communication are interdependent within each iteration.
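To make the decoupling idea concrete, the following is a minimal sketch (not the authors' exact CoCoD-SGD algorithm) of how parameter averaging for the previous round can run in a background thread while workers continue local SGD steps. All names (simulated_all_reduce, local_sgd_steps, the combination rule at the end of each round) are illustrative assumptions on a toy least-squares problem.

```python
# Sketch: overlap "communication" (parameter averaging) with local computation.
import threading
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, local_steps = 4, 10, 0.1, 5

# Toy objective split across workers: f_k(x) = ||A_k x - b_k||^2
A = [rng.standard_normal((32, dim)) for _ in range(n_workers)]
b = [rng.standard_normal(32) for _ in range(n_workers)]
x = [np.zeros(dim) for _ in range(n_workers)]          # per-worker parameters

def local_sgd_steps(k, x_k):
    """Run a few local mini-batch SGD steps on worker k starting from x_k."""
    for _ in range(local_steps):
        idx = rng.choice(32, size=8, replace=False)
        grad = 2 * A[k][idx].T @ (A[k][idx] @ x_k - b[k][idx]) / len(idx)
        x_k = x_k - lr * grad
    return x_k

def simulated_all_reduce(xs):
    """Stand-in for an all-reduce: average the workers' parameters."""
    avg = np.mean(xs, axis=0)
    return [avg.copy() for _ in xs]

for rnd in range(20):
    # 1) Start the "communication" of the current parameters in the background.
    reduced = {}
    comm = threading.Thread(
        target=lambda: reduced.setdefault("x", simulated_all_reduce(x)))
    comm.start()

    # 2) In parallel, keep computing local SGD steps from the stale copies.
    local = [local_sgd_steps(k, x[k]) for k in range(n_workers)]

    # 3) Wait for communication, then combine the averaged point with the
    #    local progress made meanwhile (one simple combination rule).
    comm.join()
    x = [reduced["x"][k] + (local[k] - x[k]) for k in range(n_workers)]

print("objective:", sum(np.linalg.norm(A[k] @ x[k] - b[k])**2 for k in range(n_workers)))
```

The key point of the sketch is only the overlap in each round: the communication of step 1 and the computation of step 2 no longer block each other, which is the source of the time speedup described above.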
Composition optimization has drawn a lot of attention in a wide variety of machine learning domains, from risk management to reinforcement learning. Existing methods for the composition optimization problem often work in a sequential, single-machine manner, which limits their applicability to large-scale problems. To address this issue, this paper proposes two asynchronous parallel variance-reduced stochastic compositional gradient (AsyVRSC) algorithms that are suitable for large-scale data sets: AsyVRSC-Shared for the shared-memory architecture and AsyVRSC-Distributed for the master-worker architecture. The embedded variance reduction techniques enable the algorithms to achieve linear convergence rates. Furthermore, AsyVRSC-Shared and AsyVRSC-Distributed enjoy provable linear speedup when the time delays are bounded by the data dimensionality or by the sparsity ratio of the partial gradients, respectively. Extensive experiments are conducted to verify the effectiveness of the proposed algorithms.
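For readers unfamiliar with the compositional setting, the following is a minimal sequential sketch of the stochastic compositional gradient structure that AsyVRSC parallelizes; it is a plain SCGD-style loop on synthetic data, not the asynchronous variance-reduced algorithm itself, and all problem data and parameter names are assumptions.

```python
# Sketch: minimize F(x) = (1/n) sum_i f_i( g(x) ), g(x) = (1/m) sum_j g_j(x),
# with linear inner maps g_j(x) = B_j x + c_j and f_i(y) = 0.5 * ||y - d_i||^2.
import numpy as np

rng = np.random.default_rng(1)
n, m, dim = 50, 50, 5

B = rng.standard_normal((m, dim, dim)) / dim
c = rng.standard_normal((m, dim))
d = rng.standard_normal((n, dim))

x = np.zeros(dim)
y = np.zeros(dim)              # running estimate of the inner value g(x)
lr, beta = 0.05, 0.5           # step size and inner-tracking weight

for t in range(500):
    j = rng.integers(m)        # sample an inner component g_j
    i = rng.integers(n)        # sample an outer component f_i

    # Track the inner value y ~= g(x) using only the sampled component g_j.
    y = (1 - beta) * y + beta * (B[j] @ x + c[j])

    # Compositional gradient estimate: (Jacobian of g_j at x)^T * grad f_i(y).
    grad = B[j].T @ (y - d[i])
    x = x - lr * grad

g_full = np.mean([B[j] @ x + c[j] for j in range(m)], axis=0)
print("objective:", 0.5 * np.mean([np.linalg.norm(g_full - d[i])**2 for i in range(n)]))
```

The inner estimate y is what makes composition optimization harder than ordinary SGD (the gradient is not an unbiased sample of the true gradient), and it is this two-level estimation that the variance reduction and asynchronous updates in AsyVRSC are designed around.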
Privacy in AI has remained a topic of interest for researchers and the general public in recent years. As one way to implement privacy-preserving AI, differentially private learning is a framework that enables AI models to use differential privacy (DP). To achieve DP in the learning process, existing algorithms typically limit the magnitude of gradients with a constant clipping threshold, which requires careful tuning due to its significant impact on model performance. As a solution to this issue, recent works, NSGD and Auto-S, innovatively propose to use normalization instead of clipping to avoid hyperparameter tuning. However, normalization-based approaches such as NSGD and Auto-S rely on a monotonic weight function, which places excessive weight on samples with small gradients and introduces extra deviation into the update. In this paper, we propose a Differentially Private Per-Sample Adaptive Clipping (DP-PSAC) algorithm based on a non-monotonic adaptive weight function, which guarantees privacy without the hyperparameter tuning typically required by constant clipping, while significantly reducing the deviation between the update and the true batch-averaged gradient. We provide a rigorous theoretical convergence analysis and show that, with a convergence rate of the same order, the proposed algorithm achieves a lower non-vanishing bound than NSGD/Auto-S, and that this advantage is maintained over training iterations. In addition, through extensive experimental evaluation, we show that DP-PSAC outperforms or matches state-of-the-art methods on multiple mainstream vision and language tasks.
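To illustrate the distinction between the per-sample weighting schemes mentioned above, here is a minimal sketch of a DP-SGD-style update with three interchangeable weight functions. The constant-clipping and normalization weights are the standard forms; the non-monotonic weight below is only an illustrative function in the spirit of DP-PSAC (the paper's exact weight function may differ), included to contrast how it treats small-gradient samples.

```python
# Sketch: per-sample weighting in a DP-SGD-style update (illustrative only).
import numpy as np

def clip_weight(norm, C=1.0):
    """Constant clipping (C must be tuned): w = min(1, C / ||g||)."""
    return min(1.0, C / norm)

def normalize_weight(norm, C=1.0, r=0.01):
    """Normalization (NSGD/Auto-S style, monotonic): w = C / (||g|| + r).
    Note how small ||g|| receives a very large weight."""
    return C / (norm + r)

def nonmonotonic_weight(norm, C=1.0, r=0.01):
    """Illustrative non-monotonic weight (assumed form, not necessarily the
    paper's): small-norm samples are not over-weighted the way a purely
    monotonic 1/(||g|| + r) weight over-weights them."""
    return C / (norm + r / (norm + r))

def dp_update(per_sample_grads, weight_fn, noise_multiplier=1.0, C=1.0, rng=None):
    """Weight each per-sample gradient, average, and add Gaussian noise
    calibrated to the per-sample contribution bound C."""
    rng = rng or np.random.default_rng(0)
    weighted = [weight_fn(np.linalg.norm(g), C) * g for g in per_sample_grads]
    avg = np.mean(weighted, axis=0)
    noise = rng.normal(0.0, noise_multiplier * C / len(per_sample_grads), size=avg.shape)
    return avg + noise

# Compare the weight assigned to gradients of different norms:
for norm in (1e-3, 1e-1, 1.0, 10.0):
    print(norm,
          round(clip_weight(norm), 3),
          round(normalize_weight(norm), 3),
          round(nonmonotonic_weight(norm), 3))
```

Running the comparison shows the effect described in the abstract: at very small gradient norms the monotonic normalization weight blows up (≈ 1/r), whereas the clipping and non-monotonic weights stay close to 1, which is why the latter introduce less deviation from the true batch-averaged gradient.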