With the increase in the amount of data and the expansion of model scale, distributed parallel training has become an important and successful technique for addressing the resulting optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, in practice they are significantly limited by the communication cost, which makes a linear time speedup difficult to achieve. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm that runs computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has lower communication complexity and better time speedup compared with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3× faster than traditional synchronous SGD.
Introduction

The training of deep neural networks is resource intensive and time consuming. With the expansion of data and model scale, it may take days or weeks to train a deep model using mini-batch SGD on a single machine/GPU. To accelerate the training process, distributed optimization provides an effective tool for deep network training by allocating the computation to multiple computing devices (CPUs or GPUs).

When variants of mini-batch SGD are applied to a distributed system, communication between computing devices is incurred to keep the same convergence rate as mini-batch SGD. In fact, this extra communication cost is the main factor preventing a distributed optimization algorithm from achieving a linear time speedup, even though its computation load is the same as that of its single-machine version. Moreover, the communication cost, which is often linearly proportional to the number of workers, can become extremely expensive when the number of workers is large. Therefore, it is critical to reduce the communication bottleneck in order to make better use of the hardware resources.

Given that the total number of communicated bits equals the number of communication rounds multiplied by the number of bits per round, several works accelerate training by reducing the communication frequency [Yu et al., 2018; Zhou and Cong, 2018] or the number of bits per communication [Alistarh et al., 2017; Lin et al., 2017; Wen et al., 2017]. However, even when the communication frequency or the number of bits per communication is reduced, hardware resources are not fully exploited by traditional synchronous distributed algorithms, for two reasons: (1) only part of the resources can be used while workers are communicating with each other, and (2) the computation and the communication are interdependent in each iteration.
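To make the decoupling idea concrete, the following is a minimal, self-contained sketch of how the communication of the current model can run in a background thread while local SGD steps proceed, so that communication time is hidden behind computation rather than serialized with it. This is an illustration of the general overlap pattern under simplifying assumptions, not the authors' exact CoCoD-SGD update rule: the toy objective, the simulated_allreduce helper (a sleep standing in for a real collective such as an asynchronous allreduce), and the simple convex-combination merge are all hypothetical choices made for the example.

```python
# Sketch of computation/communication overlap (single process, simulated network).
# In a real distributed system the background call would be an asynchronous
# collective (e.g., a non-blocking allreduce) instead of a sleep.

import threading
import time

def simulated_allreduce(local_model, result_holder):
    """Stand-in for a network allreduce: pretend averaging takes a while."""
    time.sleep(0.05)                     # network latency we want to hide
    result_holder["avg"] = local_model   # single-worker stand-in for the average

def local_sgd_steps(model, grad_fn, lr=0.1, steps=5):
    """Plain local computation: a few SGD steps on the current iterate."""
    for _ in range(steps):
        model = model - lr * grad_fn(model)
    return model

def grad(x):
    # Toy objective f(x) = (x - 3)^2, so the gradient is 2 (x - 3).
    return 2.0 * (x - 3.0)

x = 10.0
for round_idx in range(20):
    # 1) Launch communication of the current iterate in the background.
    holder = {}
    comm = threading.Thread(target=simulated_allreduce, args=(x, holder))
    comm.start()

    # 2) Overlap: keep computing local SGD steps while communication is in flight.
    local_x = local_sgd_steps(x, grad)

    # 3) Wait for communication, then fold the synchronized model and the local
    #    progress together (here an illustrative 50/50 convex combination).
    comm.join()
    x = 0.5 * holder["avg"] + 0.5 * local_x

print(f"final iterate: {x:.4f}")   # approaches the minimizer 3.0
```

In the synchronous baseline, step 3 would have to finish before step 2 could even start, so each round costs roughly computation time plus communication time; in the overlapped pattern above, each round costs roughly the maximum of the two, which is the source of the time speedup the paper targets.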