With the increase in data volume and model scale, distributed parallel training has become an important and successful technique for addressing optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, in practice they are significantly limited by the communication cost, making it difficult to achieve a linear time speedup. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm that runs computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has lower communication complexity and a better time speedup compared with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3× faster than traditional synchronous SGD.

Introduction

The training of deep neural networks is resource-intensive and time-consuming. With the expansion of data and model scale, it may take days or weeks to train a deep model using mini-batch SGD on a single machine/GPU. To accelerate the training process, distributed optimization provides an effective tool for deep network training by allocating the computation to multiple computing devices (CPUs or GPUs).

When variants of mini-batch SGD are applied to a distributed system, communication between computing devices is incurred to maintain the same convergence rate as mini-batch SGD. In fact, this extra communication cost is the main factor that prevents a distributed optimization algorithm from achieving a linear time speedup, even though its computation load is the same as that of the single-machine version. Moreover, the communication cost, which is often linearly proportional to the number of workers, can become extremely expensive when the number of workers is large. It is therefore critical to reduce the communication bottleneck in order to make better use of the hardware resources.

Given that the total amount of communication bits equals the number of communication rounds multiplied by the number of bits per round, several works propose to accelerate training by reducing the communication frequency [Yu et al., 2018; Zhou and Cong, 2018] or the number of communication bits [Alistarh et al., 2017; Lin et al., 2017; Wen et al., 2017]. However, even when the communication frequency or the number of bits per communication is reduced, traditional synchronous distributed algorithms do not fully exploit the hardware resources, for two reasons: (1) only part of the resources can be used while workers are communicating with each other, and (2) the computation and the communication are interdependent within each iteration.
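To make the decoupling idea concrete, the following is a minimal sketch (not the authors' exact CoCoD-SGD algorithm) of how parameter averaging for the previous round can run in a background thread while workers continue local SGD steps. All names (simulated_all_reduce, local_sgd_steps, the combination rule at the end of each round) are illustrative assumptions on a toy least-squares problem.

```python
# Sketch: overlap "communication" (parameter averaging) with local computation.
import threading
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, local_steps = 4, 10, 0.1, 5

# Toy objective split across workers: f_k(x) = ||A_k x - b_k||^2
A = [rng.standard_normal((32, dim)) for _ in range(n_workers)]
b = [rng.standard_normal(32) for _ in range(n_workers)]
x = [np.zeros(dim) for _ in range(n_workers)]          # per-worker parameters

def local_sgd_steps(k, x_k):
    """Run a few local mini-batch SGD steps on worker k starting from x_k."""
    for _ in range(local_steps):
        idx = rng.choice(32, size=8, replace=False)
        grad = 2 * A[k][idx].T @ (A[k][idx] @ x_k - b[k][idx]) / len(idx)
        x_k = x_k - lr * grad
    return x_k

def simulated_all_reduce(xs):
    """Stand-in for an all-reduce: average the workers' parameters."""
    avg = np.mean(xs, axis=0)
    return [avg.copy() for _ in xs]

for rnd in range(20):
    # 1) Start the "communication" of the current parameters in the background.
    reduced = {}
    comm = threading.Thread(
        target=lambda: reduced.setdefault("x", simulated_all_reduce(x)))
    comm.start()

    # 2) In parallel, keep computing local SGD steps from the stale copies.
    local = [local_sgd_steps(k, x[k]) for k in range(n_workers)]

    # 3) Wait for communication, then combine the averaged point with the
    #    local progress made meanwhile (one simple combination rule).
    comm.join()
    x = [reduced["x"][k] + (local[k] - x[k]) for k in range(n_workers)]

print("objective:", sum(np.linalg.norm(A[k] @ x[k] - b[k])**2 for k in range(n_workers)))
```

The key point of the sketch is only the overlap in each round: the communication of step 1 and the computation of step 2 no longer block each other, which is the source of the time speedup described above.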
Composition optimization has drawn a lot of attention in a wide variety of machine learning domains, from risk management to reinforcement learning. Existing methods for the composition optimization problem often work in a sequential, single-machine manner, which limits their applicability to large-scale problems. To address this issue, this paper proposes two asynchronous parallel variance-reduced stochastic compositional gradient (AsyVRSC) algorithms that are suitable for large-scale data sets: AsyVRSC-Shared for the shared-memory architecture and AsyVRSC-Distributed for the master-worker architecture. The embedded variance reduction techniques enable the algorithms to achieve linear convergence rates. Furthermore, AsyVRSC-Shared and AsyVRSC-Distributed enjoy provable linear speedup when the time delays are bounded by the data dimensionality or by the sparsity ratio of the partial gradients, respectively. Extensive experiments are conducted to verify the effectiveness of the proposed algorithms.
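For readers unfamiliar with the compositional setting, the following is a minimal sequential sketch of the stochastic compositional gradient structure that AsyVRSC parallelizes; it is a plain SCGD-style loop on synthetic data, not the asynchronous variance-reduced algorithm itself, and all problem data and parameter names are assumptions.

```python
# Sketch: minimize F(x) = (1/n) sum_i f_i( g(x) ), g(x) = (1/m) sum_j g_j(x),
# with linear inner maps g_j(x) = B_j x + c_j and f_i(y) = 0.5 * ||y - d_i||^2.
import numpy as np

rng = np.random.default_rng(1)
n, m, dim = 50, 50, 5

B = rng.standard_normal((m, dim, dim)) / dim
c = rng.standard_normal((m, dim))
d = rng.standard_normal((n, dim))

x = np.zeros(dim)
y = np.zeros(dim)              # running estimate of the inner value g(x)
lr, beta = 0.05, 0.5           # step size and inner-tracking weight

for t in range(500):
    j = rng.integers(m)        # sample an inner component g_j
    i = rng.integers(n)        # sample an outer component f_i

    # Track the inner value y ~= g(x) using only the sampled component g_j.
    y = (1 - beta) * y + beta * (B[j] @ x + c[j])

    # Compositional gradient estimate: (Jacobian of g_j at x)^T * grad f_i(y).
    grad = B[j].T @ (y - d[i])
    x = x - lr * grad

g_full = np.mean([B[j] @ x + c[j] for j in range(m)], axis=0)
print("objective:", 0.5 * np.mean([np.linalg.norm(g_full - d[i])**2 for i in range(n)]))
```

The inner estimate y is what makes composition optimization harder than ordinary SGD (the gradient is not an unbiased sample of the true gradient), and it is this two-level estimation that the variance reduction and asynchronous updates in AsyVRSC are designed around.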
Privacy in AI has remained a topic of interest for researchers and the general public in recent years. As one way to implement privacy-preserving AI, differentially private learning is a framework that enables AI models to use differential privacy (DP). To achieve DP in the learning process, existing algorithms typically limit the magnitude of gradients with a constant clipping threshold, which requires careful tuning due to its significant impact on model performance. As a solution to this issue, recent works, NSGD and Auto-S, innovatively propose to use normalization instead of clipping to avoid hyperparameter tuning. However, normalization-based approaches such as NSGD and Auto-S rely on a monotonic weight function, which places excessive weight on samples with small gradients and introduces extra deviation into the update. In this paper, we propose a Differentially Private Per-Sample Adaptive Clipping (DP-PSAC) algorithm based on a non-monotonic adaptive weight function, which guarantees privacy without the hyperparameter tuning typically required by constant clipping, while significantly reducing the deviation between the update and the true batch-averaged gradient. We provide a rigorous theoretical convergence analysis and show that, with a convergence rate of the same order, the proposed algorithm achieves a lower non-vanishing bound than NSGD/Auto-S, and that this advantage is maintained over training iterations. In addition, through extensive experimental evaluation, we show that DP-PSAC outperforms or matches state-of-the-art methods on multiple mainstream vision and language tasks.
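To illustrate the distinction between the per-sample weighting schemes mentioned above, here is a minimal sketch of a DP-SGD-style update with three interchangeable weight functions. The constant-clipping and normalization weights are the standard forms; the non-monotonic weight below is only an illustrative function in the spirit of DP-PSAC (the paper's exact weight function may differ), included to contrast how it treats small-gradient samples.

```python
# Sketch: per-sample weighting in a DP-SGD-style update (illustrative only).
import numpy as np

def clip_weight(norm, C=1.0):
    """Constant clipping (C must be tuned): w = min(1, C / ||g||)."""
    return min(1.0, C / norm)

def normalize_weight(norm, C=1.0, r=0.01):
    """Normalization (NSGD/Auto-S style, monotonic): w = C / (||g|| + r).
    Note how small ||g|| receives a very large weight."""
    return C / (norm + r)

def nonmonotonic_weight(norm, C=1.0, r=0.01):
    """Illustrative non-monotonic weight (assumed form, not necessarily the
    paper's): small-norm samples are not over-weighted the way a purely
    monotonic 1/(||g|| + r) weight over-weights them."""
    return C / (norm + r / (norm + r))

def dp_update(per_sample_grads, weight_fn, noise_multiplier=1.0, C=1.0, rng=None):
    """Weight each per-sample gradient, average, and add Gaussian noise
    calibrated to the per-sample contribution bound C."""
    rng = rng or np.random.default_rng(0)
    weighted = [weight_fn(np.linalg.norm(g), C) * g for g in per_sample_grads]
    avg = np.mean(weighted, axis=0)
    noise = rng.normal(0.0, noise_multiplier * C / len(per_sample_grads), size=avg.shape)
    return avg + noise

# Compare the weight assigned to gradients of different norms:
for norm in (1e-3, 1e-1, 1.0, 10.0):
    print(norm,
          round(clip_weight(norm), 3),
          round(normalize_weight(norm), 3),
          round(nonmonotonic_weight(norm), 3))
```

Running the comparison shows the effect described in the abstract: at very small gradient norms the monotonic normalization weight blows up (≈ 1/r), whereas the clipping and non-monotonic weights stay close to 1, which is why the latter introduce less deviation from the true batch-averaged gradient.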