Zhenxun Zhuang scite author profile

Zhenxun Zhuang

4 Publications

43 Citation Statements Received

99 Citation Statements Given


Affiliations

Boston University

Publications


A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

Liu, Zhuang, Lei et al. 2022

Preprint


In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the FL setting is still in its infancy: it remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing simultaneously with a nonconvex loss function, a non-Lipschitz-continuous gradient, and skipped communication rounds. In this paper, we explore a relaxed-smoothness assumption of the loss landscape, which LSTM was shown to satisfy in previous works, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have $O(1/(N\epsilon^4))$ iteration complexity for finding an $\epsilon$-stationary point, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup. We prove this result by introducing novel analysis techniques of estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence speed in practice and thus validate our theory.
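The scheme described in this abstract (clipped local updates on each machine, with averaging only every few steps) can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's algorithm: the clipping rule (step length capped at `gamma`), the function names, the toy quartic objective, and all hyperparameter values are hypothetical choices made for illustration.

```python
import math
import random


def clip_step(x, grad, lr, gamma):
    """One clipped update: use step size lr, unless that would move x
    by more than gamma, in which case the move is capped at length gamma."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = lr if lr * norm <= gamma else gamma / norm
    return [xi - scale * gi for xi, gi in zip(x, grad)]


def average(models):
    """Coordinate-wise average of the machines' parameter vectors."""
    n = len(models)
    return [sum(vals) / n for vals in zip(*models)]


def local_clipped_sgd(grad_fn, x0, num_machines, rounds, local_steps,
                      lr, gamma, seed=0):
    """Simulate machines that each run `local_steps` clipped updates
    and then synchronise by averaging (one communication per round)."""
    rng = random.Random(seed)
    models = [list(x0) for _ in range(num_machines)]
    for _ in range(rounds):
        for m in range(num_machines):
            for _ in range(local_steps):
                g = grad_fn(models[m], rng)
                models[m] = clip_step(models[m], g, lr, gamma)
        avg = average(models)
        models = [list(avg) for _ in range(num_machines)]
    return models[0]


def noisy_quartic_grad(x, rng):
    """Stochastic gradient of the toy objective f(x) = sum(x_i^4) / 4,
    whose gradient is not Lipschitz continuous (assumed test problem)."""
    return [xi ** 3 + rng.gauss(0.0, 0.01) for xi in x]


x_final = local_clipped_sgd(noisy_quartic_grad, [3.0, -2.0],
                            num_machines=4, rounds=200, local_steps=5,
                            lr=0.1, gamma=0.1)
```

In this sketch the machines communicate once per five updates, so the number of synchronisations is a fifth of the number of gradient steps. Clipping keeps the early steps bounded even though the quartic gradient is large far from the origin.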


No-Regret Non-Convex Online Meta-Learning

Zhuang, Wang 2020


Understanding AdamW through Proximal Methods and Scale-Freeness

Zhuang, Liu, Cutkosky et al. 2022

Preprint


Adam has been widely adopted for training deep neural networks because it requires less hyperparameter tuning and delivers remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. Yet, we are still lacking a complete explanation of the advantages of AdamW. In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$. Next, we consider the property of "scale-freeness" enjoyed by AdamW and by its proximal counterpart: their updates are invariant to component-wise rescaling of the gradients. We provide empirical evidence across a wide range of deep learning experiments showing a correlation between the problems in which AdamW exhibits an advantage over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales, thus motivating the hypothesis that the advantage of AdamW could be due to the scale-free updates.
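The scale-freeness property mentioned in this abstract can be checked numerically on a toy problem. The sketch below is an illustration under assumptions, not the paper's experimental setup: it uses a commonly used form of the two update rules with the stabilising constant `eps` set to 0, and the function names, the quadratic objective, and the hyperparameters are hypothetical.

```python
import math


def adam_run(grad_fn, x0, steps, lr, wd, decoupled,
             beta1=0.9, beta2=0.999, eps=0.0):
    """Run Adam with weight decay wd.
    decoupled=False: Adam-l2, the term wd * x is added to the gradient.
    decoupled=True:  AdamW, the decay is applied to x directly."""
    x = list(x0)
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        if not decoupled:
            g = [gi + wd * xi for gi, xi in zip(g, x)]
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        new_x = []
        for xi, mi, vi in zip(x, m, v):
            m_hat = mi / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = vi / (1 - beta2 ** t)  # bias-corrected second moment
            denom = math.sqrt(v_hat) + eps
            step = lr * m_hat / denom if denom > 0 else 0.0
            if decoupled:
                xi = xi - lr * wd * xi
            new_x.append(xi - step)
        x = new_x
    return x


def make_grad(scales):
    """Gradient of 0.5 * sum((x_i - 1)^2), rescaled per coordinate."""
    return lambda x: [s * (xi - 1.0) for s, xi in zip(scales, x)]


x0 = [3.0, 0.5]
plain = make_grad([1.0, 1.0])
rescaled = make_grad([1.0, 1.0 / 1024.0])  # shrink the second coordinate


def max_gap(decoupled):
    """Largest coordinate difference between the two final iterates."""
    a = adam_run(plain, x0, 100, 0.01, 0.1, decoupled)
    b = adam_run(rescaled, x0, 100, 0.01, 0.1, decoupled)
    return max(abs(ai - bi) for ai, bi in zip(a, b))


gap_adamw = max_gap(decoupled=True)     # rescaling leaves iterates unchanged
gap_adam_l2 = max_gap(decoupled=False)  # rescaling changes the iterates
```

With decoupled decay, the rescaling cancels between the first-moment estimate and the square root of the second-moment estimate, so both runs follow the same trajectory. In the Adam-$\ell_2$ variant the regularizer term is mixed into the gradient before normalisation, so shrinking the loss gradient changes the balance between loss and regularizer, and the trajectories diverge.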


Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization

Zhuang, Cutkosky, Orabona 2019

Preprint


Stochastic Gradient Descent (SGD) has played a central role in machine learning. However, it requires a carefully hand-picked stepsize for fast convergence, which is notoriously tedious and time-consuming to tune. Over the last several years, a plethora of adaptive gradient-based algorithms have emerged to ameliorate this problem. In this paper, we propose new surrogate losses to cast the problem of learning the optimal stepsizes for the stochastic optimization of a non-convex smooth objective function onto an online convex optimization problem. This allows the use of no-regret online algorithms to compute optimal stepsizes on the fly. In turn, this results in an SGD algorithm with self-tuned stepsizes that guarantees convergence rates that are automatically adaptive to the level of noise.
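The abstract does not state the surrogate losses themselves, so the following sketch rests on an assumption: for an M-smooth objective, the standard bound f(x - eta*g) <= f(x) - eta*<grad f(x), g> + (M/2)*eta^2*||g||^2 suggests a convex quadratic surrogate in eta, obtained by replacing grad f(x) with a second, independent stochastic gradient g'. A regularised follow-the-leader rule then has a closed-form stepsize. The function names, the regulariser weight `alpha`, and the toy objective are hypothetical; this illustrates the idea and is not the paper's algorithm.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sgd_with_learned_stepsize(grad_fn, x0, steps, smoothness,
                              alpha=1.0, seed=0):
    """SGD whose stepsize minimises the running sum of assumed surrogates
        l_t(eta) = -eta * <g_t, g'_t> + (M / 2) * eta^2 * ||g_t||^2
    plus a regulariser (alpha * M / 2) * eta^2, clipped to [0, 1/M]."""
    rng = random.Random(seed)
    x = list(x0)
    sum_inner = 0.0  # running sum of <g_t, g'_t>
    sum_sq = 0.0     # running sum of ||g_t||^2
    etas = []
    for _ in range(steps):
        # Closed-form minimiser of the regularised sum of quadratics.
        eta = sum_inner / (smoothness * (alpha + sum_sq))
        eta = min(max(eta, 0.0), 1.0 / smoothness)
        g = grad_fn(x, rng)        # gradient used for the update
        g_prime = grad_fn(x, rng)  # independent second sample
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        sum_inner += dot(g, g_prime)
        sum_sq += dot(g, g)
        etas.append(eta)
    return x, etas


def noisy_quadratic_grad(x, rng):
    """Stochastic gradient of 0.5 * ||x||^2 (smoothness constant M = 1)."""
    return [xi + rng.gauss(0.0, 0.1) for xi in x]


x_final, etas = sgd_with_learned_stepsize(noisy_quadratic_grad,
                                          [5.0, -3.0], steps=200,
                                          smoothness=1.0)
```

In this sketch, the inner product of two independent gradient samples is large when the signal dominates and averages toward the squared norm of the true gradient when noise dominates, while ||g||^2 keeps accumulating the noise. The learned stepsize therefore shrinks as the iterates approach a stationary point, which is one way to read the abstract's claim of adaptivity to the noise level.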


