Dongruo Zhou scite author profile

We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization for deep learning, and pave the way for studying the optimization dynamics of training modern deep neural networks.

show abstract

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Chen¹,

Zhou

Tang

et al. 2018

Preprint

View full text Add to dashboard Cite

Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are sometimes "over adapted". We design a new algorithm, called Partially adaptive momentum estimation method (Padam), which unifies the Adam/Amsgrad with SGD to achieve the best from both worlds. Experiments on standard benchmarks show that Padam can maintain fast convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results would suggest practitioners pick up adaptive gradient methods once again for faster training of deep neural networks.

show abstract

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Chen

Zhou

Tang

et al. 2020

View full text Add to dashboard Cite

Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are sometimes "over adapted". We design a new algorithm, called Partially adaptive momentum estimation method, which unifies the Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best from both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm can maintain fast convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results would suggest practitioners pick up adaptive gradient methods once again for faster training of deep neural networks.

show abstract

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Zhou¹,

Gu²,

Szepesvári³

2020

Preprint

View full text Add to dashboard Cite

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020;Ayoub et al., 2020;Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named UCRL-VTR + for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that UCRL-VTR + attains an O(dH √ T ) regret where d is the dimension of feature mapping, H is the length of the episode and T is the number of interactions with the MDP. We also prove a matching lower bound Ω(dH √ T ) for this setting, which shows that UCRL-VTR + is minimax optimal up to logarithmic factors. In addition, we propose the UCLK + algorithm for the same family of MDPs under discounting and show that it attains an O(d √ T /(1 − γ) 1.5 ) regret, where γ ∈ [0, 1) is the discount factor. Our upper bound matches the lower bound Ω(d √ T /(1 − γ) 1.5 ) proved in Zhou et al. ( 2020) up to logarithmic factors, suggesting that UCLK + is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.

show abstract

A Frank-Wolfe Framework for Efficient and Effective Adversarial Attacks

Chen

Zhou

et al. 2020

AAAI

View full text Add to dashboard Cite

Depending on how much information an adversary can access to, adversarial attacks can be classified as white-box attack and black-box attack. For white-box attack, optimization-based attack algorithms such as projected gradient descent (PGD) can achieve relatively high attack success rates within moderate iterates. However, they tend to generate adversarial examples near or upon the boundary of the perturbation set, resulting in large distortion. Furthermore, their corresponding black-box attack algorithms also suffer from high query complexities, thereby limiting their practical usefulness. In this paper, we focus on the problem of developing efficient and effective optimization-based adversarial attack algorithms. In particular, we propose a novel adversarial attack framework for both white-box and black-box settings based on a variant of Frank-Wolfe algorithm. We show in theory that the proposed attack algorithms are efficient with an O(1/√T) convergence rate. The empirical results of attacking the ImageNet and MNIST datasets also verify the efficiency and effectiveness of the proposed algorithms. More specifically, our proposed algorithms attain the best attack performances in both white-box and black-box attacks among all baselines, and are more time and query efficient than the state-of-the-art.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Dongruo Zhou

Gradient descent optimizes over-parameterized deep ReLU networks

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

A Frank-Wolfe Framework for Efficient and Effective Adversarial Attacks

Contact Info

Product

Resources

About