2016
DOI: 10.48550/arxiv.1612.05086
Preprint

Coupling Adaptive Batch Sizes with Learning Rates

Abstract: Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, …
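The abstract describes coupling the batch size to the learning rate through the variance of the gradient estimates. Below is a minimal sketch of how such a coupling might look in practice; the specific rule, the helper names (estimate_gradient_stats, next_batch_size), and the clipping bounds are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch: couple the batch size to the learning rate via the
# gradient-noise level estimated from per-example gradients.
import numpy as np

def estimate_gradient_stats(per_example_grads):
    """Mean gradient and trace of the (diagonal) gradient covariance
    estimated from one mini-batch (one row per example)."""
    mean_grad = per_example_grads.mean(axis=0)
    var_trace = per_example_grads.var(axis=0, ddof=1).sum()
    return mean_grad, var_trace

def next_batch_size(learning_rate, var_trace, loss_value,
                    b_min=16, b_max=4096):
    """Illustrative coupling: larger step sizes or noisier gradients
    call for larger batches; the exact formula is an assumption."""
    b = learning_rate * var_trace / max(loss_value, 1e-12)
    return int(np.clip(round(b), b_min, b_max))
```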

Cited by 15 publications (22 citation statements)
References 8 publications

Citation statements
“…Suppose that in the l-th time interval, the objective function has a local Lipschitz smoothness L_l. Then, by using the approximation η_l L_l ≈ 1, which is common in the SGD literature (Balles et al., 2016), we derive the following adaptive strategy:…”
Section: Incorporating Adaptive Learning Rate
confidence: 99%
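The statement above relies on a step size that is roughly the reciprocal of a local Lipschitz smoothness constant, i.e. η_l L_l ≈ 1. The sketch below illustrates that idea; the finite-difference smoothness estimator and the clipping bounds are assumptions for illustration, not the cited paper's construction.

```python
# Hedged sketch: set the learning rate for the l-th interval to
# approximately 1 / L_l, where L_l is a local smoothness estimate.
import numpy as np

def local_smoothness(grad_prev, grad_curr, theta_prev, theta_curr, eps=1e-12):
    """Estimate a local Lipschitz constant of the gradient from two iterates."""
    return np.linalg.norm(grad_curr - grad_prev) / (
        np.linalg.norm(theta_curr - theta_prev) + eps)

def adaptive_learning_rate(grad_prev, grad_curr, theta_prev, theta_curr,
                           eta_min=1e-5, eta_max=1.0):
    """Choose eta_l ≈ 1 / L_l, clipped to a safe range."""
    L_l = local_smoothness(grad_prev, grad_curr, theta_prev, theta_curr)
    return float(np.clip(1.0 / (L_l + 1e-12), eta_min, eta_max))
```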
“…In this work we instead take inspiration from more recent literature [37] and build stochastic models which include the effect of a decreasing learning rate into the drift and the volatility coefficients through the adjustment function ψ(·). This allows, in contrast to the ODE method, to provide non-asymptotic arguments and convergence rates.…”
Section: Approximation Guarantees
confidence: 99%
“…Decreasing mini-batch size. From Thm. 1, 2, 3, it is clear that, as is well known [10, 7], a simple way to converge to a local minimizer is to pick b(·) increasing as a function of time. However, this corresponds to dramatically increasing the complexity in terms of gradient computations.…”
Section: Continuous-time Analysis
confidence: 99%
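The statement above points out that an increasing batch-size schedule b(·) aids convergence but inflates the total number of gradient evaluations. A small sketch of that trade-off follows, assuming a geometric schedule, which is an illustrative choice rather than the one analyzed in the cited work.

```python
# Hedged sketch: an increasing batch-size schedule and its cumulative
# cost in gradient computations.
def batch_size_schedule(t, b0=32, growth=1.05, b_max=8192):
    """Geometrically increasing batch size as a function of iteration t."""
    return min(int(b0 * growth ** t), b_max)

def total_gradient_evaluations(T, **kwargs):
    """Total number of per-example gradient computations after T iterations."""
    return sum(batch_size_schedule(t, **kwargs) for t in range(T))
```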
“…We note that this could also be useful in other problems involving statistics of individual gradients, e.g. computing the gradient variance (Zhao & Zhang, 2015; Balles et al., 2016; Mahsereci & Hennig, 2017; Balles & Hennig, 2018), which is out of our scope.…”
Section: Introduction
confidence: 99%
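The last statement mentions computing the variance of individual gradients. As a concrete illustration, the sketch below does this for a linear least-squares model; the model choice and helper names are assumptions made only to show how per-example gradients feed that statistic.

```python
# Hedged sketch: per-example gradients and their variance for a
# linear least-squares loss 0.5 * (x·theta - y)^2.
import numpy as np

def per_example_gradients(theta, X, y):
    """Gradient for each example, one row per example (shape: n x d)."""
    residuals = X @ theta - y          # shape: (n,)
    return residuals[:, None] * X      # shape: (n, d)

def gradient_variance(theta, X, y):
    """Trace of the diagonal covariance of the individual gradients."""
    grads = per_example_gradients(theta, X, y)
    return grads.var(axis=0, ddof=1).sum()
```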