2019
DOI: 10.48550/arxiv.1904.12838
Preprint

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Abstract: Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD's final iterate has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of…
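For concreteness, a minimal sketch of the two schedule families the abstract contrasts, polynomial decay versus geometric (step) decay; the decay exponent, factor, and interval below are illustrative choices, not values from the paper:

def polynomial_decay(eta0, t, alpha=0.5):
    # Polynomially decaying step size, e.g. eta_t = eta0 / (t + 1)^alpha.
    return eta0 / (t + 1) ** alpha

def step_decay(eta0, t, factor=0.5, interval=30):
    # Geometric "step decay": cut the step size by `factor` every `interval` iterations.
    return eta0 * factor ** (t // interval)

# Compare the two schedules at a few iterations (illustrative initial rate 0.1).
for t in (0, 30, 60, 90):
    print(t, round(polynomial_decay(0.1, t), 4), round(step_decay(0.1, t), 4))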

Cited by 18 publications (19 citation statements). References 24 publications.

“…3. The red line is an "adaptive" gradient descent method [20,21] whereby we start out with α = 1 but successively decrease it by a geometric factor β whenever an update of the parameters causes a decrease in the objective function. Note that there is still a hyperparameter here in choosing the size of the geometric decay factor β.…”
Section: Results
Mentioning (confidence: 99%)
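A minimal sketch of the annealing rule described in this excerpt, taken literally: start at α = 1 and shrink the step size by a geometric factor β whenever an update decreases the objective. The toy objective, the value of beta, and the function names are illustrative assumptions, not taken from the cited work:

import numpy as np

def adaptive_geometric_gd(f, grad, x0, beta=0.5, steps=100):
    # Gradient descent that starts at alpha = 1 and multiplies alpha by beta
    # whenever an update decreases the objective (rule as quoted above).
    x = np.asarray(x0, dtype=float)
    alpha = 1.0
    for _ in range(steps):
        x_new = x - alpha * grad(x)
        if f(x_new) < f(x):   # the update decreased the objective
            alpha *= beta     # anneal the step size geometrically
        x = x_new
    return x

# Toy usage on f(x) = ||x||^2 / 2 (hypothetical example).
x_star = adaptive_geometric_gd(f=lambda x: 0.5 * x @ x, grad=lambda x: x, x0=[1.0, -2.0])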
“…The bounded gradient is generally assumed in the non-convex/convex convergence analysis of SGD (Nesterov, 2003;Reddi et al, 2016). And the learning rate schedule is necessary for the analysis of SGD to decay its constant gradient variance (Ge et al, 2019).…”
Section: Training Dynamic Of GCN
Mentioning (confidence: 99%)
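For intuition behind this remark, a standard back-of-the-envelope bound (not taken from the cited analyses): for SGD with constant step size $\eta$ on a $\mu$-strongly convex, $L$-smooth objective with gradient-noise variance $\sigma^2$, one has roughly
\[
\mathbb{E}\bigl[f(x_t) - f^\star\bigr] \;\lesssim\; (1-\eta\mu)^{t}\,\bigl(f(x_0)-f^\star\bigr) \;+\; \frac{\eta L \sigma^{2}}{2\mu},
\]
so a constant step size leaves a noise floor proportional to $\eta\sigma^{2}$, and only a decaying schedule drives both terms to zero.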
“…We use a popular step decay learning rate schedule as our baseline [20], which is used in the open source ResNet implementation [19]. It contains three components: initial learning rate, discount step, and discount factor.…”
Section: Baseline Learning Rate Schedule
Mentioning (confidence: 99%)
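A minimal sketch of such a three-parameter schedule, assuming the rate is multiplied by the discount factor once every discount_step epochs (names and example values are illustrative, not taken from the cited implementation):

def step_decay_schedule(initial_lr, discount_step, discount_factor):
    # Map an epoch index to a learning rate: start at initial_lr and apply
    # the discount factor once per discount_step epochs.
    return lambda epoch: initial_lr * discount_factor ** (epoch // discount_step)

lr_at = step_decay_schedule(initial_lr=0.1, discount_step=20, discount_factor=0.9)
print([round(lr_at(e), 4) for e in (0, 19, 20, 40)])  # -> [0.1, 0.1, 0.09, 0.081]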
“…The baseline schedule starts from the initial learning rate, then it decreases by the discount factor every discount steps. In the baseline experiments, we test all combinations from the initial learning rate in [0.1, 0.01, 0.001, 0.0001], the discount step in [10,20,50,100], and the discount factor in [0.99, 0.9, 0.88]. After choosing the best baseline schedule, we run it 10 times with the same set of hyper-parameters and report mean and standard deviation of test loss and accuracy.…”
Section: Baseline Learning Rate Schedule
Mentioning (confidence: 99%)
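A sketch of how the grid described above could be enumerated; evaluate is a hypothetical user-supplied function (not part of the cited work) that trains the model with a given schedule and returns its test loss:

from itertools import product

# Hyperparameter grid quoted above: 4 * 4 * 3 = 48 baseline combinations.
INITIAL_LRS      = [0.1, 0.01, 0.001, 0.0001]
DISCOUNT_STEPS   = [10, 20, 50, 100]
DISCOUNT_FACTORS = [0.99, 0.9, 0.88]

def best_baseline(evaluate):
    # evaluate(initial_lr, discount_step, discount_factor) -> test loss
    # (hypothetical interface); return the combination with the lowest loss.
    return min(product(INITIAL_LRS, DISCOUNT_STEPS, DISCOUNT_FACTORS),
               key=lambda cfg: evaluate(*cfg))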