2016
DOI: 10.48550/arxiv.1612.05086
Preprint

Coupling Adaptive Batch Sizes with Learning Rates

Abstract: Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, …
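The abstract describes coupling the batch size to the learning rate through the variance of the gradient estimates. Below is a minimal sketch of how such a coupling might look in practice; the specific rule, the helper names (estimate_gradient_stats, next_batch_size), and the clipping bounds are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch: couple the batch size to the learning rate via the
# gradient-noise level estimated from per-example gradients.
import numpy as np

def estimate_gradient_stats(per_example_grads):
    """Mean gradient and trace of the (diagonal) gradient covariance
    estimated from one mini-batch (one row per example)."""
    mean_grad = per_example_grads.mean(axis=0)
    var_trace = per_example_grads.var(axis=0, ddof=1).sum()
    return mean_grad, var_trace

def next_batch_size(learning_rate, var_trace, loss_value,
                    b_min=16, b_max=4096):
    """Illustrative coupling: larger step sizes or noisier gradients
    call for larger batches; the exact formula is an assumption."""
    b = learning_rate * var_trace / max(loss_value, 1e-12)
    return int(np.clip(round(b), b_min, b_max))
```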

Cited by 15 publications (22 citation statements)
References 8 publications

Citation statements
“…Suppose that in the l-th time interval, the objective function has a local Lipschitz smoothness L_l. Then, by using the approximation η_l L_l ≈ 1, which is common in the SGD literature (Balles et al., 2016), we derive the following adaptive strategy:…”
Section: Incorporating Adaptive Learning Rate
confidence: 99%
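The statement above relies on a step size that is roughly the reciprocal of a local Lipschitz smoothness constant, i.e. η_l L_l ≈ 1. The sketch below illustrates that idea; the finite-difference smoothness estimator and the clipping bounds are assumptions for illustration, not the cited paper's construction.

```python
# Hedged sketch: set the learning rate for the l-th interval to
# approximately 1 / L_l, where L_l is a local smoothness estimate.
import numpy as np

def local_smoothness(grad_prev, grad_curr, theta_prev, theta_curr, eps=1e-12):
    """Estimate a local Lipschitz constant of the gradient from two iterates."""
    return np.linalg.norm(grad_curr - grad_prev) / (
        np.linalg.norm(theta_curr - theta_prev) + eps)

def adaptive_learning_rate(grad_prev, grad_curr, theta_prev, theta_curr,
                           eta_min=1e-5, eta_max=1.0):
    """Choose eta_l ≈ 1 / L_l, clipped to a safe range."""
    L_l = local_smoothness(grad_prev, grad_curr, theta_prev, theta_curr)
    return float(np.clip(1.0 / (L_l + 1e-12), eta_min, eta_max))
```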
“…In this work we instead take inspiration from more recent literature [37] and build stochastic models which include the effect of a decreasing learning rate into the drift and the volatility coefficients through the adjustment function ψ(·). This allows, in contrast to the ODE method, to provide non-asymptotic arguments and convergence rates.…”
Section: Approximation Guarantees
confidence: 99%
“…Decreasing mini-batch size. From Thm. 1, 2, 3, it is clear that, as is well known [10, 7], a simple way to converge to a local minimizer is to pick b(·) increasing as a function of time. However, this corresponds to dramatically increasing the complexity in terms of gradient computations.…”
Section: Continuous-time Analysis
confidence: 99%
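The statement above points out that an increasing batch-size schedule b(·) aids convergence but inflates the total number of gradient evaluations. A small sketch of that trade-off follows, assuming a geometric schedule, which is an illustrative choice rather than the one analyzed in the cited work.

```python
# Hedged sketch: an increasing batch-size schedule and its cumulative
# cost in gradient computations.
def batch_size_schedule(t, b0=32, growth=1.05, b_max=8192):
    """Geometrically increasing batch size as a function of iteration t."""
    return min(int(b0 * growth ** t), b_max)

def total_gradient_evaluations(T, **kwargs):
    """Total number of per-example gradient computations after T iterations."""
    return sum(batch_size_schedule(t, **kwargs) for t in range(T))
```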
“…We note that this could also be useful in other problems involving statistics of individual gradients, e.g. computing the gradient variance (Zhao & Zhang, 2015; Balles et al., 2016; Mahsereci & Hennig, 2017; Balles & Hennig, 2018), which is out of our scope.…”
Section: Introduction
confidence: 99%
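The last statement mentions computing the variance of individual gradients. As a concrete illustration, the sketch below does this for a linear least-squares model; the model choice and helper names are assumptions made only to show how per-example gradients feed that statistic.

```python
# Hedged sketch: per-example gradients and their variance for a
# linear least-squares loss 0.5 * (x·theta - y)^2.
import numpy as np

def per_example_gradients(theta, X, y):
    """Gradient for each example, one row per example (shape: n x d)."""
    residuals = X @ theta - y          # shape: (n,)
    return residuals[:, None] * X      # shape: (n, d)

def gradient_variance(theta, X, y):
    """Trace of the diagonal covariance of the individual gradients."""
    grads = per_example_gradients(theta, X, y)
    return grads.var(axis=0, ddof=1).sum()
```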