2018
DOI: 10.48550/arxiv.1804.07612
Preprint

Revisiting Small Batch Training for Deep Neural Networks

Abstract: Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide improved generalization performance and allows a significantly smaller memory footprint, which might also be exploited to improve machine throughput. In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental compa…
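
As context for the learning rate scaling assumptions the abstract refers to, here is a minimal sketch of the linear scaling heuristic commonly assumed when the mini-batch size changes; the function name and base values are illustrative placeholders, not taken from the paper.

# Sketch of the linear learning-rate scaling heuristic: scale the learning
# rate in proportion to the mini-batch size. Base values are placeholders.
def scaled_learning_rate(base_lr: float, base_batch: int, batch_size: int) -> float:
    return base_lr * batch_size / base_batch

# Example: a base learning rate of 0.1 tuned for a batch size of 128.
for m in (16, 32, 128, 512, 1024):
    print(m, scaled_learning_rate(0.1, 128, m))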

Cited by 149 publications (198 citation statements)
References 19 publications
“…Moreover, large-cohort training can introduce fundamental optimization and generalization issues. Our results are reminiscent of work on large-batch training in centralized settings, where larger batches can stagnate convergence improvements (Dean et al., 2012; You et al., 2017; Golmant et al., 2018; McCandlish et al., 2018; Yin et al., 2018), and even lead to generalization issues with deep neural networks (Shallue et al., 2019; Ma et al., 2018; Keskar et al., 2017; Hoffer et al., 2017; Masters and Luschi, 2018; Lin et al., 2019, 2020). While some of the challenges we identify with large-cohort training are parallel to issues that arise in large-batch centralized learning, others are unique to federated learning and have not been previously identified in the literature.…”
Section: Introduction
confidence: 58%
“…This property of diminishing returns has been explored both empirically (Dean et al., 2012; McCandlish et al., 2018; Golmant et al., 2018; Shallue et al., 2019) and theoretically (Ma et al., 2018; Yin et al., 2018). Beyond the issue of speedup saturation, numerous works have also observed a generalization gap when training deep neural networks with large batches (Keskar et al., 2017; Hoffer et al., 2017; You et al., 2017; Masters and Luschi, 2018; Lin et al., 2019, 2020). Our work differs from these areas by specifically exploring how the cohort size (the number of selected clients) affects federated optimization methods.…”
Section: Related Work
confidence: 99%
“…We now turn to the question of how the learning parameters of our networks are estimated. The learning procedure consists of seeking an optimal parameter vector θ that minimizes the energy J_1 (or J_k, k > 1, depending on the problem) using a mini-batch stochastic gradient descent algorithm with adaptive momentum [55,56,57] on a set of training data.…”
Section: Reaction Network Based On a 1D Multilayer Perceptron
confidence: 99%
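
The training loop this excerpt describes, mini-batch stochastic gradient descent with adaptive momentum (e.g. Adam), can be sketched as below; the model, synthetic data, and mean-squared-error loss are stand-ins for the cited work's network and energy J_1, not its actual code.

# Hedged sketch of mini-batch training with an adaptive-momentum optimizer.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))  # stand-in 1D MLP
data = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))        # synthetic training set
loader = DataLoader(data, batch_size=32, shuffle=True)                 # small mini-batches
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)              # adaptive momentum

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)  # placeholder for the energy J_1
        loss.backward()
        optimizer.step()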
“…Unlike [9], we use standard batch normalization (BN) rather than synchronized BN in our experiments, and we find that synchronized BN slightly degrades performance. A possible reason is that the large effective batch size obtained when synchronizing BN leads to a poor local optimum [44], especially for the FAS task in our experiments.…”
Section: B. Implementation Details
confidence: 99%
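
For reference, a hedged sketch of the distinction the excerpt draws: standard BatchNorm computes statistics per device, whereas synchronized BatchNorm aggregates statistics across devices and therefore normalizes over a much larger effective batch. The PyTorch model below is a placeholder, not the cited FAS network.

import torch
from torch import nn

# Standard BN: statistics computed per process/GPU (what the excerpt uses).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

# Synchronized BN: statistics reduced across processes, enlarging the
# effective normalization batch (what the excerpt found slightly harmful).
# Running it requires torch.distributed to be initialized before training.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)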