2018
DOI: 10.48550/arxiv.1811.03600
Preprint

Measuring the Effects of Data Parallelism on Neural Network Training

Abstract: Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the …

Cited by 35 publications (56 citation statements)
References 25 publications
“…A clear picture emerges from these observations. Previous research suggests that in order to effectively leverage larger batch sizes, one has to increase the learning rate in tandem with the batch size [12,7,27,19]. Our results suggest that large values of λ1 place a sharp limit on the maximum possible learning rate and therefore limit the model's ability to leverage data parallelism effectively.…”
Section: The Interaction Between Learning Rate Warmup, Initialization ...
confidence: 80%
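One common heuristic behind "increasing the learning rate in tandem with the batch size" is linear scaling. A minimal sketch in Python, assuming a hypothetical base learning rate and base batch size (the constants below are illustrative, not values from the cited papers):

```python
# Linear learning-rate scaling heuristic: the learning rate grows in proportion
# to the batch size. base_lr and base_batch_size are hypothetical placeholders.

def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
    """Return a learning rate scaled linearly with the batch size."""
    return base_lr * (batch_size / base_batch_size)

if __name__ == "__main__":
    for b in (256, 512, 1024, 4096):
        print(f"batch size {b:5d} -> learning rate {scaled_learning_rate(b):.3f}")
```

Under such a rule, any cap on the usable learning rate (as described in the quoted excerpt) directly caps how far the batch size can be scaled before training becomes unstable.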
“…Previous research has studied the interplay of loss curvature and batch size scaling from various perspectives. Most notably, Shallue et al [27] observe that increasing the batch size yields consistent improvements in training speed until a (problem-dependent) critical batch size is reached; increasing the batch size beyond this threshold yields diminishing improvements in training speed. Zhang et al [35] observe that a simple Noisy Quadratic Model (NQM) is able to capture the empirical behavior observed in [27].…”
Section: The Interaction Between Learning Rate Warmup, Initialization ...
confidence: 99%
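The "perfect scaling up to a critical batch size, then diminishing returns" picture can be illustrated with a toy curve. The functional form below, steps(B) = S_min * (1 + B_crit / B), is an assumption borrowed from simple gradient-noise-scale models rather than the empirical curves measured in [27], and the constants are hypothetical:

```python
# Toy steps-to-target curve: near-perfect scaling when B << b_crit,
# diminishing returns when B >> b_crit. Constants are illustrative only.

def steps_to_target(batch_size, s_min=1_000, b_crit=4_096):
    """Steps needed to reach the goal error under a simple noise-scale model."""
    return s_min * (1 + b_crit / batch_size)

if __name__ == "__main__":
    for b in (64, 256, 1024, 4096, 16384, 65536):
        print(f"batch size {b:6d} -> ~{steps_to_target(b):10.0f} steps")
```

For small batch sizes this curve roughly halves the step count whenever the batch size doubles; past b_crit it flattens toward s_min, matching the diminishing returns described above.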
“…Here we consider behavior as a function of the total number of examples processed, so another way to put this is that doubling the batch size halves the number of steps needed. Shallue et al [2018] and […] refer to this as "perfect scaling".…”
Section: Batch Size-Invariance
confidence: 99%
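A quick way to see what "perfect scaling" means in terms of total work: if doubling the batch size halves the number of steps, the number of examples processed stays constant. A small sketch with hypothetical reference values:

```python
# Perfect scaling: steps * batch_size stays constant, so the total number of
# training examples processed does not change as the batch size grows.
# reference_batch and reference_steps are hypothetical values for illustration.

def steps_under_perfect_scaling(batch_size, reference_batch=256,
                                reference_steps=100_000):
    """Steps to target when batch size and step count trade off exactly."""
    return reference_steps * reference_batch / batch_size

if __name__ == "__main__":
    for b in (256, 512, 1024, 2048):
        steps = steps_under_perfect_scaling(b)
        print(f"batch {b:5d}: {steps:9.0f} steps, "
              f"{b * steps:.0f} examples processed")
```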