Measuring the Effects of Data Parallelism on Neural Network Training

Shallue, Christopher J.; Lee, Jae‐Hoon; Antognini, Joseph F.; Sohl-Dickstein, Jascha; Frostig, Roy; Dahl, George E.

doi:10.48550/arxiv.1811.03600

Cited by 35 publications

(56 citation statements)

References 25 publications


“…A clear picture emerges from these observations. Previous research suggests that in order to effectively leverage larger batch sizes, one has to increase the learning rate in tandem with the batch size [12,7,27,19]. Our results suggest that large values of λ₁ place a sharp limit on the maximum the learning rate possible and therefore, limit the model's ability to leverage data parallelism effectively.…”

Section: The Interaction Between Learning Rate Warmup Initialization ... (mentioning)

confidence: 80%

“…Previous research has studied the interplay of the loss curvature and batch size scaling from various different perspectives. Most notably, Shallue et al [27] observe that increasing the batch size yields consistent improvements in training speed until a (problem-dependent) critical batch size is reached; increasing the batch size beyond this threshold yields diminishing improvements in training speed. Zhang et al [35] observe that a simple Noisy Quadratic Model (NQM) is able to capture the empirical behavior observed in [27].…”

Section: The Interaction Between Learning Rate Warmup Initialization ... (mentioning)

confidence: 99%

“…Most notably, Shallue et al [27] observe that increasing the batch size yields consistent improvements in training speed until a (problem-dependent) critical batch size is reached; increasing the batch size beyond this threshold yields diminishing improvements in training speed. Zhang et al [35] observe that a simple Noisy Quadratic Model (NQM) is able to capture the empirical behavior observed in [27]. Similarly, McCandlish et al [19] use quadratic approximations to the loss to provide a closed form expression…”

Section: The Interaction Between Learning Rate Warmup Initialization ... (mentioning)

confidence: 99%

“…4 We then measure the number of training steps required to reach 85% validation accuracy, and the optimal learning rate found for each batch size. Similar to [27], we normalize the plotted steps to 85% accuracy by the value measured at batch size 64.…”

Section: The Interaction Between Learning Rate Warmup Initialization ... (mentioning)

confidence: 99%
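The normalization described in the statement above can be sketched in a few lines of Python. This is only an illustration: the function name `normalize_steps` and the step counts are hypothetical and are not taken from either paper; only the reference batch size of 64 comes from the quoted text.

```python
def normalize_steps(steps_by_batch, reference_batch=64):
    """Divide each steps-to-target count by the count at the reference batch size."""
    ref = steps_by_batch[reference_batch]
    return {batch: steps / ref for batch, steps in sorted(steps_by_batch.items())}


# Hypothetical measurements (illustrative only): batch size -> steps to reach
# the target validation accuracy.
example = {64: 8000, 128: 4000, 256: 2000, 512: 1400}

normalized = normalize_steps(example)
for batch, ratio in normalized.items():
    print(f"batch size {batch}: {ratio:.3f} of the steps needed at batch size 64")
```

After normalization the reference batch size always maps to 1.0, so curves from different workloads can be compared on the same axis.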


A Loss Curvature Perspective on Training Instability in Deep Learning

Gilmer, Ghorbani, Garg et al. 2021

Preprint


In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid, or navigate out of, regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.


“…Here we consider behavior as a function of the total number of examples processed, so another way to put this is that doubling the batch size halves the number of steps needed. Shallue et al [2018] and refer to this as "perfect scaling".…”

Section: Batch Size-invariance (mentioning)

confidence: 99%
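The "perfect scaling" relationship quoted above (doubling the batch size halves the number of steps, so the total number of examples processed stays constant) can be written out as a small Python sketch. The function names and the baseline numbers are hypothetical and chosen only for illustration; they do not come from the cited papers.

```python
def perfect_scaling_steps(base_steps, base_batch, batch):
    """Steps needed at `batch` if steps shrink in exact proportion to batch size."""
    return base_steps * base_batch / batch


def examples_processed(steps, batch):
    """Total training examples consumed after `steps` updates at `batch`."""
    return steps * batch


# Hypothetical baseline (illustrative only): 8000 steps at batch size 64.
base_steps, base_batch = 8000, 64

for batch in (64, 128, 256):
    steps = perfect_scaling_steps(base_steps, base_batch, batch)
    total = examples_processed(steps, batch)
    print(f"batch size {batch}: {steps:.0f} steps, {total:.0f} examples")
```

Under perfect scaling every row reports the same number of examples; measured step counts that fall above this line indicate the diminishing returns described in the earlier statements.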

Batch size-invariance for policy optimization

Hilton, Cobbe, John 2021

Preprint


We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.


Federated Visual Classification with Real-World Data Distribution

Hsu, Brown 2020

Lecture Notes in Computer Science

143


Federated Learning enables visual models to be trained on-device, bringing advantages for user privacy (data need never leave the device), but challenges in terms of data diversity and quality. Whilst typical models in the datacenter are trained using data that are independent and identically distributed (IID), data at source are typically far from IID. Furthermore, differing quantities of data are typically available at each device (imbalance). In this work, we characterize the effect these real-world data distributions have on distributed learning, using as a benchmark the standard Federated Averaging (FedAvg) algorithm. To do so, we introduce two new large-scale datasets for species and landmark classification, with realistic per-user data splits that simulate real-world edge learning scenarios. We also develop two new algorithms (FedVC, FedIR) that intelligently resample and reweight over the client pool, bringing large improvements in accuracy and stability in training.
