2018
DOI: 10.48550/arxiv.1808.01371
Preprint

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Abstract: Recent work has shown how to train Convolutional Neural Networks (CNNs) rapidly on large image datasets [1], then transfer the knowledge gained from these models to a variety of tasks [2]. Following [3], in this work, we demonstrate similar scalability and transfer for Recurrent Neural Networks (RNNs) for Natural Language tasks. By utilizing mixed precision arithmetic and a 32k batch size distributed across 128 NVIDIA Tesla V100 GPUs, we are able to train a character-level 4096-dimension multiplicative LSTM (mLSTM) …
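
The headline result in the abstract rests on two ingredients: FP16/FP32 mixed precision arithmetic and a very large (32k) global batch spread over 128 GPUs. The following is a minimal, single-GPU sketch of that recipe using PyTorch's automatic mixed precision; the nn.LSTM stand-in (the paper uses a multiplicative LSTM), the reduced sizes, and every hyperparameter here are illustrative assumptions, not the authors' code.

# Illustrative sketch only: single-GPU mixed-precision training of a
# character-level LSTM language model. The paper trains a 4096-d multiplicative
# LSTM with a 32k global batch over 128 V100s; nn.LSTM and the small sizes
# below are stand-ins, not the authors' configuration.
import torch
import torch.nn as nn

VOCAB = 256          # byte/character vocabulary
HIDDEN = 1024        # paper uses 4096; reduced here for illustration
SEQ_LEN, BATCH = 256, 32

class CharLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)  # stand-in for mLSTM
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CharLSTM().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    # Random bytes stand in for the Amazon Reviews character stream.
    x = torch.randint(0, VOCAB, (BATCH, SEQ_LEN), device=device)
    y = torch.roll(x, shifts=-1, dims=1)  # next-character targets

    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        logits = model(x)
        loss = loss_fn(logits.reshape(-1, VOCAB), y.reshape(-1))
    # GradScaler keeps FP16 gradients from underflowing, then unscales before the step.
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()

In the paper's distributed setting the same loop would additionally wrap the model in DistributedDataParallel and warm up the learning rate for the 32k global batch; those pieces are omitted here for brevity.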

Cited by 4 publications (8 citation statements)
References 17 publications
“…Whereas several groups have successfully used this rule to train on the ImageNet dataset in under an hour, e.g. [9,33], applying this heuristic to other datasets has not led to similarly impressive results so far [24].…”
Section: Background and Related Work
confidence: 99%
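
The "rule" quoted here is presumably the linear learning-rate scaling heuristic popularized by hour-scale ImageNet training: grow the learning rate in proportion to the global batch size, usually with a warmup phase. A minimal sketch of that relation, with symbols chosen here for illustration:

\eta = \eta_{\text{base}} \cdot \frac{B}{B_{\text{base}}}

where B is the global batch size, B_base is the reference batch size at which \eta_base was tuned, and the scaled rate is typically reached gradually over a few warmup epochs.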
“…Neither the LSTM on WikiText-2 nor DRN-D-22 on Cityscapes can reach their respective baseline performances after reasonably small batch sizes of about 32 and 64, respectively. Although [24] showed that training on the Amazon Reviews dataset [22] can be done within 4 hours, they tune hyper-parameters heavily. This poses an issue for many practical deployments because these problems are often already slow to train.…”
Section: Diminishing Returns In Rates Of Convergence
confidence: 99%
“…An emerging trend in natural language processing is to train a language model in an unsupervised fashion on a large corpus of text, and then to fine-tune the model for a specific task [Devlin et al 2018; Puri et al 2018; Radford et al 2018]. The language model often takes the form of an LSTM [Jozefowicz et al 2016] or a Transformer network [Vaswani et al 2017].…”
Section: Introduction
confidence: 99%
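
The pretrain-then-fine-tune pattern this statement describes can be reduced to a few lines: reuse the language model's hidden states as features and train a small task head on top. The sketch below builds on the CharLSTM class from the earlier sketch and assumes a binary sentiment task; both choices are illustrative assumptions, not the procedure of any of the cited papers.

# Illustrative fine-tuning sketch: reuse a pretrained character LM as a feature
# extractor and train a small classifier head on top. CharLSTM and HIDDEN are
# carried over from the earlier sketch; the sentiment task is an assumption.
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    def __init__(self, lm, hidden, num_classes=2, freeze_lm=True):
        super().__init__()
        self.lm = lm
        if freeze_lm:                        # optionally keep the LM weights fixed
            for p in self.lm.parameters():
                p.requires_grad_(False)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        h, _ = self.lm.lstm(self.lm.embed(x))   # reuse LM hidden states as features
        return self.classifier(h[:, -1])        # classify from the final timestep

model = SentimentHead(CharLSTM(), HIDDEN)       # in practice, load pretrained LM weights
opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)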
“…Oftentimes, a practitioner will sacrifice their batch size for a larger, more expressive model. For example, [Puri et al 2018] showed that doubling the dimensionality of a multiplicative LSTM [Krause et al 2016] from 4096 to 8192 forces them to reduce the batch size per GPU by 4×.
Section: Introduction
confidence: 99%
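
The 4× figure quoted here is consistent with simple arithmetic: LSTM-style weight matrices, together with their gradients and optimizer state, grow roughly quadratically in the hidden dimension, so doubling it from 4096 to 8192 roughly quadruples the memory they consume and squeezes what is left for per-sample activations. A rough back-of-envelope check (the 4-gate parameter count below is an approximation, not the paper's exact mLSTM architecture):

# Back-of-envelope memory scaling for an LSTM-style recurrent layer:
# roughly 4 gates, each with an input weight matrix, a recurrent weight
# matrix (both hidden x hidden), and a bias, so parameters grow roughly
# quadratically in the hidden size.
def approx_lstm_params(hidden):
    return 4 * (hidden * hidden + hidden * hidden + hidden)

small, large = approx_lstm_params(4096), approx_lstm_params(8192)
print(large / small)   # ~4.0, in line with the ~4x smaller per-GPU batch reported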