2018
DOI: 10.48550/arxiv.1808.01371
Preprint

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Abstract: Recent work has shown how to train Convolutional Neural Networks (CNNs) rapidly on large image datasets [1], then transfer the knowledge gained from these models to a variety of tasks [2]. Following [3], in this work, we demonstrate similar scalability and transfer for Recurrent Neural Networks (RNNs) for Natural Language tasks. By utilizing mixed precision arithmetic and a 32k batch size distributed across 128 NVIDIA Tesla V100 GPUs, we are able to train a character-level 4096-dimension multiplicative LSTM (mLSTM) …
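
The headline result in the abstract rests on two ingredients: FP16/FP32 mixed precision arithmetic and a very large (32k) global batch spread over 128 GPUs. The following is a minimal, single-GPU sketch of that recipe using PyTorch's automatic mixed precision; the nn.LSTM stand-in (the paper uses a multiplicative LSTM), the reduced sizes, and every hyperparameter here are illustrative assumptions, not the authors' code.

# Illustrative sketch only: single-GPU mixed-precision training of a
# character-level LSTM language model. The paper trains a 4096-d multiplicative
# LSTM with a 32k global batch over 128 V100s; nn.LSTM and the small sizes
# below are stand-ins, not the authors' configuration.
import torch
import torch.nn as nn

VOCAB = 256          # byte/character vocabulary
HIDDEN = 1024        # paper uses 4096; reduced here for illustration
SEQ_LEN, BATCH = 256, 32

class CharLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)  # stand-in for mLSTM
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CharLSTM().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    # Random bytes stand in for the Amazon Reviews character stream.
    x = torch.randint(0, VOCAB, (BATCH, SEQ_LEN), device=device)
    y = torch.roll(x, shifts=-1, dims=1)  # next-character targets

    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        logits = model(x)
        loss = loss_fn(logits.reshape(-1, VOCAB), y.reshape(-1))
    # GradScaler keeps FP16 gradients from underflowing, then unscales before the step.
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()

In the paper's distributed setting the same loop would additionally wrap the model in DistributedDataParallel and warm up the learning rate for the 32k global batch; those pieces are omitted here for brevity.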

Cited by 4 publications (8 citation statements)
References 17 publications
“…Whereas several groups have successfully used this rule to train on the ImageNet dataset in under an hour, e.g. [9,33], applying this heuristic to other datasets has not led to similarly impressive results so far [24].…”
Section: Background and Related Work
confidence: 99%
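
The "rule" quoted here is presumably the linear learning-rate scaling heuristic popularized by hour-scale ImageNet training: grow the learning rate in proportion to the global batch size, usually with a warmup phase. A minimal sketch of that relation, with symbols chosen here for illustration:

\eta = \eta_{\text{base}} \cdot \frac{B}{B_{\text{base}}}

where B is the global batch size, B_base is the reference batch size at which \eta_base was tuned, and the scaled rate is typically reached gradually over a few warmup epochs.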
“…Neither the LSTM on WikiText-2 nor DRN-D-22 on Cityscapes can reach their respective baseline performances after reasonably small batch sizes of about 32 and 64, respectively. Although [24] showed that training on the Amazon Reviews dataset [22] can be done within 4 hours, they tune hyper-parameters heavily. This poses an issue for many practical deployments because these problems are often already slow to train.…”
Section: Diminishing Returns In Rates Of Convergence
confidence: 99%
“…An emerging trend in natural language processing is to train a language model in an unsupervised fashion on a large corpus of text, and then to fine-tune the model for a specific task [Devlin et al 2018; Puri et al 2018; Radford et al 2018]. The language model often takes the form of an LSTM [Jozefowicz et al 2016] or a Transformer network [Vaswani et al 2017].…”
Section: Introduction
confidence: 99%
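
The pretrain-then-fine-tune pattern this statement describes can be reduced to a few lines: reuse the language model's hidden states as features and train a small task head on top. The sketch below builds on the CharLSTM class from the earlier sketch and assumes a binary sentiment task; both choices are illustrative assumptions, not the procedure of any of the cited papers.

# Illustrative fine-tuning sketch: reuse a pretrained character LM as a feature
# extractor and train a small classifier head on top. CharLSTM and HIDDEN are
# carried over from the earlier sketch; the sentiment task is an assumption.
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    def __init__(self, lm, hidden, num_classes=2, freeze_lm=True):
        super().__init__()
        self.lm = lm
        if freeze_lm:                        # optionally keep the LM weights fixed
            for p in self.lm.parameters():
                p.requires_grad_(False)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        h, _ = self.lm.lstm(self.lm.embed(x))   # reuse LM hidden states as features
        return self.classifier(h[:, -1])        # classify from the final timestep

model = SentimentHead(CharLSTM(), HIDDEN)       # in practice, load pretrained LM weights
opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)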
“…Oftentimes, a practitioner will sacrifice their batch size for a larger, more expressive model. For example, [Puri et al 2018] showed that doubling the dimensionality of a multiplicative LSTM [Krause et al 2016] from 4096 to 8192 forces them to reduce the batch size per GPU by 4×.
Section: Introduction
confidence: 99%
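
The 4× figure quoted here is consistent with simple arithmetic: LSTM-style weight matrices, together with their gradients and optimizer state, grow roughly quadratically in the hidden dimension, so doubling it from 4096 to 8192 roughly quadruples the memory they consume and squeezes what is left for per-sample activations. A rough back-of-envelope check (the 4-gate parameter count below is an approximation, not the paper's exact mLSTM architecture):

# Back-of-envelope memory scaling for an LSTM-style recurrent layer:
# roughly 4 gates, each with an input weight matrix, a recurrent weight
# matrix (both hidden x hidden), and a bias, so parameters grow roughly
# quadratically in the hidden size.
def approx_lstm_params(hidden):
    return 4 * (hidden * hidden + hidden * hidden + hidden)

small, large = approx_lstm_params(4096), approx_lstm_params(8192)
print(large / small)   # ~4.0, in line with the ~4x smaller per-GPU batch reported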