Mixed Precision Training
Preprint, 2017
DOI: 10.48550/arxiv.1710.03740

Cited by 284 publications (240 citation statements). References 10 publications.

“…We train the upsampling model for 1.6M iterations at batch size 512. We find that these models train stably with 16-bit precision and traditional loss scaling (Micikevicius et al., 2017). The total training compute is roughly equal to that used to train DALL-E.…”
Section: Text-conditional Diffusion Models (mentioning)
confidence: 89%
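
The excerpt above uses the recipe from Micikevicius et al. (2017): run the forward and backward passes in FP16, keep an FP32 master copy of the weights, and multiply the loss by a constant factor so small gradients do not underflow. Below is a minimal PyTorch sketch of static loss scaling; the model, data, and the scale factor of 1024 are illustrative placeholders rather than details from the cited work, and a CUDA device is assumed.

import torch

SCALE = 1024.0  # static loss-scale factor; chosen empirically in practice

# FP16 model for the forward/backward pass, plus an FP32 master copy of the
# weights that the optimizer actually updates (placeholder sizes).
model = torch.nn.Linear(512, 512).cuda().half()
master = [p.detach().clone().float() for p in model.parameters()]
optimizer = torch.optim.SGD(master, lr=1e-3)

def train_step(x, y):
    out = model(x.half())
    loss = torch.nn.functional.mse_loss(out.float(), y)
    (loss * SCALE).backward()                 # scale so tiny gradients stay representable in FP16
    for p, m in zip(model.parameters(), master):
        m.grad = p.grad.float() / SCALE       # unscale into the FP32 master gradients
        p.grad = None
    optimizer.step()
    optimizer.zero_grad()
    for p, m in zip(model.parameters(), master):
        p.data.copy_(m.half())                # refresh the FP16 working copy of the weights
    return loss.item()

x = torch.randn(32, 512, device="cuda")
y = torch.randn(32, 512, device="cuda")
train_step(x, y)
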
“…[flattened data-source table omitted: LibriSpeech [18], Common Voice [19], Libri-Light [20], Fisher [21]] Following prior work on scaling Transformer models [1,10,11], we scale the encoder of an E2E VGG-transformer transducer model [12,13] up to 10B parameters. We leverage several techniques to train our transducer models efficiently on GPUs: FairScale model sharding [14], sparse alignment restricted transducer loss [15], mixed-precision training [16], and large batch sizes [17].…”
Section: Data Source (mentioning)
confidence: 99%
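
The excerpt above combines mixed-precision training with large batch sizes. As a hedged illustration (not the authors' training code; the model, optimizer, and synthetic batches are placeholders), PyTorch's torch.cuda.amp utilities apply the same mixed-precision idea with dynamic loss scaling, and gradient accumulation is one common way to reach a large effective batch on a single GPU.

import torch

model = torch.nn.Linear(256, 256).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling
ACCUM_STEPS = 8                        # micro-batches per optimizer step (8 x 32 = 256 effective)

def batches():
    for _ in range(64):                # synthetic stand-in for a real data loader
        yield (torch.randn(32, 256, device="cuda"),
               torch.randn(32, 256, device="cuda"))

for step, (x, y) in enumerate(batches()):
    with torch.cuda.amp.autocast():                        # FP16 compute where it is safe
        loss = torch.nn.functional.mse_loss(model(x), y) / ACCUM_STEPS
    scaler.scale(loss).backward()                          # backward on the scaled loss
    if (step + 1) % ACCUM_STEPS == 0:
        scaler.step(optimizer)                             # unscales grads; skips the step on inf/NaN
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
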
“…It is well known that small floating point error does not dramatically affect the convergence and final accuracy of ML models [16,20,24,72]. This observation has motivated extensive prior research about training with low or mixed-precision FP operations [20,26,47,51,80,120] and compression or quantization [36,40,45,72].…”
Section: Characteristics of Training Gradients (mentioning)
confidence: 99%
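
The observation quoted above, that training tolerates small floating-point error while very small values still need care, can be seen directly in the FP16 format. The snippet below is a standalone illustration with arbitrary example values, not code from any of the cited papers.

import torch

g = torch.tensor([1e-3, 1e-5, 1e-8])      # example gradient magnitudes in FP32
print(g.half())                            # 1e-3 and 1e-5 round only slightly; 1e-8 underflows to 0 in FP16
print((g * 1024).half().float() / 1024)    # scaling before the cast keeps the 1e-8 entry nonzero
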
“…It also lacks flexibility: it is tied to specific operations on specific floating-point formats. New ML-specific numeric representations (e.g., FP16 [80,109], bfloat16 [22,31,54], TF32 [87], and MSFP [19]) represent an area of ongoing innovation, and adding support for a new format requires developing and manufacturing a new ASIC, an expensive and time-consuming endeavor. For example, it took four years for Mellanox to release its second version of switches with floating point support [32,33].…”
Section: Introduction (mentioning)
confidence: 99%
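
For context on the formats named in this excerpt, FP16 and bfloat16 trade off range against precision in opposite ways, and their limits can be inspected directly. The short comparison below is illustrative only; TF32 and MSFP are hardware-side formats that PyTorch does not expose as tensor dtypes in the same way.

import torch

# bfloat16 keeps FP32's exponent range but only ~2-3 decimal digits of precision;
# FP16 has finer precision but a far smaller representable range.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  min_normal={info.tiny:.3e}  eps={info.eps:.3e}")
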