SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 2020
DOI: 10.1109/sc41405.2020.00024

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Cited by 426 publications (301 citation statements)
References 5 publications
“…Bi-directionality crucial for protein LMs: In NLP unidirectional (auto-regressive) and bi-directional (autoencoding) models perform on par [12], [93]. In contrast, the bi-directional context appeared crucial to model aspects of the language of life.…”
Section: Protein LMs Top Without MSAs
confidence: 99%
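As context for the unidirectional/bi-directional distinction this excerpt draws, here is a minimal sketch of the two attention-mask patterns behind auto-regressive and autoencoding Transformers. The helper name and the use of PyTorch are illustrative assumptions, not taken from the cited works.

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """Boolean mask where True marks positions a token may attend to.

    causal=True  -> unidirectional (auto-regressive) model: each token sees
                    only itself and earlier positions.
    causal=False -> bi-directional (autoencoding) model: every token sees
                    the whole sequence.
    """
    if causal:
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

print(attention_mask(4, causal=True).int())   # lower-triangular mask
print(attention_mask(4, causal=False).int())  # all-ones mask
```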
“…However, the large size of pretrained models makes this approach exceedingly parameter inefficient. For example, widely-adopted models such as BERT BASE and BERT LARGE have 110M and 340M parameters respectively, while their contemporaries have parameter counts in the billions (Raffel et al., 2020; Shoeybi et al., 2019; Rajbhandari et al., 2019). Storing the fully finetuned models therefore becomes difficult even for a moderate number of tasks.…”
Section: Background: Transfer Learning
confidence: 99%
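To make the storage argument in this excerpt concrete, here is a back-of-the-envelope sketch: the parameter count comes from the excerpt, while the fp32 byte width and the task counts are illustrative assumptions.

```python
def full_finetune_storage_gb(n_params: float, n_tasks: int, bytes_per_param: int = 4) -> float:
    """Storage needed if every task keeps its own fully fine-tuned copy of the model (fp32 weights)."""
    return n_params * bytes_per_param * n_tasks / 1e9

# BERT LARGE (340M parameters), one full fine-tuned copy per task:
for tasks in (1, 10, 100):
    print(f"{tasks:3d} tasks -> {full_finetune_storage_gb(340e6, tasks):6.1f} GB")
# 1 task is ~1.4 GB of weights; 100 tasks already require ~136 GB of checkpoints.
```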
“…Nevertheless, the size of recent DNNs has grown far beyond a single GPU's capacity, driving researchers to conduct studies [19], [21] on model parallelism. To support large DNN training with data parallelism, DeepSpeed [38] partitions a DNN's parameter and optimizer state across workers and transfers the state on demand during training. DeepSpeed [38] reported a 1.5x network communication volume compared with a typical data-parallel system (e.g., Parameter Server). Compared with data parallelism, pipeline parallelism (e.g., VPIPE) incurs much less network communication volume [19], [33] and better scalability during large DNN training [19] (see §6.2).…”
Section: Related Work
confidence: 99%
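To illustrate the partitioning idea this excerpt attributes to DeepSpeed/ZeRO, here is a minimal sketch of sharding a flat optimizer state evenly across data-parallel ranks. The function and variable names are illustrative assumptions; the full system also partitions gradients and parameters and fetches them on demand, which is where the reported 1.5x communication volume arises.

```python
import numpy as np

def state_shard(num_elems: int, world_size: int, rank: int) -> slice:
    """Slice of the flat optimizer state (e.g., Adam momentum/variance) owned by one rank.

    Instead of every data-parallel worker holding a full replica, each rank keeps
    only ~1/world_size of the state, as in optimizer-state partitioning.
    """
    shard = (num_elems + world_size - 1) // world_size  # ceiling division
    start = rank * shard
    return slice(start, min(start + shard, num_elems))

num_elems, world_size = 10_000, 4
momentum = np.zeros(num_elems, dtype=np.float32)  # full state shown only for illustration
for rank in range(world_size):
    s = state_shard(num_elems, world_size, rank)
    owned = momentum[s]
    print(f"rank {rank} owns elements [{s.start}:{s.stop}] "
          f"({owned.size / num_elems:.0%} of the optimizer state)")
```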