SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 2020
DOI: 10.1109/sc41405.2020.00024

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Cited by 426 publications (301 citation statements)
References 5 publications
“…Bi-directionality crucial for protein LMs: In NLP unidirectional (auto-regressive) and bi-directional (autoencoding) models perform on par [12], [93]. In contrast, the bi-directional context appeared crucial to model aspects of the language of life.…”
Section: Protein LMs Top Without MSAs
confidence: 99%
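As context for the unidirectional/bi-directional distinction this excerpt draws, here is a minimal sketch of the two attention-mask patterns behind auto-regressive and autoencoding Transformers. The helper name and the use of PyTorch are illustrative assumptions, not taken from the cited works.

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """Boolean mask where True marks positions a token may attend to.

    causal=True  -> unidirectional (auto-regressive) model: each token sees
                    only itself and earlier positions.
    causal=False -> bi-directional (autoencoding) model: every token sees
                    the whole sequence.
    """
    if causal:
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

print(attention_mask(4, causal=True).int())   # lower-triangular mask
print(attention_mask(4, causal=False).int())  # all-ones mask
```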
“…However, the large size of pretrained models makes this approach exceedingly parameter inefficient. For example, widely-adopted models such as BERT BASE and BERT LARGE have 110M and 340M parameters respectively, while their contemporaries have parameter counts in the billions (Raffel et al., 2020; Shoeybi et al., 2019; Rajbhandari et al., 2019). Storing the fully finetuned models therefore becomes difficult even for a moderate number of tasks.…”
Section: Background: Transfer Learning
confidence: 99%
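To make the storage argument in this excerpt concrete, here is a back-of-the-envelope sketch: the parameter count comes from the excerpt, while the fp32 byte width and the task counts are illustrative assumptions.

```python
def full_finetune_storage_gb(n_params: float, n_tasks: int, bytes_per_param: int = 4) -> float:
    """Storage needed if every task keeps its own fully fine-tuned copy of the model (fp32 weights)."""
    return n_params * bytes_per_param * n_tasks / 1e9

# BERT LARGE (340M parameters), one full fine-tuned copy per task:
for tasks in (1, 10, 100):
    print(f"{tasks:3d} tasks -> {full_finetune_storage_gb(340e6, tasks):6.1f} GB")
# 1 task is ~1.4 GB of weights; 100 tasks already require ~136 GB of checkpoints.
```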
“…Nevertheless, the size of recent DNNs has grown far beyond a single GPU's capacity, driving researchers to conduct studies [19], [21] on model parallelism. To support large DNN training with data parallelism, DeepSpeed [38] partitions a DNN's parameter and optimizer state across workers and transfers the state on demand during training. DeepSpeed [38] reported a 1.5x network communication volume compared with a typical data-parallel system (e.g., Parameter Server). Compared with data parallelism, pipeline parallelism (e.g., VPIPE) incurs much less network communication volume [19], [33] and better scalability during large DNN training [19] (see §6.2).…”
Section: Related Work
confidence: 99%
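To illustrate the partitioning idea this excerpt attributes to DeepSpeed/ZeRO, here is a minimal sketch of sharding a flat optimizer state evenly across data-parallel ranks. The function and variable names are illustrative assumptions; the full system also partitions gradients and parameters and fetches them on demand, which is where the reported 1.5x communication volume arises.

```python
import numpy as np

def state_shard(num_elems: int, world_size: int, rank: int) -> slice:
    """Slice of the flat optimizer state (e.g., Adam momentum/variance) owned by one rank.

    Instead of every data-parallel worker holding a full replica, each rank keeps
    only ~1/world_size of the state, as in optimizer-state partitioning.
    """
    shard = (num_elems + world_size - 1) // world_size  # ceiling division
    start = rank * shard
    return slice(start, min(start + shard, num_elems))

num_elems, world_size = 10_000, 4
momentum = np.zeros(num_elems, dtype=np.float32)  # full state shown only for illustration
for rank in range(world_size):
    s = state_shard(num_elems, world_size, rank)
    owned = momentum[s]
    print(f"rank {rank} owns elements [{s.start}:{s.stop}] "
          f"({owned.size / num_elems:.0%} of the optimizer state)")
```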