2021
DOI: 10.48550/arxiv.2105.04663
Preprint

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Abstract: We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computation graphs. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partitioning is simple yet general, allowing it to express different or mixed paradigms of parallelism on a wide variety of models. GSPMD infers the partitioning for every operator…
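The annotation-based workflow the abstract describes can be illustrated with JAX, whose sharding annotations are lowered to GSPMD inside the XLA compiler. The sketch below is not taken from the paper; the 8-device 2x4 mesh, the axis names "data" and "model", and the layer shapes are illustrative assumptions.

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Illustrative assumption: 8 devices arranged as a 2x4 mesh with axis
# names "data" and "model"; the layer shapes are arbitrary.
devices = np.array(jax.devices()).reshape(2, 4)
mesh = Mesh(devices, axis_names=("data", "model"))

@jax.jit
def layer(x, w):
    # A single user hint: keep the activation sharded along the batch
    # ("data") mesh axis. The compiler propagates shardings to the other
    # tensors and inserts whatever collectives the partitioning requires.
    x = jax.lax.with_sharding_constraint(x, NamedSharding(mesh, P("data", None)))
    return jnp.dot(x, w)

y = layer(jnp.ones((8, 1024)), jnp.ones((1024, 1024)))

The point of the sketch is the division of labor: the user writes single-device code plus a few sharding hints, and the compiler handles partitioning the rest of the graph.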

Cited by 19 publications (34 citation statements)
References 17 publications
“…Communication is thus required to fetch the input data from other devices. When the tensors are partitioned evenly, i.e., SPMD [52], all devices follow the same collective communication patterns such as all-reduce, all-gather, and all-to-all. Pipeline parallelism.…”
Section: Conventional View of ML Parallelism (mentioning, confidence: 99%)
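As a concrete illustration of the collectives the quoted passage mentions, the following sketch (not from the paper) partitions a matmul's contraction dimension across devices, so each device computes a partial product and an all-reduce combines them. The mesh axis name "x" and the use of JAX's shard_map are assumptions made for illustration.

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

# Illustrative: a 1-D mesh over all available devices, axis named "x".
mesh = Mesh(jax.devices(), axis_names=("x",))

def partial_matmul(a_block, b_block):
    # Each device holds a [M, K/x] slice of A and a [K/x, N] slice of B,
    # computes a partial [M, N] product, then all-reduces over the axis.
    return jax.lax.psum(a_block @ b_block, axis_name="x")

sharded_matmul = shard_map(
    partial_matmul,
    mesh=mesh,
    in_specs=(P(None, "x"), P("x", None)),  # shard both operands on the contraction dim
    out_specs=P(None, None),                # replicated result
)

On one device the all-reduce is a no-op; on a mesh of several devices it is exactly the all-reduce pattern the citation describes for evenly partitioned (SPMD) tensors.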
“…Manual combination of parallelisms. Recent development shows the approaches mentioned above need to be combined to scale out today's large DL models [36,52]. The state-of-the-art training systems, such as Megatron-LM [36,45], manually design a specialized execution plan that combines these parallelisms for transformer language models.…”
Section: Conventional View of ML Parallelism (mentioning, confidence: 99%)