Proceedings of the 29th Symposium on Operating Systems Principles 2023
DOI: 10.1145/3600006.3613152

Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

Insu Jang, Zhenning Yang, Zhen Zhang, et al.

Abstract: Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least 𝑓 + 1 logically equivalent pipeline replicas to tolerate any 𝑓 simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pi…

Cited by 5 publications
References 35 publications