“…Oobleck matches or significantly outperforms them for a wide range of model sizes and failure frequencies. [11,15,22,29,30,33,37,38,42,45,47,56]. However, they do not provide fault tolerance out-of-thebox and are orthogonal to Oobleck.…”
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planningexecution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least π + 1 logically equivalent pipeline replicas to tolerate any π simultaneous failures. During execution, it relies on alreadyreplicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after π or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to 13.9Γ.
CCS Concepts:β’ Computer systems organization β Dependable and fault-tolerant systems and networks; Distributed architectures; Neural networks.
“…Oobleck matches or significantly outperforms them for a wide range of model sizes and failure frequencies. [11,15,22,29,30,33,37,38,42,45,47,56]. However, they do not provide fault tolerance out-of-thebox and are orthogonal to Oobleck.…”
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planningexecution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least π + 1 logically equivalent pipeline replicas to tolerate any π simultaneous failures. During execution, it relies on alreadyreplicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after π or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to 13.9Γ.
CCS Concepts:β’ Computer systems organization β Dependable and fault-tolerant systems and networks; Distributed architectures; Neural networks.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citationsβcitations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.