2020
DOI: 10.48550/arxiv.2006.16423
Preprint
Efficient Algorithms for Device Placement of DNN Graph Operators

Abstract: Modern machine learning workloads use large models with complex structures that are expensive to execute. The devices that execute these models are becoming increasingly heterogeneous, as domain-specific hardware accelerators flourish alongside CPUs. These trends make it necessary to distribute the workload across multiple devices. Recent work has shown that significant gains can be obtained with model parallelism, i.e., partitioning a neural network's computat…
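The abstract describes model parallelism as partitioning a neural network's computation across devices. As a minimal illustration (not the paper's algorithm), the sketch below places a linear chain of operators onto a fixed number of devices as contiguous groups, minimizing the maximum per-device compute load via dynamic programming; the operator costs and device count are hypothetical.

```python
def place_chain(costs, num_devices):
    """Split `costs` (per-operator compute cost of a linear chain) into
    `num_devices` contiguous groups, minimizing the maximum group sum
    (the bottleneck device load)."""
    n = len(costs)
    prefix = [0] * (n + 1)
    for i, c in enumerate(costs):
        prefix[i + 1] = prefix[i] + c

    INF = float("inf")
    # dp[k][i]: best achievable bottleneck when the first i operators
    # are placed on k devices; cut[k][i] records the split point.
    dp = [[INF] * (n + 1) for _ in range(num_devices + 1)]
    dp[0][0] = 0
    cut = [[0] * (n + 1) for _ in range(num_devices + 1)]
    for k in range(1, num_devices + 1):
        for i in range(1, n + 1):
            for j in range(i):
                load = prefix[i] - prefix[j]  # ops j..i-1 on device k
                cand = max(dp[k - 1][j], load)
                if cand < dp[k][i]:
                    dp[k][i] = cand
                    cut[k][i] = j

    # Recover the partition: walk the recorded split points backwards.
    groups, i = [], n
    for k in range(num_devices, 0, -1):
        j = cut[k][i]
        groups.append(list(range(j, i)))
        i = j
    groups.reverse()
    return dp[num_devices][n], groups

# Example: five operators split across two devices.
bottleneck, groups = place_chain([4, 2, 7, 1, 5], 2)
# bottleneck == 13, groups == [[0, 1], [2, 3, 4]]
```

Real device-placement systems must also model communication cost, memory capacity, and pipelining, which this toy formulation omits.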

Cited by 1 publication (1 citation statement)
References 6 publications
“…FlexFlow [22] uses automatic search to discover the best operator parallelization strategy in the graph. Building on this direction of auto-parallelization, these recent papers [39,60] use optimal synthesis and reinforcement learning to find optimized device placement to further improve parallelism without the need for manual intervention. However, these general systems are not specifically designed for highly sparse recommendation models.…”
Section: Related Work
confidence: 99%