Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters

Lian, Xiangru; Yuan, Binhang; Zhu, Xuefeng; Wang, Yulong; He, Yongjun; Wu, Honghuan; Sun, Lei; Lyu, Hao; Liu, Chengjun; Xiang, Dong; Liao, Yiqiao; Luo, Mingnan; Zhang, Congfei; Xie, Jingru; Li, Haonan; Chen, Lei; Huang, Renjie; Jun, Lin; Shu, C. G.; Xuezhong, Qiu; Liu, Zhishan; Kong, Dongying; Liu, Yuan; Yu, Hai; Yang, Sen; Zhang, Ce; Liu, Ji

doi:10.48550/arxiv.2111.05897

Cited by 3 publications

(3 citation statements)

References 54 publications

(101 reference statements)


“…Embedding tables are commonly used to deal with sparse features in recommendation models [1,2,3,4,5,29,30,31]. However, the extremely large embedding tables are often the storage and efficiency bottlenecks [6,7,8,3,6,9,10,11,32]. To our knowledge, the only two studies that target the embedding table placement problem are RecShard [27] and our previous work AutoShard [33].…”

Section: Related Work (mentioning)

confidence: 99%

AutoShard: Automated Embedding Table Sharding for Recommender Systems

Zha, Liu, Bhushanam, et al. 2022

Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining


Embedding learning is an important technique in deep recommendation models to map categorical features to dense vectors. However, the embedding tables often demand an extremely large number of parameters, which become the storage and efficiency bottlenecks. Distributed training solutions have been adopted to partition the embedding tables across multiple devices. However, the embedding tables can easily lead to imbalances if not carefully partitioned. This is a significant design challenge of distributed systems named embedding table sharding, i.e., how we should partition the embedding tables to balance the costs across devices, which is a non-trivial task because 1) it is hard to efficiently and precisely measure the cost, and 2) the partition problem is known to be NP-hard. In this work, we introduce our novel practice at Meta, namely AutoShard, which uses a neural cost model to directly predict the multi-table costs and leverages deep reinforcement learning to solve the partition problem. Experimental results on an open-sourced large-scale synthetic dataset and Meta's production dataset demonstrate the superiority of AutoShard over the heuristics. Moreover, the learned policy of AutoShard can transfer to sharding tasks with various numbers of tables and different ratios of the unseen tables without any fine-tuning. Furthermore, AutoShard can efficiently shard hundreds of tables in seconds. The effectiveness, transferability, and efficiency of AutoShard make it desirable for production use. Our algorithms have been deployed in Meta's production environment. A prototype is available at https://github.com/daochenzha/autoshard

CCS Concepts: Computing methodologies → Reinforcement learning; Machine learning approaches.
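To make the sharding problem in this abstract concrete, the following is a minimal sketch of a greedy balancing heuristic of the general kind such systems are compared against. It is not AutoShard's method. It assumes, purely for illustration, that each table has a single known cost and that costs add up on a device, which the abstract notes is hard to guarantee in practice. The function name `greedy_shard` and the cost values are hypothetical.

```python
import heapq


def greedy_shard(table_costs, num_devices):
    """Place each table, largest cost first, on the currently least-loaded device.

    Assumption (not from the source): per-table costs are known and additive.
    """
    # Min-heap of (current_load, device_index) so the lightest device pops first.
    heap = [(0.0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement = [[] for _ in range(num_devices)]

    order = sorted(range(len(table_costs)), key=lambda t: table_costs[t], reverse=True)
    for table_id in order:
        load, device = heapq.heappop(heap)
        placement[device].append(table_id)
        heapq.heappush(heap, (load + table_costs[table_id], device))

    loads = [sum(table_costs[t] for t in tables) for tables in placement]
    return placement, loads


# Hypothetical costs for six embedding tables, sharded over two devices.
costs = [9.0, 7.0, 6.0, 5.0, 4.0, 3.0]
placement, loads = greedy_shard(costs, 2)
print(placement, loads)
```

A heuristic like this runs quickly but depends entirely on the quality of the cost estimates, which is the gap the abstract says a learned cost model is meant to address.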


“…We on the other hand implement only synchronous training. A very recent work by [27] introduces a new hybrid sync-async algorithm to train recommender models, unlike this work we only focus on synchronous training. [47] proposes methods for improving data processing while training recommender systems.…”

Section: Related Work (mentioning)

confidence: 99%

BagPipe: Accelerating Deep Recommendation Model Training

Agarwal, Zhang, Venkataraman 2022

Preprint


Deep learning based recommendation models (DLRM) are widely used in several business-critical applications. Training such recommendation models efficiently is challenging, primarily because they consist of billions of embedding-based parameters which are often stored remotely, leading to significant overheads from embedding access. By profiling existing DLRM training, we observe that only 8.5% of the iteration time is spent in the forward/backward pass while the remaining time is spent on embedding and model synchronization. Our key insight in this paper is that accesses to embeddings have a specific structure and pattern which can be used to accelerate training. We observe that embedding accesses are heavily skewed, with almost 1% of embeddings representing more than 92% of total accesses. Further, we observe that during training we can look ahead at future batches to determine exactly which embeddings will be needed at what iteration in the future. Based on these insights, we propose Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation. We designed an Oracle Cacher, a new system component which uses our lookahead algorithm to generate optimal cache update decisions and provide strong consistency guarantees. Our experiments using three datasets and two models show that our approach provides a speedup of up to 6.2x compared to state-of-the-art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.
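The lookahead idea in this abstract can be illustrated with a small sketch: because future batches are known, one can decide before each iteration which embedding ids must be fetched and which cached ids will not be needed again within the lookahead window. This is not Bagpipe's Oracle Cacher and provides none of its consistency guarantees. The function `plan_cache`, its `lookahead` parameter, and the toy batches are all hypothetical.

```python
def plan_cache(batches, lookahead):
    """Return, per iteration, (ids to prefetch, ids to evict afterwards).

    Assumption (not from the source): a batch is a list of embedding ids and
    the cache has no capacity limit; an id is evicted once no batch in the
    next `lookahead` iterations uses it.
    """
    plans = []
    cached = set()
    for i, batch in enumerate(batches):
        # Ids used by the next `lookahead` batches (empty at the end of the stream).
        window = batches[i + 1 : i + 1 + lookahead]
        future = set().union(*window)

        needed = set(batch)
        prefetch = sorted(needed - cached)  # must be fetched before iteration i
        cached |= needed

        evict = sorted(cached - future)  # unused within the lookahead window
        cached -= set(evict)
        plans.append((prefetch, evict))
    return plans


# Toy stream of three batches of embedding ids, looking one batch ahead.
batches = [[1, 2], [2, 3], [1, 4]]
plans = plan_cache(batches, lookahead=1)
for step, (prefetch, evict) in enumerate(plans):
    print(step, "prefetch", prefetch, "evict", evict)
```

With a window of one batch, id 1 is evicted after the first iteration and fetched again for the third; a longer window would keep it cached, which is the trade-off between cache size and remote accesses.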


“…Among DNNs, convolutional neural networks (CNNs), one of the representative algorithms of deep learning, are a specialized kind of feedforward neural network with deep structure and convolution computation, and have been tremendously successful in computer vision applications such image recognition and image segmentation because of its smart use of strategies including sparse interactions, parameter sharing and equivariant representations [5,6]. Currently, deep CNNs are driven by high-performance processors such as graphics processing unit (GPU) and tensor processing unit (TPU) for performing a large number of computations such as addition and multiplication [7], which need huge computation time and energy resources. However, as the Moore's law approaches the limits of physics, electronic chips will be hard to keep up with performance growth of the artificial intelligence.…”

Section: Introduction (mentioning)

confidence: 99%

Compact lensless optoelectronic convolutional neural network for image classification

Zhang, Da, Kong, et al. 2023

Fourteenth International Conference on Information Optics and Photonics (CIOP 2023)

