Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2022
DOI: 10.1145/3534678.3539070
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters

Cited by 18 publications (4 citation statements) · References 16 publications
“…Other works directly utilize native key-value hash tables to allow dynamic growth of table size [12,15,20,21]. These implementations build upon TensorFlow but rely on either specially designed software mechanisms [14,15,20] or hardware [21] to access and manage their hash tables. Compared to these solutions, Monolith's hash table is yet another native TensorFlow operation.…”
Section: Related Work (mentioning)
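The approach described in this excerpt, backing an embedding table with a key-value hash table so it can grow as new feature IDs appear, can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not Monolith's or TensorFlow's actual implementation; the class name and initialization scheme are made up for illustration.

```python
import numpy as np

class DynamicEmbeddingTable:
    """Toy key-value embedding table that grows as new feature IDs arrive."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.table = {}                      # feature id -> embedding vector
        self.rng = np.random.default_rng(seed)

    def lookup(self, feature_ids):
        rows = []
        for fid in feature_ids:
            if fid not in self.table:
                # The table grows on first sight of a key instead of being
                # pre-allocated as a fixed-size dense matrix.
                self.table[fid] = self.rng.normal(0, 0.01, self.dim).astype(np.float32)
            rows.append(self.table[fid])
        return np.stack(rows)

table = DynamicEmbeddingTable(dim=8)
print(table.lookup([42, 7, 42]).shape)       # (3, 8); only two distinct keys stored
```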
“…To support online updates and avoid memory issues, both [12] and [20] designed feature eviction mechanisms to flexibly adjust the size of embedding tables. Both [12] and [14] support some form of online training, where learned parameters are synced to serving at a relatively short interval compared to traditional batch training, with fault-tolerance mechanisms. Monolith takes a similar approach to elastically admit and evict features, while it has a more lightweight parameter synchronization mechanism to guarantee model quality.…”
Section: Related Work (mentioning)
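The admission and eviction mechanisms mentioned above can be sketched as a frequency-based admission filter combined with time-based eviction of stale features. The policy, thresholds, and class below are illustrative assumptions, not the actual mechanism of any of the cited systems.

```python
import time

class FeatureAdmissionEviction:
    """Toy policy: admit a feature once it appears often enough, evict it when stale."""

    def __init__(self, admit_after=5, expire_seconds=3600.0):
        self.counts = {}         # feature id -> occurrences seen so far
        self.last_seen = {}      # feature id -> timestamp of last access
        self.embeddings = {}     # admitted feature id -> embedding (placeholder)
        self.admit_after = admit_after
        self.expire_seconds = expire_seconds

    def observe(self, fid):
        self.counts[fid] = self.counts.get(fid, 0) + 1
        self.last_seen[fid] = time.time()
        # Admit a feature only after it has been seen admit_after times,
        # so rare IDs never consume embedding memory.
        if fid not in self.embeddings and self.counts[fid] >= self.admit_after:
            self.embeddings[fid] = [0.0] * 8   # placeholder vector

    def evict_stale(self):
        now = time.time()
        stale = [f for f, t in self.last_seen.items() if now - t > self.expire_seconds]
        for f in stale:
            # Dropping stale features keeps the table size bounded over time.
            self.embeddings.pop(f, None)
            self.counts.pop(f, None)
            self.last_seen.pop(f, None)
```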
“…While the convolution operation can improve traditional networks through strategies such as sparse interactions, parameter sharing, and equivariant representations, it also brings high computational overhead and long training times [9,10]. Furthermore, the number of parameters (on the order of millions) and of calculations (on the order of billions) grows exponentially as the network structure deepens, making these problems more severe and aggravating the demand for high-performance training environments [11]. For this purpose, major hardware companies have developed processing units specialized for training large neural networks, such as Huawei's neural network processing unit, Google's tensor processing unit, and ATI's video processing unit, which achieve significant improvements in large-scale parallel computing compared with general-purpose processors [12].…”
Section: Introduction (mentioning)
“…Many recent advances in deep learning have been attributed to significant increases in model size to hundreds of billions of parameters and training on ever-growing datasets [5,31,32,45]. Recent studies suggest that a trillion-parameter model would require at least 2 TB of memory simply to store model parameters, and tens or hundreds of TB for training [18,24,37,38,42]. Naturally, scaling large-model training has received intense attention over the past few years [3,11,29,45,53].…”
Section: Introduction (mentioning)
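The memory figures quoted here can be checked with back-of-the-envelope arithmetic: a trillion parameters stored at 2 bytes each (fp16) already take about 2 TB, and training state multiplies that several times before activations are counted. The per-parameter breakdown below (mixed-precision Adam-style training, ~16 bytes per parameter) is an illustrative assumption, not a figure taken from the cited studies.

```python
params = 1e12                     # one trillion parameters
weights_tb = params * 2 / 1e12    # fp16 weights alone: ~2 TB

# Assumed breakdown for mixed-precision Adam-style training:
# fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
# + two fp32 Adam moments (4 + 4) = 16 bytes per parameter.
training_tb = params * (2 + 2 + 4 + 4 + 4) / 1e12

print(f"weights only: {weights_tb:.0f} TB, training state: {training_tb:.0f} TB")
```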